View Single Post
Old 01-12-2004, 01:44 PM   #17
Edomondo
Orange Mole
 
Edomondo's Avatar
 
Join Date: Jan 2004
Location: In outer space
Posts: 37
Quote:
Originally posted by Charter
If these encodings do use the characters bewteen ~ and _ then you might try the following.
PHP Code:
$phpdig_words_chars['EUC-JP'] = '[:alnum:]€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúûüý';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúû'
Thank you Charter!

These encodings actually use characters from ~ to _.
This is how I built these strings:
EUC-JP uses character from ASCII 64 to 254 except 127 and 142.
Shift_JIS uses character from ASCII 64 to 252 except 127.
These characters are either in first or second position in multibyte characters.

What do you mean by "bad character"?

BTW, I found the full version of Jcode (the previous one was a Light Edition) at http://www.spencernetwork.org/jcode/.
It has support for UTF-8, but I haven't been able to make it work. Each characters are replaced by ? in the source. I think it must come from an incompatibility in the server. That doesn't work locally neither (on an apache emulator)

Apparently, all what I need to do to index multi-byte words is to change phpdigEpureText for the Japanese encodings to replace the separators by a space and to convert the text to the good encoding.
I haven't tried to implement it in phpdig though, but phpdigEpureText was why no word was indexed in the Japanese pages. I tested this function alone and it returned less than 2 letter words for a Japanese input.

Last edited by Edomondo; 01-12-2004 at 01:49 PM.
Edomondo is offline   Reply With Quote