Quote:
Originally posted by Charter
If these encodings do use the characters between ~ and _ then you might try the following.
PHP Code:
$phpdig_words_chars['EUC-JP'] = '[:alnum:]€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúûüý';
$phpdig_words_chars['Shift_JIS'] = '[:alnum:]€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúû';
Thank you Charter!
These encodings actually use characters from ~ to _.
This is how I built these strings:
EUC-JP uses byte values from 64 to 254, except 127 and 142.
Shift_JIS uses byte values from 64 to 252, except 127.
These characters appear in either the first or the second position of a multibyte character.
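In case anyone wants to regenerate these strings, here is a small sketch (in Python, just for illustration; the `build_chars` helper name is mine) that produces the ranges described above, treating each byte value as the corresponding single-byte character, the same way the PHP strings do:

```python
def build_chars(lo, hi, skip=()):
    """Join the characters for byte values lo..hi, skipping the listed bytes.

    chr(b) for b in 0..255 maps each byte value to the matching
    Latin-1 character, which is how the bytes appear in the PHP
    strings quoted above.
    """
    return ''.join(chr(b) for b in range(lo, hi + 1) if b not in skip)

# EUC-JP: byte values 64..254, except 127 and 142
euc_jp_chars = '[:alnum:]' + build_chars(64, 254, skip=(127, 142))

# Shift_JIS: byte values 64..252, except 127
shift_jis_chars = '[:alnum:]' + build_chars(64, 252, skip=(127,))
```

The resulting strings can then be pasted into the `$phpdig_words_chars` entries without typing each character by hand.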
What do you mean by "bad character"?
BTW, I found the full version of Jcode (the previous one was a Light Edition) at
http://www.spencernetwork.org/jcode/.
It has support for UTF-8, but I haven't been able to make it work: each character is replaced by ? in the source. I think it must come from an incompatibility with the server. It doesn't work locally either (on an Apache emulator).
Apparently, all I need to do to index multibyte words is to change phpdigEpureText for the Japanese encodings so that it replaces the separators with a space and converts the text to the correct encoding.
I haven't tried to implement this in PhpDig yet, but phpdigEpureText was why no words were being indexed in the Japanese pages: when I tested the function on its own, it returned only words shorter than two letters for Japanese input.
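The separators-to-spaces idea can be sketched like this (a hypothetical illustration in Python, not PhpDig's actual phpdigEpureText; the separator set and the function name are my own assumptions):

```python
import re

# Hypothetical sketch: strip ASCII separators so the remaining
# multibyte words end up space-delimited, then convert the text to
# the target Japanese encoding.
SEPARATORS = re.compile(r"[\s.,;:!?'\"()\[\]{}<>|/\\=+*%#@&^~-]+")

def epure_text_sketch(text, target_encoding='euc_jp'):
    spaced = SEPARATORS.sub(' ', text)
    # errors='replace' keeps the sketch from crashing on characters
    # the target encoding cannot represent
    return spaced.strip().encode(target_encoding, errors='replace')
```

For example, `epure_text_sketch('日本語,テスト!')` yields the EUC-JP bytes for the two words separated by a single space, with the ASCII punctuation gone.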