01-11-2004, 01:48 PM | #16 | |
Head Mole
Join Date: May 2003
Posts: 2,539
Quote:
PHP Code:
PHP Code:
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
01-12-2004, 01:44 PM | #17 | |
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
Quote:
These encodings actually use characters from ~ to _. This is how I built these strings: EUC-JP uses characters from ASCII 64 to 254, except 127 and 142. Shift_JIS uses characters from ASCII 64 to 252, except 127. These characters appear in either the first or second position of multi-byte characters. What do you mean by "bad character"? BTW, I found the full version of Jcode (the previous one was a Light Edition) at http://www.spencernetwork.org/jcode/. It has support for UTF-8, but I haven't been able to make it work. Each character is replaced by ? in the source. I think it must come from an incompatibility in the server. It doesn't work locally either (on an Apache emulator). Apparently, all I need to do to index multi-byte words is to change phpdigEpureText for the Japanese encodings so that it replaces the separators with a space and converts the text to the correct encoding. I haven't tried to implement it in phpdig yet, but phpdigEpureText was why no words were indexed in the Japanese pages. I tested this function alone and, for Japanese input, it returned only words shorter than two letters.
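The byte ranges described above can be written as small helper checks. This is only a sketch of the post's figures (the function names are made up, and the ranges follow the post rather than the official encoding tables):

```php
<?php
// Hypothetical helpers mirroring the byte ranges given above
// (the post's figures, not the official encoding specifications).
function in_eucjp_range($byte) {
    $o = ord($byte);
    // EUC-JP: ASCII 64 to 254, except 127 and 142
    return $o >= 64 && $o <= 254 && $o !== 127 && $o !== 142;
}

function in_sjis_range($byte) {
    $o = ord($byte);
    // Shift_JIS: ASCII 64 to 252, except 127
    return $o >= 64 && $o <= 252 && $o !== 127;
}
```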
__________________
Uchû Senshi Edomondo http://www.leijiverse.com http://shonen-kokoro.fr.st http://tsukanomanoharu.fr.st Last edited by Edomondo; 01-12-2004 at 01:49 PM. |
01-14-2004, 05:16 AM | #18 |
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
I'm now facing another problem in stripping strings.
For example, ¥³¡¼¥Ê¡¼¤â¤¢¤ê¤Þ¤¹ (in EUC-JP) is made of 9 multi-byte characters: ¥³ ¡¼ ¥Ê ¡¼ ¤â ¤¢ ¤ê ¤Þ ¤¹. But the script I wrote considers ¢¤ a separator and replaces it with a space, even though it is the end of ¤¢ and the beginning of ¤ê (two multi-byte characters). The script returns: ¥³¡¼¥Ê¡¼¤â¤ ê¤Þ¤¹, which leaves the end of the string as nonsense. The only way to get rid of this bug would be to check every two bytes to see whether they form a multi-byte character, and replace them with a space only if they are a separator. But wouldn't such a script be too time-consuming? Any idea on how to achieve this?
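One way to sidestep the boundary problem is to consume the string one character at a time, taking two bytes whenever the current byte can start a multi-byte character. A minimal sketch, assuming EUC-JP with lead bytes in 0xA1-0xFE (the function name and separator list are hypothetical, not PhpDig code):

```php
<?php
// Replace separator characters with a space while walking the string
// character by character, so a two-byte character is never split.
// Assumes EUC-JP, where a byte in 0xA1-0xFE starts a two-byte char.
function replace_separators_eucjp($text, array $separators) {
    $out = '';
    $len = strlen($text);
    for ($i = 0; $i < $len; ) {
        $o = ord($text[$i]);
        if ($o >= 0xA1 && $o <= 0xFE && $i + 1 < $len) {
            $char = substr($text, $i, 2);   // two-byte character
            $i += 2;
        } else {
            $char = $text[$i];              // single-byte character
            $i += 1;
        }
        $out .= in_array($char, $separators, true) ? ' ' : $char;
    }
    return $out;
}
```

Because the walk always consumes whole characters, a byte pair like ¢¤ that merely straddles two characters can never be matched as a separator.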
01-14-2004, 07:43 AM | #19 |
Head Mole
Join Date: May 2003
Posts: 2,539
Hi. Perhaps try the following:
PHP Code:
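The code block from this post did not survive the forum export. Judging from the follow-up discussion, the idea was to replace the ¢¤ separator pair only when it is not preceded by ¤, so the pair is left alone when it straddles two multi-byte characters. A rough sketch of that idea — not the original code (the original apparently used eregi_replace; this uses preg_replace with a lookbehind):

```php
<?php
// Sketch: replace the EUC-JP byte pair 0xA2 0xA4 with a space only
// when the preceding byte is not 0xA4, so the pair that straddles two
// multi-byte characters (e.g. inside 0xA4 0xA2 / 0xA4 0xEA) survives.
function replace_pair_guarded($string) {
    return preg_replace('/(?<!\xA4)\xA2\xA4/', ' ', $string);
}
```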
01-15-2004, 05:07 AM | #20 |
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
Hi. Thank you for the help, I'll test the code you submitted. Though it might not work in some rare cases (when ¤ is actually the character before ¢¤), I think I'll go with it.
BTW, other search engines are based on a dictionary for multi-byte encodings. The dictionary is a text file that contains one word per line. The script extracts the longest matching word from the page text and indexes it. My question is: would it be possible to implement such a dictionary tool in phpdig? If so, I would be happy to build a Japanese dictionary.
01-15-2004, 07:06 AM | #21 |
Head Mole
Join Date: May 2003
Posts: 2,539
Hi. There is a ¤¢¤ combo in the $string variable where ¢¤ is not replaced with a space. Did you mean something else?
>> The script extracts the longest matching word from the page text and indexes it. With the multi-byte dictionary, is it that only the longest matching word from a page gets indexed?
01-15-2004, 08:26 AM | #22 | ||
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
Quote:
The script you submitted uses a regular expression to prevent replacing ¢¤ when the preceding character is ¤, right? I meant that, in the case where the character before ¢¤ really is a multi-byte character ending with ¤, ¢¤ is not replaced. But I think there is little chance of that happening. Quote:
But the dictionary must be as complete as possible to do a good job. Can it be integrated into phpdig?
01-15-2004, 01:25 PM | #23 |
Head Mole
Join Date: May 2003
Posts: 2,539
Hi. Try using mb_eregi_replace in place of eregi_replace, but note that some of the PHP multi-byte functions are experimental.
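A sketch of that swap, assuming the mbstring extension with regex support is compiled in: set the regex encoding first so patterns match whole characters rather than byte fragments (the pattern passed in is a hypothetical separator class, not PhpDig's actual list):

```php
<?php
// Sketch: once the regex encoding is set, mb_eregi_replace matches
// per character, so a pattern can no longer hit byte fragments that
// straddle two multi-byte characters.
mb_regex_encoding('EUC-JP');

function strip_separators($text, $pattern) {
    // $pattern is a hypothetical separator class, e.g. EUC-JP
    // punctuation; replacement happens character by character.
    return mb_eregi_replace($pattern, ' ', $text);
}
```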
As for a dictionary, you might try the following. In spider.php add: PHP Code:
PHP Code:
PHP Code:
PHP Code:
The thing is, of course, to make sure that things that were treated as single-byte are now treated as multi-byte. The $phpdig_words_chars and $phpdig_string_subst variables may need to be treated differently too, so that the characters are seen as multi-byte rather than single-byte. PhpDig was originally written for single-byte use. In theory it seems that it could be converted to multi-byte use, but in practice it's going to take time and tweaking, and hopefully it works in the end.
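The inline code blocks above did not survive the export. As a stand-in, one likely ingredient of such a change is a detect-and-convert step before indexing, so that later string handling sees a single known encoding (the encoding list and function name here are assumptions, not the original snippets):

```php
<?php
// Sketch: normalize page content to one known encoding before any
// string handling, so later functions agree on what a character is.
function normalize_to_eucjp($content) {
    $from = mb_detect_encoding($content, array('ASCII', 'EUC-JP', 'SJIS', 'UTF-8'), true);
    if ($from === false) {
        return $content;            // unknown encoding: leave untouched
    }
    return mb_convert_encoding($content, 'EUC-JP', $from);
}
```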
01-17-2004, 03:48 AM | #24 |
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
Hi. Thank you Charter for your help.
It is working, except that there is still a problem. There are no spaces in multi-byte character encodings. That's why we use a dictionary that contains all the words of a language to extract words from the text. If it were in English, the phrase "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope." would be split using a dictionary containing: "nasa announced yesterday cancelling space shuttle servicing missions hubble space telescope ..." It must also find the longest words first (for example, find the word "yesterday" before the word "day"). I tried to use the strstr() function, but I haven't succeeded. Can anyone help me?
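The longest-match idea above can be sketched as a greedy scan. This uses the single-byte English stand-in from the example (a real Japanese version would need multi-byte-safe substring handling, and the function name is made up):

```php
<?php
// Greedy longest-match segmentation against a word list. Words are
// tried longest-first, so "yesterday" wins over "day"; bytes that
// match nothing are skipped. Dictionary entries assumed lowercase.
function segment($text, array $dictionary) {
    usort($dictionary, function ($a, $b) { return strlen($b) - strlen($a); });
    $text  = strtolower($text);
    $words = array();
    $pos   = 0;
    $len   = strlen($text);
    while ($pos < $len) {
        $matched = false;
        foreach ($dictionary as $word) {
            if (substr($text, $pos, strlen($word)) === $word) {
                $words[] = $word;
                $pos    += strlen($word);
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            $pos++;   // no dictionary word starts here
        }
    }
    return $words;
}
```

The longest-first sort is what keeps "yesterday" from being chopped into smaller dictionary words such as "day".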
01-18-2004, 08:45 AM | #25 |
Head Mole
Join Date: May 2003
Posts: 2,539
Hi. Without seeing the code, you might try using a multi-byte function in place of the strstr function.
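For reference, the difference between the byte-based and character-based functions is easy to see in UTF-8 (this example assumes the mbstring extension is available):

```php
<?php
// strpos counts bytes; mb_strpos counts characters, so its offsets
// can be fed back into mb_substr without landing mid-character.
$haystack   = 'こんにちは世界';                          // 7 characters, 21 bytes in UTF-8
$byteOffset = strpos($haystack, '世界');                 // 15 (byte offset)
$charOffset = mb_strpos($haystack, '世界', 0, 'UTF-8');  // 5 (character offset)
```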
01-21-2004, 05:42 AM | #26 |
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
Thank you. So mb_strpos() seems to be a more sensible choice, but the server where the search engine is hosted for testing doesn't have the multi-byte functions enabled, so I can't check it :-(
But if the dictionary is made correctly, there shouldn't be any problem when using non-multi-byte functions. If I use my previous example, it should index all the words from the dictionary found in the string and take them out of it. At the end, "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope." would become "itis all tothe". The string of unfound words could also be used to define common words or to expand the dictionary. What is the quickest and smartest way to achieve this?
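That leftover-collection step can be sketched like this (English stand-in again; the helper name is made up, and a real version would remove words at scan positions rather than anywhere in the string):

```php
<?php
// Sketch (hypothetical helper): remove every dictionary word from the
// text, longest first, and keep whatever is left over. The leftover
// fragments can then be reviewed as common words or dictionary gaps.
function split_out_unfound($text, array $dictionary) {
    usort($dictionary, function ($a, $b) { return strlen($b) - strlen($a); });
    foreach ($dictionary as $word) {
        // Replace removed words with a space so leftovers stay separated.
        // Note: str_ireplace matches anywhere, including inside other
        // words - acceptable for a sketch, not for production.
        $text = str_ireplace($word, ' ', $text);
    }
    // Collapse runs of spaces; what remains was not in the dictionary.
    return trim(preg_replace('/ +/', ' ', $text));
}
```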
01-22-2004, 01:30 PM | #27 |
Head Mole
Join Date: May 2003
Posts: 2,539
Hi. It sounds like "itis all tothe" could be written to a file of common words. But as for deciding whether a word goes to the common file or to the dictionary file, that seems to need yet another dictionary so that a script could determine which file to write to, unless you were to set some sort of parameter, such as length, pattern, etcetera, so the script would know where to write the phrase.
01-27-2004, 06:04 AM | #28 |
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
I didn't mean using the unfound words as excluded words automatically, but it may be helpful to see which ones are not part of the dictionary and update it if necessary.
However, Japanese is not as easy as English, and it is just impossible to set a list of words to exclude. I'm still trying to break a sentence into different words using a dictionary, but my main concern is the speed of the script. Any scripts are welcome! ;-)
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
Probleme with japanese search | Paka76 | How-to Forum | 0 | 03-24-2006 04:26 AM |
Please, please, please!!! Troubles with charset!!! | Slayter | Troubleshooting | 0 | 12-21-2005 08:37 AM |
Japanese characters on an English page | Shdwdrgn | Troubleshooting | 1 | 03-15-2005 08:28 AM |
Small fix for Japanese indexing | Edomondo | Mod Submissions | 1 | 02-05-2005 12:40 AM |
Help!How to support gb2312 charset? | peterhou | How-to Forum | 1 | 01-16-2005 12:42 PM |