PhpDig.net

Go Back   PhpDig.net > PhpDig Forums > How-to Forum

Old 01-11-2004, 01:48 PM   #16
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Quote:
PHP Code:
$phpdig_words_chars['EUC-JP'] = '[:alnum:]@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúûüý';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúû'
There is a range of characters from ~ to _ that is unassigned, so if these encodings do not use those characters, you might try the following.
PHP Code:
$phpdig_words_chars['EUC-JP'] = '[:alnum:]_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúûüý';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúû'
If these encodings do use the characters between ~ and _, then you might try the following.
PHP Code:
$phpdig_words_chars['EUC-JP'] = '[:alnum:]€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúûüý';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúû'
In either case, though, the Latin letters shouldn't need to be included, as [:alnum:] already covers them. Of course, don't allow "bad" characters.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 01-12-2004, 01:44 PM   #17
Edomondo
Orange Mole
 
 
Join Date: Jan 2004
Location: In outer space
Posts: 37
Quote:
Originally posted by Charter
If these encodings do use the characters between ~ and _, then you might try the following.
PHP Code:
$phpdig_words_chars['EUC-JP'] = '[:alnum:]€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúûüý';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúû'
Thank you Charter!

These encodings actually use characters from ~ to _.
This is how I built these strings:
EUC-JP uses characters from code 64 to 254, except 127 and 142.
Shift_JIS uses characters from code 64 to 252, except 127.
These characters appear in either the first or second position of multi-byte characters.

What do you mean by "bad" characters?

BTW, I found the full version of Jcode (the previous one was a Light Edition) at http://www.spencernetwork.org/jcode/.
It has support for UTF-8, but I haven't been able to make it work: each character is replaced by ? in the source. I think it must come from an incompatibility on the server. It doesn't work locally either (on an Apache emulator).

Apparently, all I need to do to index multi-byte words is to change phpdigEpureText for the Japanese encodings so that it replaces the separators with a space and converts the text to the right encoding.
I haven't tried to implement this in PhpDig yet, but phpdigEpureText was why no words were indexed in the Japanese pages. I tested the function alone and it returned only words of fewer than two letters for a Japanese input.

Last edited by Edomondo; 01-12-2004 at 01:49 PM.
Old 01-14-2004, 05:16 AM   #18
Edomondo
Orange Mole
 
I'm now facing another problem in stripping strings.
Example: ¥³¡¼¥Ê¡¼¤â¤¢¤ê¤Þ¤¹ (in EUC-JP)
is made of 9 multi-byte characters:
¥³ ¡¼ ¥Ê ¡¼ ¤â ¤¢ ¤ê ¤Þ ¤¹
But the script I wrote considers ¢¤ a separator and replaces it with a space, even though it is the end of ¤¢ and the beginning of ¤ê (two multi-byte characters).
The script returns ¥³¡¼¥Ê¡¼¤â¤ ê¤Þ¤¹, which turns the end of the string into nonsense.

The only way to get rid of this bug would be to check every two bytes to see whether they form a multi-byte character, and replace them with a space only if they really are a separator. But wouldn't such a script be too time-consuming? Any idea on how to achieve this?
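The two-bytes-at-a-time check described above can be sketched as follows (in Python, for illustration only; the EUC-JP lead-byte range 0xA1–0xFE is real, but the separator set below is a made-up example, and a full implementation would also handle the 0x8E half-width-kana lead byte):

```python
# Illustrative sketch: walk an EUC-JP byte string left to right, consuming
# two bytes whenever the lead byte is in the multi-byte range, so a
# separator pair is only matched at a true character boundary.

SEPARATORS = {b"\xa1\xa2"}  # hypothetical separator characters

def replace_separators(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if 0xA1 <= b <= 0xFE and i + 1 < len(data):  # EUC-JP lead byte
            pair = data[i:i + 2]
            out += b" " if pair in SEPARATORS else pair
            i += 2
        else:                                        # single-byte (ASCII)
            out.append(b)
            i += 1
    return bytes(out)
```

Because the scan only ever advances by whole characters, a separator byte sequence that straddles two characters can never be matched, which avoids the mid-character split described above. The pass is linear in the string length, so it need not be especially slow.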
Old 01-14-2004, 07:43 AM   #19
Charter
Head Mole
 
Hi. Perhaps try the following:
PHP Code:
<?php
$string = "¥³¡¼¥Ê¡¼¤â¤¢¤ê¤Þ¤¹AND¼¤â¢¤ê¤Þ¤¹";
$string = eregi_replace("([^¤])¢¤", "\\1 ", $string);
echo $string;
// returns ¥³¡¼¥Ê¡¼¤â¤¢¤ê¤Þ¤¹AND¼¤â ê¤Þ¤¹
?>
Also, you may find this tutorial helpful.
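The same boundary trick can be expressed with a regular expression over raw bytes; here is a sketch in Python (the byte values 0xA2 and 0xA4 correspond to ¢ and ¤ in the thread's examples; this only illustrates the pattern, it is not PhpDig code):

```python
import re

def drop_separator(data: bytes) -> bytes:
    # Replace the pair 0xA2 0xA4 with a space only when the byte before it
    # is not 0xA4, i.e. when the pair is not the tail of one multi-byte
    # character followed by the head of the next. The captured byte is
    # kept via the \1 backreference.
    return re.sub(rb"([^\xa4])\xa2\xa4", rb"\1 ", data)
```

As noted later in the thread, this still mis-fires in the rare case where the preceding character legitimately ends in the guarded byte; only a full character-boundary scan avoids that entirely.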
Old 01-15-2004, 05:07 AM   #20
Edomondo
Orange Mole
 
Hi. Thank you for the help; I'll test the code you submitted. Though it might not work in some rare cases (when ¤ really is the character before ¢¤), I think I'll go with it.

BTW, other search engines are based on a dictionary for multi-byte encodings. The dictionary is a text file that contains one word per line. The script extracts the longest matching word from the page text and indexes it.
My question is: would it be possible to implement such a dictionary tool in PhpDig?
If so, I would be happy to build a Japanese dictionary.
Old 01-15-2004, 07:06 AM   #21
Charter
Head Mole
 
Hi. There is a ¤¢¤ combo in the $string variable where ¢¤ is not replaced with a space. Did you mean something else?

>> The script extracts the longest matching word from the page text and indexes it.

With the multi-byte dictionary, is it that only the longest matching word from a page gets indexed?
Old 01-15-2004, 08:26 AM   #22
Edomondo
Orange Mole
 
Quote:
Originally posted by Charter
Hi. There is a ¤¢¤ combo in the $string variable where ¢¤ is not replaced with a space. Did you mean something else?
Errrr... I'm not sure I understand.
The script you submitted uses a regular expression to prevent replacing ¢¤ when the character before it is ¤, right?
I meant that when the character before ¢¤ really is a multi-byte character ending in ¤, ¢¤ is not replaced. But I think that has little chance of happening.

Quote:
Originally posted by Charter
>> The script extracts the longest matching word from the page text and indexes it.

With the mutli-byte dictionary, is it that only the longest matching word from a page gets indexed?
No, of course not; it will extract all the words, comparing the page content against the longest words first. For example, in English it wouldn't extract "nation" from "internationalization" if "internationalization" is in the dictionary.
But the dictionary must be as complete as possible to do a good job.
Can it be integrated into PhpDig?
Old 01-15-2004, 01:25 PM   #23
Charter
Head Mole
 
Hi. Try using mb_eregi_replace in place of eregi_replace, but note that some of the PHP multi-byte functions are experimental.

As for a dictionary, you might try the following. In spider.php add:
PHP Code:
$my_dictionary = phpdigComWords("$relative_script_path/includes/my_dictionary.ext");
after the following:
PHP Code:
$common_words = phpdigComWords("$relative_script_path/includes/common_words.txt");
In robot_functions.php replace:
PHP Code:
if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^[0-9a-zßðþ]',$key)) 
with the following:
PHP Code:
if (mb_strlen($key) > SMALL_WORDS_SIZE and mb_strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and isset($my_dictionary[$key]) and mb_ereg('^['.$phpdig_words_chars[PHPDIG_ENCODING].']',$key)) 
Also apply any other changes given in this thread and use multi-byte functions in place of their single-byte counterparts.

The thing is, of course, to make sure that things that were treated as single-byte are now treated as multi-byte. The $phpdig_words_chars and $phpdig_string_subst variables may need to be treated differently too, so that the characters are seen as multi-byte rather than single-byte.

PhpDig was originally written for single-byte use. In theory it seems it could be converted to multi-byte use, but in practice it's going to take time and tweaking, and in the end hopefully it works.
Old 01-17-2004, 03:48 AM   #24
Edomondo
Orange Mole
 
Hi. Thank you Charter for your help.
It is working, except there is still a problem.
There are no spaces between words in these multi-byte encodings; that's why we use a dictionary that contains all the words of a language to extract words from the text.

If it were in English, the phrase:
"NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope."
would be split using a dictionary containing:
"nasa
announced
yesterday
cancelling
space
shuttle
servicing
missions
hubble
space
telescope
..."

It must also find the longest words first (for example, find the word "yesterday" before the word "day").

I tried to use the strstr() function, but I haven't succeeded. Can anyone help me?
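The longest-words-first requirement is essentially greedy dictionary segmentation. A minimal sketch of the idea (in Python for illustration; the dictionary is a toy example, not a real word list):

```python
def segment(text: str, dictionary: set) -> list:
    """Greedy longest-match segmentation: at each position, take the
    longest dictionary word starting there, otherwise skip one character."""
    words = []
    i, n = 0, len(text)
    while i < n:
        for j in range(n, i, -1):          # try the longest candidate first
            if text[i:j].lower() in dictionary:
                words.append(text[i:j].lower())
                i = j
                break
        else:
            i += 1                         # no dictionary word starts here
    return words
```

With a dictionary containing both "internationalization" and "nation", segmenting "internationalization" yields only the long word, which is the behavior described above. This naive version scans up to the full remaining length at every position; real segmenters use a trie or similar index to keep lookups fast.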
Old 01-18-2004, 08:45 AM   #25
Charter
Head Mole
 
Hi. Without seeing the code, you might try using a multi-byte function in place of the strstr function.
Old 01-21-2004, 05:42 AM   #26
Edomondo
Orange Mole
 
Thank you. So mb_strpos() seems to be the more sensible choice, but the server where the search engine is hosted for testing doesn't have the multi-byte functions enabled, so I can't check it :-(

But if the dictionary is made correctly, there shouldn't be any problem using non-multi-byte functions.

If I use my previous example, the script should index every word from the dictionary found in the string and take it out of the string. In the end, "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope." would become "itis all tothe".

The string of unfound words can also be used to define common words or to expand the dictionary.

What is the quickest and smartest way to achieve this?
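The "index what matches, keep the leftovers" step could be sketched like this (Python, illustrative only; the order of the found words depends on dictionary iteration order, and a real implementation would need the multi-byte-safe matching discussed earlier in the thread):

```python
def extract(text: str, dictionary: set):
    """Remove dictionary words longest-first; return (found, leftover)."""
    found = []
    for word in sorted(dictionary, key=len, reverse=True):
        while word in text.lower():
            pos = text.lower().index(word)
            found.append(word)
            # replace the matched word with a space so the surrounding
            # fragments do not join into a spurious new word
            text = text[:pos] + " " + text[pos + len(word):]
    return found, " ".join(text.split())
```

Matching longest-first is what keeps "day" from eating a piece of "yesterday". The leftover string is exactly the material that could be reviewed for missing dictionary entries, as suggested above.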
Old 01-22-2004, 01:30 PM   #27
Charter
Head Mole
 
Hi. It sounds like "itis all tothe" could be written to a file of common words, but as for writing words to a common-words file versus a dictionary file, that seems to need yet another dictionary so that a script could determine which file to write to, unless you set some parameter such as length/pattern/etc. so the script knows where to write the phrase.
Old 01-27-2004, 06:04 AM   #28
Edomondo
Orange Mole
 
I didn't mean using the unfound words as excluded words automatically, but it may be helpful to see which ones are not part of the dictionary and update it if necessary.

However, Japanese is not as easy as English, and it is simply impossible to set a list of words to exclude.

I'm still trying to break a sentence into different words using a dictionary. But my main concern is the speed of the script.

Any scripts are welcome! ;-)

