PhpDig.net (http://www.phpdig.net/forum/index.php)
-   How-to Forum (http://www.phpdig.net/forum/forumdisplay.php?f=33)
-   -   Japanese encoding : charset=shift_jis (http://www.phpdig.net/forum/showthread.php?t=355)

Charter 01-11-2004 01:48 PM

Quote:

PHP Code:

$phpdig_words_chars['EUC-JP'] = '[:alnum:]@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüý';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúû';


There is a range of characters between ~ and _ (the 0x7F-0x9F block, unassigned in Latin-1), so if these encodings do not use those characters, you might try the following.
PHP Code:

$phpdig_words_chars['EUC-JP'] = '[:alnum:]_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüý';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúû';

If these encodings do use the characters between ~ and _, then you might try the following.
PHP Code:

$phpdig_words_chars['EUC-JP'] = '[:alnum:]€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüý';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúû';

In either case, though, the Latin letters shouldn't need to be included, since [:alnum:] already covers them. Of course, don't allow "bad" characters.
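
For reference, an entry like this is meant to be dropped into a POSIX bracket expression, which is why [:alnum:] makes the explicit Latin letters redundant. A minimal sketch of that usage (the pattern mirrors the mb_ereg call later in this thread; $word is just a sample value):
PHP Code:

<?php
// A $phpdig_words_chars entry sits inside a POSIX bracket expression,
// so [:alnum:] already matches the Latin letters and digits.
$word = 'test';
$pattern = '^[' . $phpdig_words_chars['EUC-JP'] . ']+$';
if (ereg($pattern, $word)) {
    // $word consists only of allowed characters
}
?>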

Edomondo 01-12-2004 01:44 PM

Quote:

Originally posted by Charter
If these encodings do use the characters between ~ and _, then you might try the following.
PHP Code:

$phpdig_words_chars['EUC-JP'] = '[:alnum:]€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüý';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúû';


Thank you Charter!

These encodings actually do use the characters from ~ to _.
This is how I built these strings:
EUC-JP uses byte values from 64 to 254, except 127 and 142.
Shift_JIS uses byte values from 64 to 252, except 127.
These bytes appear in either the first or the second position of a multi-byte character.
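
Going by those ranges, the strings could also be generated rather than typed by hand. A minimal sketch, assuming the ranges given above (build_byte_range is a hypothetical helper, not a PhpDig function):
PHP Code:

<?php
// Generate a byte-range string like the ones above instead of pasting
// raw high-bit characters into the source. Note: bytes such as ] ^ -
// fall in these ranges and have special meaning inside a bracket
// expression, so they may need escaping in real use.
function build_byte_range($from, $to, $skip) {
    $chars = '';
    for ($i = $from; $i <= $to; $i++) {
        if (!in_array($i, $skip)) {
            $chars .= chr($i);
        }
    }
    return $chars;
}

$phpdig_words_chars['EUC-JP']    = '[:alnum:]' . build_byte_range(64, 254, array(127, 142));
$phpdig_words_chars['Shift_JIS'] = '[:alnum:]' . build_byte_range(64, 252, array(127));
?>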

What do you mean by "bad character"?

BTW, I found the full version of Jcode (the previous one was a Light Edition) at http://www.spencernetwork.org/jcode/.
It has support for UTF-8, but I haven't been able to make it work: every character is replaced by ? in the source. I think it must come from an incompatibility on the server. It doesn't work locally either (on an Apache emulator).

Apparently, all I need to do to index multi-byte words is to change phpdigEpureText for the Japanese encodings so that it replaces the separators with a space and converts the text to the right encoding.
I haven't tried to implement it in PhpDig yet, but phpdigEpureText was why no words were indexed on the Japanese pages. I tested the function on its own, and for Japanese input it returned only words of fewer than two letters.

Edomondo 01-14-2004 05:16 AM

I'm now facing another problem in stripping strings.
For example, ¥³¡¼¥Ê¡¼¤â¤¢¤ê¤Þ¤¹ (in EUC-JP)
is made of 9 multi-byte characters:
¥³ ¡¼ ¥Ê ¡¼ ¤â ¤¢ ¤ê ¤Þ ¤¹
But the script I wrote treats ¢¤ as a separator and replaces it with a space, even though here it is the end of ¤¢ and the beginning of ¤ê (two multi-byte characters).
The script returns ¥³¡¼¥Ê¡¼¤â¤ ê¤Þ¤¹, which turns the end of the string into nonsense.

The only way to get rid of this bug would be to step through the string one character (two bytes) at a time, check whether each pair really is a multi-byte character, and replace it with a space only when it is a separator. But wouldn't such a script be too time-consuming? Any idea on how to achieve this?
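
For what it's worth, a rough sketch of that character-by-character scan, assuming EUC-JP, where both bytes of a double-byte character are in the 0xA1-0xFE range (the function name and the $separators list are hypothetical):
PHP Code:

<?php
// Walk the string one character at a time, so a separator can only
// match on a character boundary, never across two characters.
function phpdig_strip_separators($text, $separators) {
    $out = '';
    $len = strlen($text);
    for ($i = 0; $i < $len; ) {
        if (ord($text[$i]) >= 0xA1 && $i + 1 < $len) {
            $pair = substr($text, $i, 2);      // one double-byte character
            $out .= in_array($pair, $separators) ? ' ' : $pair;
            $i += 2;
        } else {
            $out .= $text[$i];                 // plain single-byte character
            $i++;
        }
    }
    return $out;
}
?>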

Charter 01-14-2004 07:43 AM

Hi. Perhaps try the following:
PHP Code:

<?php
$string = "¥³¡¼¥Ê¡¼¤â¤¢¤ê¤Þ¤¹AND¼¤â¢¤ê¤Þ¤¹";
$string = eregi_replace("([^¤])¢¤", "\\1 ", $string);
echo $string;
// returns ¥³¡¼¥Ê¡¼¤â¤¢¤ê¤Þ¤¹AND¼¤â ê¤Þ¤¹
?>

Also, you may find this tutorial helpful.

Edomondo 01-15-2004 05:07 AM

Hi. Thank you for the help; I'll test the code you submitted. Though it might not work in some rare cases (when ¤ really is the byte just before ¢¤), I think I'll go for it.

BTW, other search engines rely on a dictionary for multi-byte encodings. The dictionary is a txt file that contains one word per line. The script extracts the longest matching word from the page text and indexes it.
My question is: would it be possible to implement such a dictionary tool in PhpDig?
If so, I would be happy to build a Japanese dictionary.
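
For what it's worth, loading such a one-word-per-line file into a lookup array only takes a few lines (dictionary.txt is a placeholder name):
PHP Code:

<?php
// Read a one-word-per-line dictionary file into an array keyed by
// word, for fast isset() lookups later on.
$words = array();
foreach (file('dictionary.txt') as $line) {
    $word = trim($line);
    if ($word != '') {
        $words[$word] = 1;
    }
}
?>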

Charter 01-15-2004 07:06 AM

Hi. There is a ¤¢¤ combo in the $string variable where ¢¤ is not replaced with a space. Did you mean something else?

>> The script extracts the longest matching word from the page text and indexes it.

With the multi-byte dictionary, is it that only the longest matching word from a page gets indexed?

Edomondo 01-15-2004 08:26 AM

Quote:

Originally posted by Charter
Hi. There is a ¤¢¤ combo in the $string variable where ¢¤ is not replaced with a space. Did you mean something else?
Errrr... I'm not sure I understand.
The script you submitted uses a regular expression to avoid replacing ¢¤ when the preceding byte is ¤, right?
I meant that when the character before ¢¤ really is a multi-byte character ending in ¤, ¢¤ is not replaced. But I think that has little chance of happening.

Quote:

Originally posted by Charter
>> The script extracts the longest matching word from the page text and indexes it.

With the multi-byte dictionary, is it that only the longest matching word from a page gets indexed?

No, of course not: it will extract all the words, comparing the page content against the longest words first. For example, in English it wouldn't extract "nation" from "internationalization" if "internationalization" is in the dictionary.
But the dictionary must be as complete as possible to do a good job.
Can it be integrated into PhpDig?

Charter 01-15-2004 01:25 PM

Hi. Try using mb_eregi_replace in place of eregi_replace, but note that some of the PHP multi-byte functions are experimental.
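
Something along these lines, assuming the mbstring extension is compiled in and the text really is EUC-JP (a sketch, not tested here):
PHP Code:

<?php
// With the regex encoding set to EUC-JP, patterns are matched per
// character, so a byte pair can no longer match across the boundary
// between two multi-byte characters.
mb_regex_encoding('EUC-JP');
$string = mb_eregi_replace('¢¤', ' ', $string);
?>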

As for a dictionary, you might try the following. In spider.php add:
PHP Code:

$my_dictionary = phpdigComWords("$relative_script_path/includes/my_dictionary.ext");

after the following:
PHP Code:

$common_words = phpdigComWords("$relative_script_path/includes/common_words.txt");

In robot_functions.php replace:
PHP Code:

if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^[0-9a-zßðþ]',$key)) 

with the following:
PHP Code:

if (mb_strlen($key) > SMALL_WORDS_SIZE and mb_strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and isset($my_dictionary[$key]) and mb_ereg('^['.$phpdig_words_chars[PHPDIG_ENCODING].']',$key)) 

Also apply any other changes given in this thread and use multi-byte functions in place of their single-byte counterparts.

The thing is, of course, to make sure that everything that was treated as single-byte is now treated as multi-byte. The $phpdig_words_chars and $phpdig_string_subst variables may need to be treated differently too, so that the characters are seen as multi-byte rather than single-byte.

PhpDig was originally written for single-byte use. In theory it seems possible to convert it to multi-byte use, but in practice it's going to take time and tweaking, and hopefully in the end it works.
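
For reference, these are the usual substitutions meant here; all of the mb_* functions below exist in PHP's mbstring extension, though whether they are enabled depends on how PHP was compiled:
PHP Code:

<?php
// Single-byte functions and their multi-byte counterparts.
mb_internal_encoding('EUC-JP');          // or 'Shift_JIS', per site
$len = mb_strlen($key);                  // instead of strlen($key)
$pos = mb_strpos($text, $word);          // instead of strpos($text, $word)
$sub = mb_substr($text, 0, 10);          // instead of substr($text, 0, 10)
$low = mb_strtolower($text);             // instead of strtolower($text)
?>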

Edomondo 01-17-2004 03:48 AM

Hi. Thank you Charter for your help.
It is working, except that there is still a problem.
There are no spaces between words in these multi-byte encodings. That's why we use a dictionary that contains all the words of a language to extract the words from the text.

If it were in English, the phrase:
"NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope."
would be split using a dictionary containing:
"nasa
announced
yesterday
cancelling
space
shuttle
servicing
missions
hubble
space
telescope
..."

It must also find the longest words first (for example, find the word "yesterday" before the word "day").

I tried to use the strstr() function, but I haven't succeeded. Can anyone help me?
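
One naive way to do the longest-match-first extraction with plain single-byte string functions; a sketch only, with a hypothetical extract_words helper and a toy dictionary:
PHP Code:

<?php
// Greedy longest-match-first segmentation: sort the dictionary by
// length so "yesterday" is tried before "day", then blank out each
// match so shorter words can't match inside it afterwards.
function by_length_desc($a, $b) { return strlen($b) - strlen($a); }

function extract_words($text, $dictionary) {
    $text  = strtolower($text);
    $found = array();
    usort($dictionary, 'by_length_desc');
    foreach ($dictionary as $word) {
        while (($pos = strpos($text, $word)) !== false) {
            $found[] = $word;
            $text = substr_replace($text, ' ', $pos, strlen($word));
        }
    }
    return array($found, trim($text));   // indexed words + leftover residue
}

list($words, $rest) = extract_words(
    "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope.",
    array("nasa", "announced", "yesterday", "cancelling", "space",
          "shuttle", "servicing", "missions", "hubble", "telescope"));
// $rest is roughly "itis all tothe ." here, as in the example above.
?>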

Charter 01-18-2004 08:45 AM

Hi. Without seeing the code, you might try using a multi-byte function in place of the strstr function.

Edomondo 01-21-2004 05:42 AM

Thank you. So mb_strpos() seems to be the more sensible choice, but the server where the search engine is hosted for testing doesn't have the multi-byte functions enabled, so I can't check it :-(

But if the dictionary is built correctly, there shouldn't be any problem even when using non-multi-byte functions.

If I use my previous example, it should index all the dictionary words found in the string and take them out of it. At the end, "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope." would become "itis all tothe".

The string of unmatched words can also be used to define common words or to expand the dictionary.

What is the quickest and smartest way to achieve this?

Charter 01-22-2004 01:30 PM

Hi. It sounds like "itis all tothe" could be written to a file of common words. But as for writing words to a common file versus a dictionary file, that seems to need yet another dictionary so that a script could determine which file to write to, unless you set some sort of parameter (length, pattern, etcetera) so the script knows where to write each phrase.
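
For the record, appending the leftover chunks to a review file would only take a few lines; my_leftovers.txt is an assumed name, in the spirit of common_words.txt:
PHP Code:

<?php
// Append leftover chunks, one per line, to a file for manual review;
// the file name and location are assumptions.
$leftover = "itis all tothe";
$fp = fopen("$relative_script_path/includes/my_leftovers.txt", 'a');
foreach (explode(' ', $leftover) as $chunk) {
    if ($chunk != '') {
        fwrite($fp, $chunk . "\n");
    }
}
fclose($fp);
?>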

Edomondo 01-27-2004 06:04 AM

I didn't mean using the unmatched words as automatically excluded words, but it may be helpful to see which ones are not part of the dictionary and to update it if necessary.

However, Japanese is not as easy as English, and it is just impossible to set a list of words to exclude.

I'm still trying to break a sentence into separate words using a dictionary, but my main concern is the speed of the script.

Any scripts are welcome! ;-)

