PhpDig.net

Go Back   PhpDig.net > PhpDig Forums > How-to Forum

Old 01-11-2004, 01:48 PM   #16
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Quote:
PHP Code:
$phpdig_words_chars['EUC-JP'] = '[:alnum:]@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúûüý';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúû'
There is a range of characters from ~ to _ that is unassigned, so if these encodings do not use those characters, you might try the following.
PHP Code:
$phpdig_words_chars['EUC-JP'] = '[:alnum:]_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúûüý';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúû'
If these encodings do use the characters between ~ and _, then you might try the following.
PHP Code:
$phpdig_words_chars['EUC-JP'] = '[:alnum:]€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúûüý';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúû'
In either case, though, the Latin letters shouldn't need to be included, as [:alnum:] already covers them. Of course, don't allow "bad" characters.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 01-12-2004, 01:44 PM   #17
Edomondo
Orange Mole
 
 
Join Date: Jan 2004
Location: In outer space
Posts: 37
Quote:
Originally posted by Charter
If these encodings do use the characters between ~ and _, then you might try the following.
PHP Code:
$phpdig_words_chars['EUC-JP'] = '[:alnum:]€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúûüý';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúû'
Thank you Charter!

These encodings actually use characters from ~ to _.
This is how I built these strings:
EUC-JP uses characters from code 64 to 254, except 127 and 142.
Shift_JIS uses characters from code 64 to 252, except 127.
These characters appear in either the first or second position of multi-byte characters.

What do you mean by "bad" characters?

BTW, I found the full version of Jcode (the previous one was a Light Edition) at http://www.spencernetwork.org/jcode/.
It has support for UTF-8, but I haven't been able to make it work: each character is replaced by ? in the source. I think it must come from an incompatibility on the server. It doesn't work locally either (on an Apache emulator).

Apparently, all I need to do to index multi-byte words is to change phpdigEpureText for the Japanese encodings so that it replaces the separators with a space and converts the text to the right encoding.
I haven't tried to implement this in PhpDig yet, but phpdigEpureText was why no words were indexed in the Japanese pages. I tested the function alone and it returned only words of fewer than two letters for a Japanese input.

Last edited by Edomondo; 01-12-2004 at 01:49 PM.
Old 01-14-2004, 05:16 AM   #18
Edomondo
Orange Mole
 
I'm now facing another problem in stripping strings.
Example: ¥³¡¼¥Ê¡¼¤â¤¢¤ê¤Þ¤¹ (in EUC-JP)
is made of 9 multi-byte characters:
¥³ ¡¼ ¥Ê ¡¼ ¤â ¤¢ ¤ê ¤Þ ¤¹
But the script I wrote considers ¢¤ a separator and replaces it with a space, even though it is the end of ¤¢ and the beginning of ¤ê (two multi-byte characters).
The script returns ¥³¡¼¥Ê¡¼¤â¤ ê¤Þ¤¹, which turns the end of the string into nonsense.

The only way to get rid of this bug would be to check every two bytes to see whether they form a multi-byte character, and replace them with a space only if they really are a separator. But wouldn't such a script be too time-consuming? Any idea on how to achieve this?
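The two-bytes-at-a-time check described above can be sketched as follows (in Python, for illustration only; the EUC-JP lead-byte range 0xA1–0xFE is real, but the separator set below is a made-up example, and a full implementation would also handle the 0x8E half-width-kana lead byte):

```python
# Illustrative sketch: walk an EUC-JP byte string left to right, consuming
# two bytes whenever the lead byte is in the multi-byte range, so a
# separator pair is only matched at a true character boundary.

SEPARATORS = {b"\xa1\xa2"}  # hypothetical separator characters

def replace_separators(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if 0xA1 <= b <= 0xFE and i + 1 < len(data):  # EUC-JP lead byte
            pair = data[i:i + 2]
            out += b" " if pair in SEPARATORS else pair
            i += 2
        else:                                        # single-byte (ASCII)
            out.append(b)
            i += 1
    return bytes(out)
```

Because the scan only ever advances by whole characters, a separator byte sequence that straddles two characters can never be matched, which avoids the mid-character split described above. The pass is linear in the string length, so it need not be especially slow.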
Old 01-14-2004, 07:43 AM   #19
Charter
Head Mole
 
Hi. Perhaps try the following:
PHP Code:
<?php
$string = "¥³¡¼¥Ê¡¼¤â¤¢¤ê¤Þ¤¹AND¼¤â¢¤ê¤Þ¤¹";
$string = eregi_replace("([^¤])¢¤", "\\1 ", $string);
echo $string;
// returns ¥³¡¼¥Ê¡¼¤â¤¢¤ê¤Þ¤¹AND¼¤â ê¤Þ¤¹
?>
Also, you may find this tutorial helpful.
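The same boundary trick can be expressed with a regular expression over raw bytes; here is a sketch in Python (the byte values 0xA2 and 0xA4 correspond to ¢ and ¤ in the thread's examples; this only illustrates the pattern, it is not PhpDig code):

```python
import re

def drop_separator(data: bytes) -> bytes:
    # Replace the pair 0xA2 0xA4 with a space only when the byte before it
    # is not 0xA4, i.e. when the pair is not the tail of one multi-byte
    # character followed by the head of the next. The captured byte is
    # kept via the \1 backreference.
    return re.sub(rb"([^\xa4])\xa2\xa4", rb"\1 ", data)
```

As noted later in the thread, this still mis-fires in the rare case where the preceding character legitimately ends in the guarded byte; only a full character-boundary scan avoids that entirely.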
Old 01-15-2004, 05:07 AM   #20
Edomondo
Orange Mole
 
Hi. Thank you for the help; I'll test the code you submitted. Though it might not work in some rare cases (when ¤ really is the character before ¢¤), I think I'll go with it.

BTW, other search engines are based on a dictionary for multi-byte encodings. The dictionary is a text file that contains one word per line. The script extracts the longest matching word from the page text and indexes it.
My question is: would it be possible to implement such a dictionary tool in PhpDig?
If so, I would be happy to build a Japanese dictionary.
Old 01-15-2004, 07:06 AM   #21
Charter
Head Mole
 
Hi. There is a ¤¢¤ combo in the $string variable where ¢¤ is not replaced with a space. Did you mean something else?

>> The script extracts the longest matching word from the page text and indexes it.

With the multi-byte dictionary, is it that only the longest matching word from a page gets indexed?
Old 01-15-2004, 08:26 AM   #22
Edomondo
Orange Mole
 
Quote:
Originally posted by Charter
Hi. There is a ¤¢¤ combo in the $string variable where ¢¤ is not replaced with a space. Did you mean something else?
Errrr... I'm not sure I understand.
The script you submitted uses a regular expression to prevent replacing ¢¤ when the character before it is ¤, right?
I meant that when the character before ¢¤ really is a multi-byte character ending in ¤, ¢¤ is not replaced. But I think that has little chance of happening.

Quote:
Originally posted by Charter
>> The script extracts the longest matching word from the page text and indexes it.

With the mutli-byte dictionary, is it that only the longest matching word from a page gets indexed?
No, of course not; it will extract all the words, comparing the page content against the longest words first. For example, in English it wouldn't extract "nation" from "internationalization" if "internationalization" is in the dictionary.
But the dictionary must be as complete as possible to do a good job.
Can it be integrated into PhpDig?
Old 01-15-2004, 01:25 PM   #23
Charter
Head Mole
 
Hi. Try using mb_eregi_replace in place of eregi_replace, but note that some of the PHP multi-byte functions are experimental.

As for a dictionary, you might try the following. In spider.php add:
PHP Code:
$my_dictionary = phpdigComWords("$relative_script_path/includes/my_dictionary.ext");
after the following:
PHP Code:
$common_words = phpdigComWords("$relative_script_path/includes/common_words.txt");
In robot_functions.php replace:
PHP Code:
if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^[0-9a-zßðþ]',$key)) 
with the following:
PHP Code:
if (mb_strlen($key) > SMALL_WORDS_SIZE and mb_strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and isset($my_dictionary[$key]) and mb_ereg('^['.$phpdig_words_chars[PHPDIG_ENCODING].']',$key)) 
Also apply any other changes given in this thread and use multi-byte functions in place of their single-byte counterparts.

The thing is, of course, to make sure that things that were treated as single-byte are now treated as multi-byte. The $phpdig_words_chars and $phpdig_string_subst variables may need to be treated differently too, so that the characters are seen as multi-byte rather than single-byte.

PhpDig was originally written for single-byte use. In theory it seems it could be converted to multi-byte use, but in practice it's going to take time and tweaking, and in the end hopefully it works.
Old 01-17-2004, 03:48 AM   #24
Edomondo
Orange Mole
 
Hi. Thank you Charter for your help.
It is working, except there is still a problem.
There are no spaces between words in these multi-byte encodings; that's why we use a dictionary that contains all the words of a language to extract words from the text.

If it were in English, the phrase:
"NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope."
would be split using a dictionary containing:
"nasa
announced
yesterday
cancelling
space
shuttle
servicing
missions
hubble
space
telescope
..."

It must also find the longest words first (for example, find the word "yesterday" before the word "day").

I tried to use the strstr() function, but I haven't succeeded. Can anyone help me?
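The longest-words-first requirement is essentially greedy dictionary segmentation. A minimal sketch of the idea (in Python for illustration; the dictionary is a toy example, not a real word list):

```python
def segment(text: str, dictionary: set) -> list:
    """Greedy longest-match segmentation: at each position, take the
    longest dictionary word starting there, otherwise skip one character."""
    words = []
    i, n = 0, len(text)
    while i < n:
        for j in range(n, i, -1):          # try the longest candidate first
            if text[i:j].lower() in dictionary:
                words.append(text[i:j].lower())
                i = j
                break
        else:
            i += 1                         # no dictionary word starts here
    return words
```

With a dictionary containing both "internationalization" and "nation", segmenting "internationalization" yields only the long word, which is the behavior described above. This naive version scans up to the full remaining length at every position; real segmenters use a trie or similar index to keep lookups fast.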
Old 01-18-2004, 08:45 AM   #25
Charter
Head Mole
 
Hi. Without seeing the code, you might try using a multi-byte function in place of the strstr function.
Old 01-21-2004, 05:42 AM   #26
Edomondo
Orange Mole
 
Thank you. So mb_strpos() seems to be the more sensible choice, but the server where the search engine is hosted for testing doesn't have the multi-byte functions enabled, so I can't check it :-(

But if the dictionary is made correctly, there shouldn't be any problem using non-multi-byte functions.

If I use my previous example, the script should index every word from the dictionary found in the string and take it out of the string. In the end, "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope." would become "itis all tothe".

The string of unfound words can also be used to define common words or to expand the dictionary.

What is the quickest and smartest way to achieve this?
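The "index what matches, keep the leftovers" step could be sketched like this (Python, illustrative only; the order of the found words depends on dictionary iteration order, and a real implementation would need the multi-byte-safe matching discussed earlier in the thread):

```python
def extract(text: str, dictionary: set):
    """Remove dictionary words longest-first; return (found, leftover)."""
    found = []
    for word in sorted(dictionary, key=len, reverse=True):
        while word in text.lower():
            pos = text.lower().index(word)
            found.append(word)
            # replace the matched word with a space so the surrounding
            # fragments do not join into a spurious new word
            text = text[:pos] + " " + text[pos + len(word):]
    return found, " ".join(text.split())
```

Matching longest-first is what keeps "day" from eating a piece of "yesterday". The leftover string is exactly the material that could be reviewed for missing dictionary entries, as suggested above.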
Old 01-22-2004, 01:30 PM   #27
Charter
Head Mole
 
Hi. It sounds like "itis all tothe" could be written to a file of common words, but as for writing words to a common-words file versus a dictionary file, that seems to need yet another dictionary so that a script could determine which file to write to, unless you set some parameter such as length/pattern/etc. so the script knows where to write the phrase.
Old 01-27-2004, 06:04 AM   #28
Edomondo
Orange Mole
 
I didn't mean using the unfound words as excluded words automatically, but it may be helpful to see which ones are not part of the dictionary and update it if necessary.

However, Japanese is not as easy as English, and it is simply impossible to set a list of words to exclude.

I'm still trying to break a sentence into different words using a dictionary. But my main concern is the speed of the script.

Any scripts are welcome! ;-)

