
View Full Version : Japanese encoding : charset=shift_jis


Edomondo
01-05-2004, 08:28 AM
Hi there! PhpDig is just great!

I'm considering improving the system so that it can spider and display different encodings at the same time, among them the Japanese encodings.

I've read carefully the 3 other topics dedicated to encoding issues.

I've thought of 4 main points:
- There are no spaces in Japanese, so an algorithm can't tell two different words apart, and it seems to be impossible to store keywords in the DB. But I can think of a way to work around it. There are 3 different types of characters in Japanese:
Hiragana (46 signs)
Katakana (46 signs)
Kanji (more than 50,000 signs)
It is possible to tell them apart by referring to the codes of these characters, e.g. Katakana "re" (レ) is encoded with  ¼. If the code of the second byte of the encoding is between x and y, then the character is a Katakana. Same for the others.
But some words contain different types of characters at the same time, like サボる (katakana + hiragana, means "not to attend school"), キャンプ場 (katakana + kanji, means "camping"), 寝る (kanji + hiragana, means "to sleep")
- There are different Japanese encodings: Shift_JIS, ISO-2022-JP and EUC-JP. So how to crawl pages with different encodings?
- Half-width ｱ is the same as full-width ア, ｲ as イ, ｳ as ウ, ｶ as カ... Apart from these signs (about 50), no other matches can be made (nothing like "â" being treated like "a")

Sounds pretty hard, but it is not.

Any idea on how to do it? Can anyone give me a hand on this?

Charter
01-06-2004, 04:33 AM
Hi. Assuming Shift_JIS, ISO-2022-JP and EUC-JP are different encodings for the same set of characters, then a utility could be written, or may already be available, to convert both ISO-2022-JP and Shift_JIS to EUC-JP. The utility could be invoked based on the charset attribute of the meta tag, converting as necessary for storage in MySQL with charset ujis and utilizing multi-byte (http://www.php.net/mbstring) string functions where needed. This method could also be used to convert from a number of encodings to UTF-8.
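
For illustration only, a minimal sketch of that kind of conversion using the mbstring extension (this is not PhpDig code; the helper name and the idea of reading $charset from the meta tag are assumptions):

<?php
// Hypothetical helper: convert fetched page text to EUC-JP before indexing.
// $charset would come from the page's meta tag charset attribute.
function to_euc_jp($text, $charset)
{
    $charset = strtoupper(trim($charset));
    if ($charset == 'SHIFT_JIS' || $charset == 'X-SJIS') {
        $charset = 'SJIS'; // name mbstring uses for Shift_JIS
    }
    // mbstring can convert between SJIS, ISO-2022-JP (JIS) and EUC-JP
    return mb_convert_encoding($text, 'EUC-JP', $charset);
}
?>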

Edomondo
01-06-2004, 08:00 AM
You're right, they are all different encodings for the same character set. The most common is probably Shift_JIS.
I suppose that developing such a utility wouldn't be a problem for me, but I'll be looking for something similar over the net.

But how to deal with the space issue?
PhpDig won't be able to index words. I can't think of any way to work around this.
Will it have to index each phrase separately as a single word?

Charter
01-06-2004, 10:10 AM
Hi. In robot_functions.php there is $separators = " "; and this is what breaks keywords on spaces. You could add other characters to $separators, but I am not familiar enough with Japanese to suggest appropriate separators. :(

One other thing from php.net (http://www.php.net/mbstring):


Character encodings work with PHP:
ISO-8859-*, EUC-JP, UTF-8

Character encodings do NOT work with PHP:
JIS, SJIS

Edomondo
01-06-2004, 12:38 PM
Hi! I found exactly what I needed!!
http://www.spencernetwork.org/jcode-LE/
There are functions to convert from/to EUC-JP, Shift_JIS & ISO-2022-JP (JIS), and others to convert between full-width & half-width characters.
I haven't found time to test it yet. The documentation is in Japanese, but I can make a translation if anyone is interested in it.
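
(As an aside, not mentioned in the thread: if the mbstring extension is enabled, PHP's own mb_convert_kana() can also convert between full-width and half-width characters; a minimal sketch, assuming text already in EUC-JP:)

<?php
// Sketch only: $text is page content already converted to EUC-JP.
// "K" turns half-width katakana into full-width katakana,
// "a" turns full-width alphanumerics into half-width ones.
$normalized = mb_convert_kana($text, "Ka", "EUC-JP");
?>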

I've thought about separators in Japanese. There are half- and full-width spaces, half- and full-width periods, half- and full-width commas, half- and full-width apostrophes, ...
Can I add several separators in the same string? I guess I must enter unencoded characters, e.g.  @ for the full-width space, correct? Or do I need to separate each separator with a sign (comma...)?

BTW, thanks for your help Charter. ;) It's greatly appreciated.

Charter
01-06-2004, 12:59 PM
Hi. For example, say the "word" was "big/long/phrase and then some" and you wanted to break this up. You could set $separators = " /"; so that keywords would be made on spaces and slashes. More info on strtok can be found here (http://www.php.net/manual/en/function.strtok.php). Also, I'd be interested in the translation.
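
A quick illustration of that (just a sketch, not PhpDig's actual indexing code):

<?php
$string = "big/long/phrase and then some";
$separators = " /"; // break on spaces and slashes
$tok = strtok($string, $separators);
while ($tok !== FALSE) {
    echo "Word=$tok<br />"; // big, long, phrase, and, then, some
    $tok = strtok($separators);
}
?>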

Edomondo
01-07-2004, 06:41 AM
I've done a translation of the Jcode-LE readme.txt file. It might not always be clear, as neither English nor Japanese is my mother tongue :-( I've also indicated where there should have been Japanese characters (replaced by *** due to the txt format)

It would be great if future versions of PhpDig accepted several encodings in both indexing and the interface.

Thanks, strtok() is clearer to me now.

Edomondo
01-08-2004, 06:51 AM
I did some testing.
Jcode works great to convert from one Japanese encoding to another; it is definitely what I was looking for!

strtok() can take a separator pattern made of more than one character.

But:

<?php
$string = "This/*is/*an/*example/*string";
$separator = "/*";
/* tokenize on the characters in $separator */
$tok = strtok($string, $separator);
while ($tok) {
    echo "Word=$tok<br />";
    $tok = strtok($separator);
}
?>

Must be replaced by:

<?php
$string = "This/*is/*an/*example/*string";
$separator = "/*";

$tok = strtok($string, $separator);
while ($tok !== FALSE) {
    $toks[] = $tok;
    $tok = strtok($separator);
}

while (list($k, $v) = each($toks)) {
    echo "Word=$v<br />";
}
?>

So, it might be possible to use multi-byte characters to separate words. Am I right?

Now I'm going to need help configuring correctly $phpdig_words_chars and $phpdig_string_subst.

Charter
01-08-2004, 07:27 AM
Hi. There is a fix here (http://www.phpdig.net/showthread.php?threadid=297) thanks to janalwin. This avoids problems when $tok evaluates to false, such as when there is a zero in the text. I'm not sure why you are adding the second while loop. With while ($tok !== FALSE) all of the $tok will print, but with while ($tok) printing stops after "string".

$string = "This/*is/*an/*example/*string/0/and/some*more*text";
$separator = "/*";
$tok = strtok($string, $separator);
while ($tok !== FALSE) { // try with while ($tok) to compare
    echo "Word=$tok<br />";
    $tok = strtok($separator);
}

Note how $separator is used to tokenize on / and * individually, not on the two-character sequence /*: strtok breaks whenever any one of the characters in $separator is found. The while loop tokenizes on / and *, so it may appear to only break on /*, but that is not the case.

Edomondo
01-08-2004, 08:33 AM
Damned! I'm afraid you're right. "/*" is used to tokenize the string just like "*/", "/" & "*"... So it's not working as I'd like it to.

It should use as separator:
- a single character (space, comma, period...)
- a multi-byte character code (in Shift_JIS, @ is the space, A is the comma, B is the period...)

I've run out of ideas on this issue :(

Edomondo
01-08-2004, 12:04 PM
OK, I found how to replace the punctuation.
In the previous example, using /* to tokenize the string can be achieved this way:

<?php
$string = "This/*is/*an/*example/*0/*string.";
$separator = " ";

$replace_separator = array("/*" => $separator,
                           "."  => $separator);

$string = trim(strtr($string, $replace_separator));

$tok = strtok($string, $separator);
while ($tok !== FALSE) { // try with while ($tok) to compare
    echo "Word=$tok<br />";
    $tok = strtok($separator);
}
?>

I guess I'll also have to give MAX_WORDS_SIZE the highest value possible.

Now, how can I configure $phpdig_string_subst['EUC-JP'] and $phpdig_string_chars['EUC-JP']? It's still a bit confusing to me.

Every character composing a multi-byte character will go in $phpdig_string_subst, right?
e.g. : ‚ÆÄ*l‹CÌ_é–Ÿ‰æÅ...

And $phpdig_string_chars['EUC-JP'] = '[:alnum:]'; seems correct as all characters will be converted to half-width EUC-JP characters during indexing.

Charter
01-08-2004, 03:39 PM
Hi. I haven't done any benchmarks so I'm not sure, but for a lot of processing the following might be faster:

$separator = " ";
$string = "This/*is/*an/*example/*0/*string.";
$string = str_replace("/*"," ",$string);
$tok = strtok($string, $separator);
while ($tok !== FALSE) {
echo "Word=$tok<br />";
$tok = strtok($separator);
}

As for the $phpdig_string_subst and $phpdig_words_chars variables, $phpdig_string_subst['EUC-JP'] = 'Q:Q,q:q'; and $phpdig_words_chars['EUC-JP'] = '[:alnum:]ÆÄ*l‹CÌ_é–Ÿ‰æÅ...';

This seems backwards when reading the instructions in the config.php file, but with encodings that don't have Latin counterparts, it's the way I figured to make PhpDig version 1.6.5 work with other languages.

Is your MySQL charset ujis? You can find some MySQL charsets and their descriptions here (http://www.mysql.com/doc/en/Charset-asian-sets.html).

Edomondo
01-09-2004, 05:59 AM
Actually, the search engine I'm aiming at will have support for different languages and encodings. Japanese characters will be processed like non-multi-byte characters. The decoding to Japanese characters will be done at the end, in the browser. Storage in the DB and plain TXT files will contain non-encoded characters.
So I prefer using a common charset in MySQL, not a specific one for Japanese.

I've set up a list of all the possible separators (non-encoded) in Japanese.
For the Shift_JIS encoding, there will be:
@
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
[
\
]
^
_
`
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
{
|
}
~

€

‚
ƒ
"
…
*
‡
ˆ
‰
*
‹
Œ

Ž


'
'
"
"
o
-
-
˜
™
š
›
œ

ž
Ÿ

¡
¢
£
¤
¥
¦
§
¨
©
ª
"

¸
¹
º
"
¼
½
¾
¿
È
É
Ê
Ë
Ì
Í
Î
Ú
Û
Ü
Ý
Þ
ß
*
á
â
ã
ä
å
æ
ç
è
ð
ñ
ò
ó
ô
õ
ö
÷
ü

What would be the fastest way to achieve this?

$phpdig_string_subst for Shift_Jis would look like:

$phpdig_string_subst['Shift_Jis'] = 'A:A,a:a,B:B,b:b,C:C,c:c,D:D,d:d,E:E,e:e,F:F,f:f,G:G,g:g,H:H,h:h,I:I,i:i,J: J,j:j,K:K,k:k,L:L,l:l,M:M,m:m,N:N,n:n,O:O,o:o,P:P,p:p,Q:Q,q:q,R:R,r:r,S:S,s :s,T:T,t:t,U:U,u:u,V:V,v:v,W:W,w:w,X:X,x:x,Y:Y,y:y,Z:Z,z:z';

Is that correct?

Building a correct $phpdig_words_chars wouldn't be a problem either. I'll post a try soon for both Shift_JIS and EUC-JP.

Charter
01-09-2004, 06:35 AM
Japanese characters will be processed like non multi-byte characters. The decoding to Japanese characters will be done at the end in the browser.

Hi. I'm not sure what you mean. Are you planning on storing HTML entities instead?

The $phpdig_string_subst['Shift_Jis'] variable posted isn't correct. There is no need to include all the Latin letters in the variable. Setting $phpdig_string_subst['EUC-JP'] = 'Q:Q,q:q'; is all that is necessary if no "transformation" between characters is needed.

If you are looking to incorporate multiple encodings, you might consider UTF-8 instead.

Edomondo
01-09-2004, 12:41 PM
Hi. I meant that each Japanese character will be treated as a pair of single-byte characters.

I can't use UTF-8 because Jcode-LE only copes with EUC-JP, Shift_JIS and ISO-2022-JP (JIS). The indexed pages can only be encoded with one of those encodings. They will all be converted to EUC-JP in this project.

The content of indexed pages will have to:
- be converted to the reference encoding of the site (EUC-JP in this case) using Jcode-LE.
- have the punctuation signs replaced by spaces with strtr or str_replace.
Is that correct? Will it be enough to make it work?
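
Something like this, perhaps (a rough sketch only; mb_convert_encoding is used as a stand-in because Jcode-LE's function names aren't shown in this thread, and the separator list is abbreviated):

<?php
// Sketch: prepare fetched page text before tokenizing.
// 1) convert to the site's reference encoding (EUC-JP)
$text = mb_convert_encoding($text, 'EUC-JP', 'SJIS,JIS,EUC-JP');
// 2) replace Japanese punctuation with plain spaces (abbreviated list)
$separators = array("¡¢" => " ", "¡£" => " ", "¡¤" => " ", "¡¥" => " ");
$text = trim(strtr($text, $separators));
?>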

Since parts of phrases (rather than whole words) will be indexed, searches will be performed on parts of words.

This is the list of separators for EUC-JP:
¢£
¢¤
¢¥
¢¦
¢§
¢¨
¢©
¢ª
¢«
¡¦
¢_
¢®
¢º
¢»
¢¼
¢½
¢¾
¢¿
¢À
¢Á
¢Ê
¢Ë
¢Ì
¢Í
¢Î
¢Ï
¢Ð
¢Ü
¢Ý
¢Þ
¢ß
¢*
¢á
¢â
¢ã
¢ä
¢å
¡¢
¢æ
¡£
¢ç
¡¤
¢è
¡¥
¢é
¢ê
¡§
¡¨
¡©
¡ª
¡«
¡¬
¡_
¡®
¢ò
¡¯
¢ó
¡°
¢ô
¡±
¢õ
¡²
¢ö
¡³
¢÷
¡´
¢ø
¡µ
¢ù
¡¶
¡·
¡¸
¡¹
¡º
¢þ
¡»
¡¼
¡½
¡¾
¡¿
¡À
¡Á
¡Â
¡Ã
¡Ä
¡Å
¡Æ
¡Ç
¡È
¡É
¡Ê
¡Ë
¡Ì
¡Í
¡Î
¡Ï
¡Ð
¡Ñ
¡Ò
¡Ó
¡Ô
¡Õ
¡Ö
¡×
¡Ø
¡Ù
¡Ú
¡Û
¡Ü
¡Ý
¡Þ
¡ß
¡¦
¡*
¡á
¡â
¡ã
¡ä
¡å
¡æ
¡ç
¡è
¡é
¡ê
¡ë
¡ì
¡*
¡î
¡ï
¡ð
¡ñ
¡ò
¡ó
¡ô
¡õ
¡ö
¡÷
¡ø
¡ù
¡ú
¡û
¡ü
¡ý
¡þ
¢¡
¡¢
¢¢
¡£
¢£
¡¤
¢¤
¡¥
¢¥
¢¦
¡§
¢§
¡¨
¢¨
¡©
¢©
¡ª
¢ª
¡«
¢«
¡¬
¢¬
¡_
¢_
¡®
¢®
¡¯
¡°
¡±
¡²
¡³
¡´
¡µ
¡¶
¡·
¡¸
¡¹
¡º
¢º
¡»
¢»
¡¼
¢¼
¡½
¢½
¡¾
¢¾
¡¿
¢¿
¡À
¢À
¡Á
¢Á
¡Â
¡Ã
¡Ä
¡Å
¡Æ
¡Ç
¡È
¡É
¡Ê
¢Ê
¡Ë
¢Ë
¡Ì
¢Ì
¡Í
¢Í
¡Î
¢Î
¡Ï
¢Ï
¡Ð
¢Ð
¡Ñ
¡Ò
¡Ó
¡Ô
¡Õ
¡Ö
¡×
¡Ø
¡Ù
¡Ú
¡Û
¡Ü
¢Ü
¡Ý
¢Ý
¡Þ
¢Þ
¡ß
¢ß
¡*
¢*
¡á
¢á
¡â
¢â
¡ã
¢ã
¡ä
¢ä
¡å
¢å
¡æ
¢æ
¡ç
¢ç
¡è
¢è
¡é
¢é
¡ê
¢ê
¡ë
¡ì
¡*
¡î
¡ï
¡ð
¡ñ
¡ò
¢ò
¡ó
¢ó
¡ô
¢ô
¡õ
¢õ
¡ö
¢ö
¡÷
¢÷
¡ø
¢ø
¡ù
¢ù
¡ú
¡û
¡ü
¡ý
¡þ
¢þ
¢¡

I also set up $phpdig_words_chars for EUC-JP and Shift_JIS:

$phpdig_words_chars['EUC-JP'] = '[:alnum:]@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…*‡ˆ‰*‹ŒŽ ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°± ³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ× ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúûü ';
$phpdig_words_chars['Shift_JIS'] = '[:alnum:]@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…*‡ˆ‰*‹ŒŽ ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±² ³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ× ÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúû';

Does it seem OK?

Charter
01-11-2004, 01:48 PM
$phpdig_words_chars['EUC-JP'] = '[:alnum:]@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…*‡ˆ‰*‹ŒŽ ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°± ³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ× ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúûü ';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…*‡ˆ‰*‹ŒŽ ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±² ³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ× ÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúû';


There is a range of characters from ~ to _ that is unassigned, so if these encodings do not use those characters, you might try the following.

$phpdig_words_chars['EUC-JP'] = '[:alnum:]_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅ ÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêë ì*îïðñòóôõö÷øùúûüý';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅ ÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêë ì*îïðñòóôõö÷øùúû';

If these encodings do use the characters between ~ and _ then you might try the following.

$phpdig_words_chars['EUC-JP'] = '[:alnum:]€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š› žŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁ ÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæç éêëì*îïðñòóôõö÷øùúûüý';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œ žŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁ ÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçè éêëì*îïðñòóôõö÷øùúû';

In either case though the Latin letters shouldn't need to be included as [:alnum:] takes care of this. Of course, don't allow "bad" characters.

Edomondo
01-12-2004, 01:44 PM
Originally posted by Charter
If these encodings do use the characters bewteen ~ and _ then you might try the following.
$phpdig_words_chars['EUC-JP'] = '[:alnum:]€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š› žŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁ ÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæç éêëì*îïðñòóôõö÷øùúûüý';

$phpdig_words_chars['Shift_JIS'] = '[:alnum:]€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œ žŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁ ÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçè éêëì*îïðñòóôõö÷øùúû';

Thank you Charter!

These encodings actually use the characters from ~ to _.
This is how I built these strings:
EUC-JP uses characters from ASCII 64 to 254, except 127 and 142.
Shift_JIS uses characters from ASCII 64 to 252, except 127.
These characters appear either in the first or the second position of multi-byte characters.

What do you mean by "bad character"?

BTW, I found the full version of Jcode (the previous one was a Light Edition) at http://www.spencernetwork.org/jcode/.
It has support for UTF-8, but I haven't been able to make it work. Each character is replaced by ? in the source. I think it must come from an incompatibility on the server. It doesn't work locally either (on an Apache emulator)

Apparently, all I need to do to index multi-byte words is to change phpdigEpureText for the Japanese encodings so that it replaces the separators with a space and converts the text to the right encoding.
I haven't tried to implement it in PhpDig yet, but phpdigEpureText was the reason why no words were indexed in the Japanese pages. I tested the function alone and it returned only fragments of less than 2 letters for Japanese input.

Edomondo
01-14-2004, 05:16 AM
I'm now facing another problem in splitting strings.
e.g.: ¥³¡¼¥Ê¡¼¤â¤¢¤ê¤Þ¤¹ (in EUC-JP)
is made of 9 multi-byte characters:
¥³ ¡¼ ¥Ê ¡¼ ¤â ¤¢ ¤ê ¤Þ ¤¹
But the script I wrote considers ¢¤ a separator and replaces it with a space, even though here it is the end of ¤¢ and the beginning of ¤ê (2 multi-byte characters).
The script returns: ¥³¡¼¥Ê¡¼¤â¤ ê¤Þ¤¹, which results in the end of the string being nonsense.

The only way to get rid of this bug would be to check each pair of characters to see whether it is a multi-byte character, and replace it with a space only if it really is a separator. But wouldn't such a script be too time-consuming? Any idea on how to achieve this?

Charter
01-14-2004, 07:43 AM
Hi. Perhaps try the following:

<?php
$string = "¥³¡¼¥Ê¡¼¤â¤¢¤ê¤Þ¤¹AND¼¤â¢¤ê¤Þ¤¹";
$string = eregi_replace("([^¤])¢¤","\\\\1 ",$string);
echo $string;
// returns ¥³¡¼¥Ê¡¼¤â¤¢¤ê¤Þ¤¹AND¼¤â ê¤Þ¤¹
?>

Also, you may find this (http://www.cs.tut.fi/~jkorpela/chars.html) tutorial helpful.

Edomondo
01-15-2004, 05:07 AM
Hi. Thank you for the help, I'll test the code you submitted. Though it might not work in some rare cases (when ¤ really is the character before ¢¤), I think I'll go for it.

BTW, other search engines are based on a dictionary for multi-byte encodings. The dictionary is a txt file that contains one word per line. The script extracts the longest matching word from the page text and indexes it.
My question is: would it be possible to implement such a dictionary tool in PhpDig?
If so, I would be happy to build a Japanese dictionary.

Charter
01-15-2004, 07:06 AM
Hi. There is a ¤¢¤ combo in the $string variable where ¢¤ is not replaced with a space. Did you mean something else?

>> The script extracts the longest matching word from the page text and indexes it.

With the multi-byte dictionary, is it that only the longest matching word from a page gets indexed?

Edomondo
01-15-2004, 08:26 AM
Originally posted by Charter
Hi. There is a ¤¢¤ combo in the $string variable where ¢¤ is not replaced with a space. Did you mean something else?

Errrr... I'm not sure I understand.
The script you submitted uses a regular expression to prevent replacing ¢¤ if the character before it is ¤, right?
I meant that in the case where the character before ¢¤ really is a multi-byte character ending with ¤, ¢¤ is not replaced. But I think there is little chance of that happening.

Originally posted by Charter
>> The script extracts the longest matching word from the page text and indexes it.

With the multi-byte dictionary, is it that only the longest matching word from a page gets indexed?

No, of course not; it will extract all the words, comparing the page content against the longest words first. E.g. in English, it wouldn't extract "nation" from "internationalization" if "internationalization" is in the dictionary.
But the dictionary must be as complete as possible to do a good job.
Can it be integrated into PhpDig?

Charter
01-15-2004, 01:25 PM
Hi. Try using mb_eregi_replace (http://www.php.net/manual/en/function.mb-eregi-replace.php) in place of eregi_replace, but note that some of the PHP multi-byte functions are experimental.

As for a dictionary, you might try the following. In spider.php add:

$my_dictionary = phpdigComWords("$relative_script_path/includes/my_dictionary.ext");

after the following:

$common_words = phpdigComWords("$relative_script_path/includes/common_words.txt");

In robot_functions.php replace:

if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^[0-9a-zßðþ]',$key))

with the following:

if (mb_strlen($key) > SMALL_WORDS_SIZE and mb_strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and isset($my_dictionary[$key]) and mb_ereg('^['.$phpdig_words_chars[PHPDIG_ENCODING].']',$key))

Also apply any other changes given in this (http://www.phpdig.net/showthread.php?threadid=275) thread and use multi-byte (http://www.php.net/manual/en/ref.mbstring.php) functions in place of their single-byte counterparts.
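
As a rough illustration only (not from this thread): before relying on the mb_* functions, the mbstring internal and regex encodings would likely need to be set, e.g.:

// Sketch, assuming an EUC-JP pipeline; both calls are standard mbstring functions.
mb_internal_encoding('EUC-JP'); // used by mb_strlen() and friends
mb_regex_encoding('EUC-JP');    // used by mb_ereg() and mb_eregi_replace()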

The thing is, of course, to make sure that things that were treated as single-byte are now treated as multi-byte. The $phpdig_words_chars and $phpdig_string_subst variables may need to be handled differently too, so that the characters are seen as multi-byte rather than single-byte.

PhpDig was originally written for single-byte use. In theory it seems that it could be converted to multi-byte use, but in practice it's going to take time and tweaking, and in the end hopefully it works.

Edomondo
01-17-2004, 03:48 AM
Hi. Thank you Charter for your help.
It is working, except that there is still one problem.
There are no spaces in multi-byte character encodings. That's why we use a dictionary that contains all the words of a language to extract words from the text.

If it were in English, the phrase:
"NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope."
would be split using a dictionary containing:
"nasa
announced
yesterday
cancelling
space
shuttle
servicing
missions
hubble
space
telescope
..."

It must also find the longest words first (for example, find the word "yesterday" before the word "day")

I tried to use the strstr() function, but I haven't succeeded. Can anyone help me?
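
For what it's worth, here is a minimal sketch of that longest-match idea (a hypothetical helper, not code from PhpDig or this thread; the dictionary and input are the ones from the example above):

<?php
// Sketch only: greedy longest-match segmentation against a dictionary.
function by_length_desc($a, $b) { return strlen($b) - strlen($a); }

function dict_split($text, $dict)
{
    $text = strtolower($text);
    usort($dict, 'by_length_desc');              // try the longest words first
    $words = array();
    foreach ($dict as $word) {
        if (strpos($text, $word) !== FALSE) {
            $words[] = $word;
            $text = str_replace($word, " ", $text); // take the word out of the text
        }
    }
    return array($words, trim($text));           // found words and leftover fragments
}

list($words, $rest) = dict_split(
    "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope.",
    array("nasa", "announced", "yesterday", "cancelling", "space", "shuttle",
          "servicing", "missions", "hubble", "telescope"));
// $rest is roughly "itis all    tothe   ."
?>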

Charter
01-18-2004, 08:45 AM
Hi. Without seeing the code, you might try using a multi-byte (http://www.php.net/manual/en/ref.mbstring.php) function in place of the strstr function.

Edomondo
01-21-2004, 05:42 AM
Thank you. So, mb_strpos() seems to be a more sensible choice, but the server where the search engine is hosted for testing doesn't have the multi-byte functions enabled, so I can't check it :-(

But if the dictionary is built correctly, there shouldn't be any problem when using non-multi-byte functions.

If I use my previous example, it should index all the words from the dictionary found in the string and take them out of it. At the end, "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope." would become "itis all tothe".

The string of unfound words can also be used to define common words or to expand the dictionary.

What is the quickest and smartest way to achieve this?

Charter
01-22-2004, 01:30 PM
Hi. It sounds like "itis all tothe" could be written to a file of common words, but as for writing words to a common words file versus a dictionary file, that seems to require yet another dictionary so that a script could determine which file to write to, unless you were to set some sort of parameter, such as length/pattern/etc., so the script would know where to write the phrase.

Edomondo
01-27-2004, 06:04 AM
I didn't mean that the unfound words would automatically be used as excluded words, but it may be helpful to see which ones are not part of the dictionary and update it if necessary.

However, Japanese is not as easy as English, and it is just impossible to set a list of words to exclude.

I'm still trying to break a sentence into different words using a dictionary. But my main concern is the speed of the script.

Any scripts are welcome! ;-)