01-05-2004, 08:28 AM   #1
Edomondo
Japanese encoding : charset=shift_jis

Hi there! PhpDig is just great!

I'm considering improving the system so that it can spider and display several encodings at the same time, including the Japanese ones.

I've carefully read the 3 other topics dedicated to encoding issues.

I've thought of the following main points:
- There are no spaces in Japanese, so an algorithm can't tell where one word ends and the next begins, and it seems impossible to store keywords in the DB. But I can think of a way around it. There are 3 different types of characters in Japanese:
Hiragana (46 basic signs)
Katakana (46 basic signs)
Kanji (more than 50,000 signs)
It is possible to tell them apart by looking at the codes of these characters. E.g., Katakana "re" (レ) is encoded with two bytes. If the value of the second byte is between x and y, then the character is a Katakana. Same for the others.
But some words contain different types of characters at the same time, like サボる (Katakana + Hiragana, means "not to attend school"), キャンプ場 (Katakana + Kanji, means "campsite"), 寝る (Kanji + Hiragana, means "to sleep").
- There are several Japanese encodings: Shift_JIS, ISO-2022-JP and EUC-JP. So how can pages with different encodings be crawled?
- ｱ is the same as ア, ｲ as イ, ｳ as ウ, ｶ as カ... (half-width and full-width Katakana). Apart from these signs (about 50), no other matches need to be made (like "*", "â" being like "a").
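
The three points above can be sketched roughly as follows. This is only an illustration, written in Python rather than PhpDig's PHP to keep it short, and none of the function names (`decode_page`, `fold_width`, `script_of`, `script_runs`) exist in PhpDig. It assumes that each page's declared charset can be trusted, that everything is converted to Unicode before indexing so the "between x and y" test can use Unicode code point ranges instead of encoding-specific byte values, and that the equivalent signs in the last point are half-width and full-width Katakana.

```python
import unicodedata
from itertools import groupby

# Unicode blocks used to tell the three scripts apart (inclusive ranges).
SCRIPT_RANGES = {
    "hiragana": (0x3040, 0x309F),
    "katakana": (0x30A0, 0x30FF),
    "kanji": (0x4E00, 0x9FFF),  # basic CJK Unified Ideographs block only
}


def decode_page(raw, charset):
    """Decode a fetched page to Unicode using its declared charset.

    Python ships codecs for Shift_JIS, EUC-JP and ISO-2022-JP, so the rest
    of the pipeline only ever sees Unicode text.
    """
    return raw.decode(charset)


def fold_width(text):
    """Fold half-width Katakana (and full-width ASCII) via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


def script_of(ch):
    """Return 'hiragana', 'katakana', 'kanji' or 'other' for one character."""
    code = ord(ch)
    for name, (low, high) in SCRIPT_RANGES.items():
        if low <= code <= high:
            return name
    return "other"


def script_runs(text):
    """Split text into maximal runs of consecutive same-script characters."""
    return [(script, "".join(chars))
            for script, chars in groupby(text, key=script_of)]


if __name__ == "__main__":
    page = "キャンプ場".encode("shift_jis")      # pretend this came off the wire
    text = fold_width(decode_page(page, "Shift_JIS"))
    print(script_runs(text))
```

Splitting on script changes alone cuts サボる into サボ + る and 寝る into 寝 + る, which is exactly the mixed-script problem described in the first point, so the runs are better seen as a starting point for keyword extraction than as finished words.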

Sounds pretty hard, but it is not.

Any ideas on how to do it? Can anyone give me a hand with this?

Last edited by Edomondo; 01-05-2004 at 09:16 AM.