11-23-2004, 07:36 AM   #1
Edomondo
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
Using a dictionary to spider pages

In post http://www.phpdig.net/forum/showthread.php?t=355 about spidering multi-byte encodings, we found out that the only way to tokenize text in a language that has no word separators is to use a dictionary.

Such a dictionary would contain every word of the language, so it would be a very large file. Each word in the original text must be extracted using the dictionary so that it can be stored as a keyword.

I tried to develop such a function, but I'm afraid it's not fast enough. In the example below I used an English text with the spaces removed.

PHP Code:
$text = "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope.";

$dico = array();

$dico[1] = "nasa";
$dico[2] = "announced";
$dico[3] = "yesterday";
$dico[4] = "cancelling";
$dico[5] = "space";
$dico[6] = "shuttle";
$dico[7] = "servicing";
$dico[8] = "missions";
$dico[9] = "hubble";
$dico[10] = "telescope";
$dico[11] = "other";
$dico[12] = "words";
$dico[13] = "here";

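// Try every dictionary word at every position in the text
for ($j = 0; $j < strlen($text); $j++)
    {
    for ($i = 1; $i <= count($dico); $i++)
        {
        // Case-insensitive comparison of the text starting at $j with the word
        if (strtolower(substr($text, $j, strlen($dico[$i]))) == strtolower($dico[$i]))
            {
            echo $dico[$i]." ";
            break;
            }
        }
    }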
Each word found in the dictionary is displayed, with a space between them.
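
One direction that might help, as a minimal sketch rather than a tested solution: store the words as array keys so that each candidate substring is checked with isset() instead of scanning the whole list, and try the longest candidate first. The function name tokenize_with_dictionary is hypothetical. Unlike the loop above, this version jumps past a matched word and prefers the longest match. It also assumes single-byte text, so a multi-byte encoding would probably need mb_substr() and mb_strlen() instead.

```php
<?php
// Hypothetical helper: greedy longest-match tokenizer over a keyed dictionary.
function tokenize_with_dictionary($text, $dico)
{
    // Build a lookup table with the words as keys, and remember the length
    // of the longest word so we never test a longer substring than needed.
    $lookup = array();
    $maxLen = 0;
    foreach ($dico as $word) {
        $lookup[strtolower($word)] = true;
        $maxLen = max($maxLen, strlen($word));
    }

    $lower  = strtolower($text);   // lowercase once, not on every comparison
    $length = strlen($lower);
    $found  = array();

    $pos = 0;
    while ($pos < $length) {
        $matched = false;
        // Try the longest candidate first, then shorter ones.
        for ($len = min($maxLen, $length - $pos); $len > 0; $len--) {
            $candidate = substr($lower, $pos, $len);
            if (isset($lookup[$candidate])) {
                $found[] = $candidate;
                $pos += $len;      // jump past the matched word
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            $pos++;                // no dictionary word starts here; skip one byte
        }
    }
    return $found;
}

// Usage with a shortened version of the example above
$text = "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissions";
$dico = array("nasa", "announced", "yesterday", "cancelling", "space",
              "shuttle", "servicing", "missions");

echo implode(" ", tokenize_with_dictionary($text, $dico));
```

With this layout the work per position depends on the length of the longest word rather than on the number of words in the dictionary, which should matter more as the dictionary grows. Greedy longest match can still split some texts wrongly, so it is only a starting point.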

Can anyone help me or give me advice on how to speed up or improve the function?