Using a dictionary to spider pages [Archive]

Edomondo

11-23-2004, 07:36 AM

In the thread http://www.phpdig.net/forum/showthread.php?t=355 about spidering multi-byte encodings, we found out that the only way to tokenize text in a language that doesn't have word separators is to use a dictionary.

Such a dictionary would contain every word of the language, so it would be a very large file. Each word of the original text must be matched against the dictionary and extracted so that it can be stored as a keyword.

I tried to write such a function, but I'm afraid it's not fast enough. In the example below I used an English text with the spaces removed.

$text = "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope.";

$dico = array();

$dico[1] = "nasa";
$dico[2] = "announced";
$dico[3] = "yesterday";
$dico[4] = "cancelling";
$dico[5] = "space";
$dico[6] = "shuttle";
$dico[7] = "servicing";
$dico[8] = "missions";
$dico[9] = "hubble";
$dico[10] = "telescope";
$dico[11] = "other";
$dico[12] = "words";
$dico[13] = "here";

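// Try every position in the text...
for ($j = 0; $j < strlen($text); $j++)
{
    // ...against every word in the dictionary.
    for ($i = 1; $i <= count($dico); $i++)
    {
        // Case-insensitive comparison of the dictionary word with the text at position $j.
        if (strtolower(substr($text, $j, strlen($dico[$i]))) == strtolower($dico[$i]))
        {
            echo $dico[$i]." ";
            break; // stop at the first word that matches at this position
        }
    }
}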

Each matched word is displayed, followed by a space.

Can anyone help me or give me advice on how to speed up or improve this function?
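
One possible direction, as a minimal sketch rather than a tested solution: the loop above compares every dictionary word at every position, so its cost grows with the size of the dictionary. If the words are used as array keys instead, each test becomes a single isset() lookup, and the work per position depends only on the length of the longest word. The function name tokenize_with_dictionary() is hypothetical and not part of PhpDig. The sketch takes the longest match at each position, which is an assumption that can split some texts wrongly, and it counts bytes, so real multi-byte text would need lengths counted in characters (for instance with the mbstring functions).

```php
<?php
// Hypothetical helper (not part of PhpDig): greedy longest-match tokenizer.
function tokenize_with_dictionary($text, array $words)
{
    // Use the words as array keys so that each test is a single isset().
    $lookup = array();
    $maxLen = 0;
    foreach ($words as $word) {
        $word = strtolower($word);
        $lookup[$word] = true;
        $maxLen = max($maxLen, strlen($word));
    }

    $lower  = strtolower($text); // lowercase once, not on every comparison
    $length = strlen($lower);
    $found  = array();
    $pos    = 0;

    while ($pos < $length) {
        $matched = false;
        // Try the longest possible candidate first, then shorter ones.
        for ($len = min($maxLen, $length - $pos); $len > 0; $len--) {
            $candidate = substr($lower, $pos, $len);
            if (isset($lookup[$candidate])) {
                $found[] = $candidate;
                $pos += $len;    // jump past the matched word
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            $pos++;              // no dictionary word starts here
        }
    }
    return $found;
}

$text = "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope.";
$dico = array("nasa", "announced", "yesterday", "cancelling", "space", "shuttle",
              "servicing", "missions", "hubble", "telescope", "other", "words", "here");

echo implode(" ", tokenize_with_dictionary($text, $dico)), "\n";
// nasa announced yesterday cancelling space shuttle servicing missions hubble space telescope
```

Unlike the original loop, this version skips ahead by the length of each matched word, so a word found inside another word is not reported a second time.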