PhpDig.net - Using a dictionary to spider pages

In the thread http://www.phpdig.net/forum/showthread.php?t=355 about spidering multi-byte encodings, we found out that the only way to tokenize text in a language that has no word separators is to use a dictionary.

Such a dictionary would contain every word of the language, so it would be a very large file. Each word in the original text must be extracted using the dictionary so that it can be stored as a keyword.

I tried to develop such a function, but I'm afraid it's not fast enough. In the example below I used an English text with the spaces removed.

PHP Code:


		
			
$text = "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope.";

// Sample dictionary; a real one would hold every word of the language.
$dico = array();
$dico[1] = "nasa";
$dico[2] = "announced";
$dico[3] = "yesterday";
$dico[4] = "cancelling";
$dico[5] = "space";
$dico[6] = "shuttle";
$dico[7] = "servicing";
$dico[8] = "missions";
$dico[9] = "hubble";
$dico[10] = "telescope";
$dico[11] = "other";
$dico[12] = "words";
$dico[13] = "here";

// Try every dictionary word at every position in the text.
for ($j = 0; $j < strlen($text); $j++)
    {
    for ($i = 1; $i <= count($dico); $i++)
        {
        // Case-insensitive comparison of the dictionary word with the text at position $j.
        if (strtolower(substr($text, $j, strlen($dico[$i]))) == strtolower($dico[$i]))
            {
            echo $dico[$i]." ";
            break; // stop at the first dictionary word that matches here
            }
        }
    }

Each matched word is displayed, followed by a space.

Can anyone help me or give me advice on how to speed up or improve the function?
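
One possible direction, offered as a minimal sketch rather than a tested solution: index the dictionary by word so that each lookup is a hash probe instead of a scan over every entry, lowercase the text only once, and jump past each word once it has been matched. The function name tokenize_with_dictionary is hypothetical and not part of PhpDig. The sketch assumes single-byte text, and taking the longest match at each position is a design choice of the sketch; the loop above takes the first dictionary entry that matches instead.

```php
<?php
// Hypothetical helper (not part of PhpDig): greedy longest-match tokenizer.
// Assumes single-byte text.
function tokenize_with_dictionary($text, $words)
{
    // Index the dictionary by word: isset() on an array key is a hash
    // lookup, so the cost no longer grows with the number of entries.
    $index = array();
    $maxLen = 0;
    foreach ($words as $word) {
        $word = strtolower($word);
        if ($word === '') {
            continue; // ignore empty entries
        }
        $index[$word] = true;
        $maxLen = max($maxLen, strlen($word));
    }

    $text = strtolower($text); // lowercase once, not on every comparison
    $length = strlen($text);
    $found = array();
    $pos = 0;

    while ($pos < $length) {
        $matched = false;
        // Try the longest possible candidate first, then shorter ones.
        $limit = min($maxLen, $length - $pos);
        for ($len = $limit; $len >= 1; $len--) {
            $candidate = substr($text, $pos, $len);
            if (isset($index[$candidate])) {
                $found[] = $candidate;
                $pos += $len; // skip past the matched word
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            $pos++; // no word starts here; move on one character
        }
    }

    return $found;
}
```

Called as tokenize_with_dictionary($text, $dico) with the sample data above, it returns the matched words as an array, which can be joined with implode(" ", ...). The work done at each position then depends on the length of the longest dictionary word rather than on the number of entries. For real multi-byte text, substr and strlen would presumably have to be replaced by mb_substr and mb_strlen, which this sketch does not attempt.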