PhpDig.net


Edomondo 11-23-2004 07:36 AM

Using a dictionary to spider pages
 
In the thread http://www.phpdig.net/forum/showthread.php?t=355 about spidering multi-byte encodings, we found that the only way to tokenize text in a language that has no word separators is to use a dictionary.

Such a dictionary would contain every word of the language, so it would be a very large file. Each word in the original text must be extracted by matching against the dictionary so that it can be stored as a keyword.
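
As a rough sketch of how such a dictionary file might be loaded, assuming a plain-text file with one word per line: the helper name load_dictionary() and the file layout are assumptions made for illustration, not something PhpDig provides. Storing each word as an array key means a lookup is a single isset() call rather than a scan over the whole list.

```php
<?php
// Hypothetical helper (not part of PhpDig): load a word list,
// one word per line, into an array keyed by the word itself.
function load_dictionary(string $path): array
{
    $index = array();
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return $index; // unreadable file: return an empty dictionary
    }
    foreach ($lines as $line) {
        $word = trim($line);
        if ($word !== '') {
            $index[$word] = true; // the word is the key, the value is unused
        }
    }
    return $index;
}

// Demo: a temporary file stands in for the real, much larger dictionary.
$path = tempnam(sys_get_temp_dir(), 'dico');
file_put_contents($path, "東京\n京都\n東京都\n\n住む\n");
$dico = load_dictionary($path);
unlink($path);

var_dump(isset($dico['東京都'])); // bool(true)
var_dump(isset($dico['大阪']));   // bool(false)
echo count($dico), "\n";          // 4
```

Loading is only half of the job; matching the text against this index is a separate extraction function.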

I tried to develop such a function, but I'm afraid it's not fast enough. In the example below I used an English text with the spaces removed.

PHP Code:

$text = "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope.";

$dico = array();

$dico[1] = "nasa";
$dico[2] = "announced";
$dico[3] = "yesterday";
$dico[4] = "cancelling";
$dico[5] = "space";
$dico[6] = "shuttle";
$dico[7] = "servicing";
$dico[8] = "missions";
$dico[9] = "hubble";
$dico[10] = "telescope";
$dico[11] = "other";
$dico[12] = "words";
$dico[13] = "here";

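// Slide over the text one character at a time.
for ($j = 0; $j < strlen($text); $j++)
    {
    // Compare the substring starting at position $j with each dictionary word.
    for ($i = 1; $i <= count($dico); $i++)
        {
        if (strtolower(substr($text, $j, strlen($dico[$i]))) == strtolower($dico[$i]))
            {
            echo $dico[$i]." ";
            break;
            }
        }
    }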

Each matched word is displayed, with a space between words.
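
One direction that might speed this up, offered as a rough sketch and not a tested PhpDig patch: the loop above compares every text position against every dictionary word, so the work grows with the size of the dictionary. If the words are stored as array keys, each position needs at most as many isset() lookups as the longest word has characters. The function name tokenize_longest_match() is hypothetical. Note that this is a different technique from the code above: it takes the longest word at each position (greedy longest match) and then jumps past it. It also works on bytes, so it assumes single-byte text like the English sample; multi-byte text would need character-aware splitting.

```php
<?php
// Hypothetical helper: greedy longest-match tokenizer over single-byte text.
function tokenize_longest_match(string $text, array $words): array
{
    // Index the words by key so a lookup is one isset() call
    // instead of a scan over the whole dictionary.
    $index = array();
    $maxLen = 0;
    foreach ($words as $word) {
        $word = strtolower($word);
        $index[$word] = true;
        $maxLen = max($maxLen, strlen($word));
    }

    $text = strtolower($text);
    $n = strlen($text);
    $tokens = array();
    $pos = 0;
    while ($pos < $n) {
        $found = 0;
        // Try the longest possible candidate first, then shorter ones.
        for ($len = min($maxLen, $n - $pos); $len >= 1; $len--) {
            if (isset($index[substr($text, $pos, $len)])) {
                $found = $len;
                break;
            }
        }
        if ($found > 0) {
            $tokens[] = substr($text, $pos, $found);
            $pos += $found; // jump past the matched word
        } else {
            $pos++;         // no dictionary word starts here: skip one byte
        }
    }
    return $tokens;
}

$text = "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope.";
$dico = array("nasa", "announced", "yesterday", "cancelling", "space", "shuttle",
              "servicing", "missions", "hubble", "telescope", "other", "words", "here");

$tokens = tokenize_longest_match($text, $dico);
echo implode(" ", $tokens), "\n";
```

Text that is not covered by the dictionary (here "itis", "all", "tothe" and the final period) is simply skipped, as in the original loop.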

Can anyone help me or give me advice on how to speed up or improve the function?


All times are GMT -8.
