PhpDig.net

11-23-2004, 07:36 AM   #1
Edomondo
Using a dictionary to spider pages

In post http://www.phpdig.net/forum/showthread.php?t=355 about spidering multi-byte encodings, we found out that the only way to tokenize text in a language that doesn't have word separators is to use a dictionary.

Such a dictionary would contain every word of the language, so it would be a very large file. Each word of the original text must be extracted using the dictionary so that it can be stored as a keyword.
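To make the idea concrete, here is a minimal sketch of how such a dictionary might be held in memory so that checking whether a string is a word takes a single hash lookup. The helper name build_dictionary and the sample words are assumptions made for illustration only; a real dictionary would be read from a file.

```php
<?php
// Hypothetical helper (not part of PhpDig): turn a word list into a
// lookup table keyed by the lowercased word.
function build_dictionary(array $words): array
{
    $dict = array();
    foreach ($words as $word) {
        $word = strtolower(trim($word));
        if ($word !== '') {
            // Only the key matters; isset() then finds a word without
            // scanning the whole list.
            $dict[$word] = true;
        }
    }
    return $dict;
}

$dict = build_dictionary(array("NASA", "announced", "yesterday"));
var_dump(isset($dict["nasa"]));   // bool(true)
var_dump(isset($dict["hubble"])); // bool(false)
```

Note that strtolower() works on bytes, so for a real multi-byte language the case folding would probably need mb_strtolower() from the mbstring extension, or no case folding at all.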

I tried to write a function that does this extraction, but I'm afraid it's not fast enough. In the example below I used an English text with the spaces removed.

PHP Code:
$text = "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope.";

// Sample dictionary; a real one would hold every word of the language.
$dico = array();

$dico[1] = "nasa";
$dico[2] = "announced";
$dico[3] = "yesterday";
$dico[4] = "cancelling";
$dico[5] = "space";
$dico[6] = "shuttle";
$dico[7] = "servicing";
$dico[8] = "missions";
$dico[9] = "hubble";
$dico[10] = "telescope";
$dico[11] = "other";
$dico[12] = "words";
$dico[13] = "here";

// At every position in the text, try each dictionary word in turn.
for ($j = 0; $j < strlen($text); $j++)
    {
    for ($i = 1; $i <= count($dico); $i++)
        {
        // Case-insensitive comparison of the word with the text at offset $j.
        if (strtolower(substr($text, $j, strlen($dico[$i]))) == strtolower($dico[$i]))
            {
            echo $dico[$i]." ";
            break;
            }
        }
    }
Each matched word is displayed, with a space between words.

Can anyone help me or give me advice on how to speed up or improve the function?
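One possible direction, offered only as a sketch: store the dictionary as array keys and do a greedy longest-match at each position, jumping past every word that is found. The work per position is then bounded by the longest word length rather than by the size of the dictionary. The function name segment and its parameters are assumptions for illustration, not PhpDig code, and greedy matching can pick the wrong split in a real language.

```php
<?php
// Sketch of greedy longest-match segmentation (hypothetical helper).
// $dict has the known words as keys; $maxLen is the longest word length.
function segment(string $text, array $dict, int $maxLen): array
{
    $text = strtolower($text);
    $len = strlen($text);
    $found = array();
    $pos = 0;
    while ($pos < $len) {
        $matched = 0;
        // Try the longest candidate first, then shorter ones.
        for ($l = min($maxLen, $len - $pos); $l >= 1; $l--) {
            $candidate = substr($text, $pos, $l);
            if (isset($dict[$candidate])) {
                $found[] = $candidate;
                $matched = $l;
                break;
            }
        }
        // Jump past a matched word; otherwise skip one unknown byte.
        $pos += $matched > 0 ? $matched : 1;
    }
    return $found;
}

$words = array("nasa", "announced", "yesterday", "cancelling", "space",
               "shuttle", "servicing", "missions", "hubble", "telescope");
$dict = array_fill_keys($words, true);
$maxLen = max(array_map('strlen', $words));

$text = "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope.";
echo implode(" ", segment($text, $dict, $maxLen)), "\n";
// prints: nasa announced yesterday cancelling space shuttle servicing missions hubble space telescope
```

This version works on bytes, like the original. For a multi-byte encoding the same loop would presumably use mb_strlen() and mb_substr() from the mbstring extension instead of strlen() and substr().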