Using a dictionary to spider pages
In the post http://www.phpdig.net/forum/showthread.php?t=355 about spidering multi-byte encodings, we found out that the only way to tokenize text in a language that doesn't have word separators is to use a dictionary.
Such a dictionary would contain every word of the language, so it would be a very large file. Each word in the original text must be extracted using the dictionary and stored as a keyword. I tried to develop such a function, but I'm afraid it's not fast enough. In the example below I used an English text with the spaces removed. PHP Code:
Can anyone help me or give me advice on how to speed up or improve the function?
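
As a starting point for discussion, here is a minimal sketch of one common approach: greedy longest-match segmentation, with the dictionary held as array keys so each lookup is a hash lookup rather than a scan of the word list. This is not the function from the post above. The names `build_dictionary` and `segment_text` are hypothetical, the tiny word list stands in for a real dictionary file, and the code assumes UTF-8 input and PHP 7.1 or later.

```php
<?php
// Hypothetical helper: turn a word list into a lookup table and record
// the length (in characters) of the longest word.
function build_dictionary(array $words): array
{
    $dict = [];
    $maxLen = 0;
    foreach ($words as $word) {
        $dict[$word] = true; // word as array key, so isset() is a hash lookup
        $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
        $maxLen = max($maxLen, count($chars));
    }
    return [$dict, $maxLen];
}

// Hypothetical helper: greedy longest-match segmentation.
// At each position, try the longest candidate first and shrink until a
// dictionary word is found; unknown characters are emitted one at a time.
function segment_text(string $text, array $dict, int $maxLen): array
{
    // Split once into UTF-8 characters so multi-byte text is handled correctly.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $len = count($chars);
    $tokens = [];
    $pos = 0;
    while ($pos < $len) {
        $matched = false;
        for ($size = min($maxLen, $len - $pos); $size > 0; $size--) {
            $candidate = implode('', array_slice($chars, $pos, $size));
            if (isset($dict[$candidate])) {
                $tokens[] = $candidate;
                $pos += $size;
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            // No dictionary word starts here: keep the single character
            // so the scan always advances.
            $tokens[] = $chars[$pos];
            $pos++;
        }
    }
    return $tokens;
}

// Usage with a toy dictionary and an English text with the spaces removed.
[$dict, $maxLen] = build_dictionary(['the', 'quick', 'brown', 'fox']);
$tokens = segment_text('thequickbrownfox', $dict, $maxLen);
print_r($tokens); // the, quick, brown, fox
```

Bounding the candidate length by the longest dictionary word keeps the work per position small and constant, whatever the size of the dictionary. Greedy matching can still pick the wrong split when one word is a prefix of another, so the results would need checking against real text in the target language.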