Auto language guesser

02-22-2005, 06:19 AM
First, don't expect too much from this post: I've only done half of the job (the easier half ;)).

Now that PhpDig can spider sites in multiple encodings, it can also spider multilingual sites, and you will probably want to identify the language of each page.

There are several tools for guessing languages, and happily a few of them are free!
I came across Languid, a statistical language identifier by Maciej Ceglowski (http://languid.cantbedone.org/).
It's a great tool that can guess 72 languages with good accuracy.
It was originally written in Perl (the source can be found at http://search.cpan.org/~mceglows/).

I wrote a small function that guesses the language of a text, based on Languid's XML API.

Now I need help integrating it into PhpDig. It will slow the spidering process a little, but it would be worth it. We would need to add this function to robot_functions.php and create a new field in the spider table to store the language ID.
Or maybe someone could write a PHP port of Maciej's original Perl script.

Anyone ready to give me a hand with this?

02-28-2005, 04:09 AM
Second step: set languages for your indexed pages.
Download both files attached here.
Upload them to your PhpDig admin directory.

Then add a language column to the spider table in MySQL (add your table prefix if necessary):

ALTER TABLE `spider` ADD `language` CHAR(2) NOT NULL;
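
If your tables use a prefix, the statement has to name the prefixed table. Here is a minimal sketch of building it; the helper name language_column_sql and the example prefix 'phpdig_' are made up for illustration and are not part of PhpDig.

```php
<?php
// Hypothetical helper: builds the ALTER TABLE statement for a given table prefix.
// $prefix is whatever prefix your PhpDig tables use (empty string if none).
function language_column_sql($prefix = '')
{
    // Allow only characters that are safe inside a backtick-quoted identifier.
    if (!preg_match('/^[A-Za-z0-9_]*$/', $prefix)) {
        throw new InvalidArgumentException('Unexpected characters in table prefix');
    }
    return "ALTER TABLE `{$prefix}spider` ADD `language` CHAR(2) NOT NULL;";
}

// Print the statement so it can be pasted into your MySQL client.
echo language_column_sql('phpdig_'), "\n";
```

With an empty prefix, this gives back the same statement as above.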
Log in to the admin and open find_language.php in your browser. It goes through your pages and tries to guess the language of each page that doesn't have a language set yet. It uses the text stored in text_content, so you can't use this feature unless text storage is activated in includes/config.php:

define('CONTENT_TEXT', 1);

(if CONTENT_TEXT is set to 0, change it to 1 and respider your sites)
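
To give an idea of what the guessing pass does, here is a rough sketch of the loop: skip pages that already have a language, guess the rest, and keep only two-letter codes. This is not the code from the attached find_language.php. The row keys (spider_id, language, text), the function name and the stand-in guesser are all assumptions made for the example; the real script would call Languid and read from the database.

```php
<?php
// Sketch only: $pages mimics rows from the spider table together with their stored text.
// The keys 'spider_id', 'language' and 'text' are assumptions for this example.
// $guess is any callable that takes a text and returns a language code ('' on failure).
function guess_missing_languages(array $pages, $guess)
{
    $updates = array();
    foreach ($pages as $page) {
        // Skip pages that already have a language set.
        if ($page['language'] !== '') {
            continue;
        }
        $code = strtolower(trim((string) call_user_func($guess, $page['text'])));
        // Keep only two-letter codes, since the column is CHAR(2).
        if (preg_match('/^[a-z]{2}$/', $code)) {
            $updates[$page['spider_id']] = $code;
        }
    }
    return $updates; // spider_id => language code, ready to be written back
}

// Stand-in guesser for the demo; the real script would ask Languid here.
$fake_guess = function ($text) {
    return strpos($text, 'bonjour') !== false ? 'fr' : 'en';
};

$pages = array(
    array('spider_id' => 1, 'language' => '',   'text' => 'bonjour tout le monde'),
    array('spider_id' => 2, 'language' => 'ja', 'text' => 'already tagged'),
    array('spider_id' => 3, 'language' => '',   'text' => 'hello world'),
);
print_r(guess_missing_languages($pages, $fake_guess));
```

Page 2 is left alone because it already has a language, so running the pass again after a respider only touches new pages.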

This will take a while, and unfortunately the results are not always accurate, so you may want to set languages manually instead.
First, open set_language.php in a text editor and list in the $lang_to_set array only the languages you will index. Example:

$lang_to_set = array("en", "ja", "fr"); // English, Japanese & French
Upload the file via FTP in ASCII mode to [PHPDIG_DIR]/admin and open it in your browser.
You can set a language for a whole site or just for subdirectories.
Each link is listed with its language value, so you can check that everything is OK.
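
The "whole site or just subdirectories" idea comes down to a prefix match on each page's path. Below is a minimal sketch of that logic on plain arrays. It is not the code from set_language.php, and the row keys (path, file, language) and the function name are assumptions for the example.

```php
<?php
// Sketch only: each row mimics a spider-table entry; the keys are assumptions.
// Returns the rows with 'language' set to $lang for every page under $dir.
// Pass '' as $dir to tag the whole site.
function set_language_for_path(array $rows, $dir, $lang)
{
    foreach ($rows as $i => $row) {
        // A page belongs to $dir when its path starts with that directory.
        if ($dir === '' || strpos($row['path'], $dir) === 0) {
            $rows[$i]['language'] = $lang;
        }
    }
    return $rows;
}

$rows = array(
    array('path' => '',         'file' => 'index.html', 'language' => ''),
    array('path' => 'fr/',      'file' => 'index.html', 'language' => ''),
    array('path' => 'fr/news/', 'file' => 'a.html',     'language' => ''),
);
$rows = set_language_for_path($rows, '', 'en');    // whole site first
$rows = set_language_for_path($rows, 'fr/', 'fr'); // then override one subdirectory
foreach ($rows as $row) {
    echo $row['path'], $row['file'], ' => ', $row['language'], "\n";
}
```

Setting the site-wide language first and then overriding subdirectories keeps the order of operations easy to follow.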

Please keep in mind that I am far from being an expert scripter. Many people on this forum could have written code a thousand times simpler and neater.

Don’t hesitate to post bug reports, improvements...

Next step: build the pull-down menu for selecting languages and change search_functions.php to support this feature.
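
As a starting point for the menu, here is a rough sketch of a function that builds the <select> element. The field name "lang", the "Any language" entry and the function name are assumptions for the example; search_functions.php would still need a matching filter on the language column, which is not shown here.

```php
<?php
// Sketch only: builds a <select> for the search form.
// The field name "lang" and the label list are assumptions for this example.
function language_select(array $langs, $selected = '')
{
    $html = '<select name="lang">' . "\n";
    $html .= '<option value="">Any language</option>' . "\n";
    foreach ($langs as $code => $label) {
        // Mark the currently chosen language so the form keeps its state.
        $attr = ((string) $code === $selected) ? ' selected="selected"' : '';
        $html .= '<option value="' . htmlspecialchars((string) $code) . '"' . $attr . '>'
               . htmlspecialchars($label) . '</option>' . "\n";
    }
    return $html . '</select>';
}

echo language_select(array('en' => 'English', 'ja' => 'Japanese', 'fr' => 'French'), 'fr');
```

The keys of the array passed in would normally be the same codes you listed in $lang_to_set.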