Old 02-22-2005, 06:19 AM   #1
Edomondo
Orange Mole
 
Join Date: Jan 2004
Location: In outer space
Posts: 37
Auto language guesser

First, don't expect too much from this post: I've only done half of the job (the easiest half).

Now that PhpDig can spider pages in multiple encodings, it can also spider multilingual sites, and you will probably want to identify the language of each page.

There are several tools for guessing languages, and fortunately a few of them are free!
I came across Languid, a statistical language identifier by Maciej Ceglowski (http://languid.cantbedone.org/).
It's a great tool that can guess 72 languages with high accuracy.
It was originally written in Perl (the source can be found at http://search.cpan.org/~mceglows/).
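
To give an idea of what "statistical" guessing can look like, here is a rough sketch in Python. It is not Languid's actual algorithm and not the code in my attachment. It is just one common approach, assuming we compare character-trigram counts of a text against a small profile per language using cosine similarity. The sample sentences and the names trigram_counts, cosine and guess_language are made up for the illustration; a real identifier would be trained on far more text.

```python
import math
from collections import Counter

# Tiny hypothetical training samples; real profiles need much more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "fr": "le renard brun saute par dessus le chien paresseux et puis le chien dort au soleil",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
}


def trigram_counts(text):
    """Count character trigrams after lowercasing and collapsing whitespace."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# One trigram profile per language, built once from the samples above.
PROFILES = {lang: trigram_counts(text) for lang, text in SAMPLES.items()}


def guess_language(text):
    """Return the code of the language whose profile is most similar to the text."""
    doc = trigram_counts(text)
    return max(PROFILES, key=lambda lang: cosine(doc, PROFILES[lang]))


print(guess_language("the dog sleeps in the sun"))
```

With profiles this small, the guess is only reliable for text that resembles the samples. Very short or empty input has nothing to match on and falls back to the first language in the table.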

I wrote a small function that guesses the language of a text using Languid's XML API.

Now I need help integrating it into PhpDig. It will slow the spidering process down a little, but I think it would be worth it. We would need to add this function to robot_functions.php and create a new field in the spider table to store the language ID.
Alternatively, maybe someone could write a PHP port of Maciej's original Perl script.
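
To make the database side concrete, here is a minimal sketch of adding a language field and filling it in. It is written in Python with an in-memory SQLite database only so that it is self-contained. PhpDig itself is PHP on MySQL, so treat this as an outline of the steps and not as a patch. The reduced spider table, the column name lang, and the functions guess_language_stub and tag_languages are all assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the spider table, reduced to two columns for the example.
conn.execute("CREATE TABLE spider (spider_id INTEGER PRIMARY KEY, first_words TEXT)")
conn.execute("INSERT INTO spider (first_words) VALUES ('the dog sleeps in the sun')")

# The new field that stores the language ID (hypothetical column name).
conn.execute("ALTER TABLE spider ADD COLUMN lang TEXT")


def guess_language_stub(text):
    """Placeholder for the real guesser, e.g. a call to the Languid XML API."""
    return "en" if " the " in f" {text} " else "unknown"


def tag_languages(conn):
    """Fill in lang for every row that has none yet; return how many rows were tagged."""
    rows = conn.execute(
        "SELECT spider_id, first_words FROM spider WHERE lang IS NULL"
    ).fetchall()
    for spider_id, text in rows:
        conn.execute(
            "UPDATE spider SET lang = ? WHERE spider_id = ?",
            (guess_language_stub(text), spider_id),
        )
    conn.commit()
    return len(rows)
```

Calling tag_languages(conn) once tags the untagged rows, and a second call finds nothing left to do. Inside the spider itself the guess would more likely happen at the moment a page is indexed, which is where the slowdown mentioned above comes from.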

Is anyone ready to give me a hand with this?
Attached Files
File Type: txt guess_language.php.txt (5.5 KB, 37 views)