This is a patch for PhpDig v.1.8.8 RC1 that keeps duplicates and quasi-duplicates (the same keyword differing only in case or in leading/trailing punctuation) out of the engine table.
In robot_functions.php find:
Code:
$key = mb_ereg_replace("^([\x00-\x1f]|[\x21-\x2f]|[\x3a-\x40]|[\x5b-\x60]|[\x7b-\x7f])+","",$key); //off front only
$key = mb_ereg_replace("([\x00-\x1f]|[\x21-\x2f]|[\x3a-\x40]|[\x5b-\x60]|[\x7b-\x7f])+$","",$key); //off back only
And delete these two lines.
Also, in robot_functions.php find:
Code:
for ($token = strtok($text2, $separators); $token !== FALSE; $token = strtok($separators)) {
if (!isset($nbre_mots[$token]))
{ $nbre_mots[$token] = 1; }
else
{ $nbre_mots[$token]++; }
$total++;
}
And replace with:
Code:
for ($token = strtok($text2, $separators); $token !== FALSE; $token = strtok($separators)) {
    $token = mb_ereg_replace("^([\x00-\x1f]|[\x21-\x2f]|[\x3a-\x40]|[\x5b-\x60]|[\x7b-\x7f])+","",$token); //off front only
    $token = mb_ereg_replace("([\x00-\x1f]|[\x21-\x2f]|[\x3a-\x40]|[\x5b-\x60]|[\x7b-\x7f])+$","",$token); //off back only
    $token = mb_strtolower(trim($token));
    if (mb_strlen($token) > 0) {
        if (!isset($nbre_mots[$token]))
            { $nbre_mots[$token] = 1; }
        else
            { $nbre_mots[$token]++; }
        $total++;
    }
}
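To see why this change prevents duplicates, here is a minimal standalone sketch of the same normalization. The function names normalize_token and count_tokens, the separator list and the sample text are made up for illustration and are not part of PhpDig. It assumes the mbstring extension with its regex support is enabled. The character ranges are the ones from the patch, merged into a single class and written in single quotes so that the regex engine reads the \x escapes.

```php
<?php
// Hypothetical helper (not a PhpDig function): applies the same
// trimming and lowercasing steps as the patched loop.
function normalize_token(string $token): string
{
    // Same ASCII control/punctuation ranges as in the patch, in one class.
    $junk = '[\x00-\x1f\x21-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]+';
    $token = mb_ereg_replace('^' . $junk, '', $token); // off front only
    $token = mb_ereg_replace($junk . '$', '', $token); // off back only
    return mb_strtolower(trim($token));
}

// Hypothetical stand-in for the indexing loop: counts normalized tokens.
function count_tokens(string $text, string $separators = " \n\r\t"): array
{
    $counts = [];
    for ($token = strtok($text, $separators); $token !== false; $token = strtok($separators)) {
        $token = normalize_token($token);
        if (mb_strlen($token) > 0) { // skip tokens that were only punctuation
            $counts[$token] = ($counts[$token] ?? 0) + 1;
        }
    }
    return $counts;
}

// "Spider,", "spider!" and "(SPIDER)" all collapse to the key "spider";
// "---" becomes empty and is skipped.
$counts = count_tokens('Spider, spider! (SPIDER) --- web');
print_r($counts); // spider => 3, web => 1
```

Because the cleanup now happens before the token is used as an array key, variants of one word are counted together instead of producing separate entries.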
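Now run the following queries to merge the duplicate rows already in the engine table (the weights for each spider_id/key_id pair are summed), adding your table prefix to engine and engine2 if needed: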
Code:
CREATE TABLE engine2 (
spider_id mediumint(9) DEFAULT '0' NOT NULL,
key_id mediumint(9) DEFAULT '0' NOT NULL,
weight smallint(4) DEFAULT '0' NOT NULL,
KEY key_id (key_id)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE utf8_general_ci;
INSERT INTO engine2 SELECT spider_id,key_id,sum(weight) AS weight FROM engine GROUP BY spider_id,key_id;
DELETE FROM engine;
INSERT INTO engine SELECT spider_id,key_id,weight FROM engine2;
DROP TABLE engine2;
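The effect of the GROUP BY step in the queries above can be illustrated in memory with a short sketch. The function merge_engine_rows and the sample rows are hypothetical and only mimic the engine table columns; this does not touch the database.

```php
<?php
// Hypothetical illustration: merge rows sharing (spider_id, key_id)
// and sum their weights, as the INSERT ... GROUP BY query does.
function merge_engine_rows(array $rows): array
{
    $merged = [];
    foreach ($rows as $row) {
        $k = $row['spider_id'] . ':' . $row['key_id'];
        if (!isset($merged[$k])) {
            $merged[$k] = $row;                       // first row for this pair
        } else {
            $merged[$k]['weight'] += $row['weight'];  // duplicate: add its weight
        }
    }
    return array_values($merged);
}

// Made-up sample data: the first two rows are duplicates of one pair.
$rows = [
    ['spider_id' => 1, 'key_id' => 10, 'weight' => 3],
    ['spider_id' => 1, 'key_id' => 10, 'weight' => 2],
    ['spider_id' => 2, 'key_id' => 10, 'weight' => 4],
];
$merged = merge_engine_rows($rows);
print_r($merged); // two rows remain: weights 5 and 4
```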
If you downloaded PhpDig v.1.8.8 RC1 after the date of this post, the code changes are already included in the package.