Duplicate Documents Problem...

11-25-2003, 11:09 AM
For some reason, when I run the spider it is kicking back duplicate documents that are not in fact duplicates.

It indexes this:


But then kicks this back as a duplicate:


The first is actually the top-level intro page leading into the second page. Both of them should be indexed because they contain different content. Is this due to a problem with the query string somehow? Exactly how does PhpDig determine what constitutes a duplicate?

11-25-2003, 12:19 PM
Hi. It is the robot_functions.php file that determines whether a page is a duplicate, specifically the phpdigTestDouble function.

In this function, it is the following query that determines a duplicate:

$query_double = "SELECT spider_id FROM ".PHPDIG_DB_PREFIX."spider WHERE site_id='$site_id' AND md5 = '$md5'";

As you are crawling a single site, $site_id is the same for both pages, so it is the $md5 value that is producing the duplicate results. The $md5 variable is built as follows:

$md5 = md5($titre_resume.$page_desc['content'].$text[$max_chunk]).'_'.$tempfilesize;

Briefly, the components that go into $md5 are as follows:

$titre_resume // page title
$page_desc['content'] // meta tag description
$text[$max_chunk] // last chunk of page text
$tempfilesize // temp file size

As both pages are creating the same $md5 variable, they are seen as duplicates.
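To make that concrete, here is a minimal sketch that mirrors the $md5 expression quoted above. The function name sketchFingerprint and all of the sample values are made up for illustration; this is not PhpDig's own code, only the same concatenation wrapped in a helper.

```php
<?php
// Hypothetical helper (not part of PhpDig): mirrors the quoted $md5 expression.
function sketchFingerprint(string $title, string $metaDescription, string $lastChunk, int $tempFileSize): string
{
    return md5($title . $metaDescription . $lastChunk) . '_' . $tempFileSize;
}

// Made-up values: a template that gives every page the same title,
// meta description, and trailing text.
$intro  = sketchFingerprint('My Site', 'Site description', 'Footer text', 3300);
$detail = sketchFingerprint('My Site', 'Site description', 'Footer text', 3300);
$other  = sketchFingerprint('My Site', 'Site description', 'Footer text', 3481);

var_dump($intro === $detail); // all four inputs match, so the pages look like duplicates
var_dump($intro === $other);  // a different temp file size alone changes the fingerprint
```

Note that the URL never enters the fingerprint, so two different addresses can collide whenever these four inputs happen to match.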

Making the page title dynamic (so that it depends on the query string) or changing CHUNK_SIZE in the config file would be a couple of ways to avoid the duplicates.
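As a rough illustration of why CHUNK_SIZE matters, the sketch below assumes the page text is simply cut into fixed-size pieces and the last piece is used. That is a simplification I am assuming for the example; PhpDig's real chunking may work differently, and sketchLastChunk and the sample pages are hypothetical.

```php
<?php
// Hypothetical simplification: cut the text into fixed-size chunks, return the last one.
// PhpDig's actual chunking may differ (for example, it may split on word boundaries).
function sketchLastChunk(string $text, int $chunkSize): string
{
    $chunks = str_split($text, $chunkSize);
    return $chunks[count($chunks) - 1];
}

// Two made-up pages with different bodies but the same trailing footer.
$footer = ' Copyright Example Site';
$pageA  = 'Intro page: welcome.' . $footer;
$pageB  = 'Detail page: prices.' . $footer;

// Small chunk: the last chunk falls entirely inside the shared footer.
var_dump(sketchLastChunk($pageA, 10) === sketchLastChunk($pageB, 10));

// Large chunk: the last chunk now reaches back into the differing body text.
var_dump(sketchLastChunk($pageA, 100) === sketchLastChunk($pageB, 100));
```

Under this assumption, a larger chunk is more likely to include text that actually differs between the two pages.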

11-25-2003, 12:49 PM
Hmmm. I'll have to look into tempfilesize. Could there be some type of bug in there?

document 1 = 3.22KB

document 2 = 3.4KB

I'm thinking the temp file size should be the same as the actual file size, no? And if so, I would think the different file sizes would prevent them from being tagged as dupes.

Thanks for your other suggestion regarding titles. Unfortunately I am building this as a plug-in component for Mambo OS, and their titles are not dynamic out of the box. So, I need to come up with a better solution that works with the stock install of Mambo. Any more info would be appreciated... maybe I can modify the function so that it bases duplicates on the actual URL.
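One way the URL idea might look is sketched below. This is only a guess at an approach, not a patch for phpdigTestDouble: the function name sketchUrlAwareFingerprint, its parameters, and the example.com URLs are all hypothetical, and the variables actually available inside the real function may differ.

```php
<?php
// Hypothetical variant (not PhpDig code): fold the URL's path and query string
// into the hash so that two distinct URLs never share a fingerprint.
function sketchUrlAwareFingerprint(string $url, string $title, string $metaDescription, string $lastChunk, int $tempFileSize): string
{
    $parts = parse_url($url);
    if ($parts === false) {
        $parts = []; // seriously malformed URL: fall back to empty path and query
    }
    $path  = $parts['path'] ?? '';
    $query = $parts['query'] ?? '';

    return md5($path . '?' . $query . $title . $metaDescription . $lastChunk) . '_' . $tempFileSize;
}

// Made-up URLs with otherwise identical inputs.
$a = sketchUrlAwareFingerprint('http://example.com/index.php?task=section&id=1', 'My Site', 'Desc', 'Footer', 3300);
$b = sketchUrlAwareFingerprint('http://example.com/index.php?task=view&id=5', 'My Site', 'Desc', 'Footer', 3300);

var_dump($a === $b); // the query strings differ, so the fingerprints differ
```

The trade-off is that genuinely identical pages reachable under several URLs (session IDs in the query string, for instance) would no longer be caught as duplicates.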

11-25-2003, 01:16 PM
Hi. The $tempfilesize variable is created in the phpdigTempFile function in the robot_functions.php file and is set to the filesize of the temporary file. Do those two pages still show as duplicates if you increase the CHUNK_SIZE or add some amount of random text to the end of one of the pages?