Duplicate Documents Problem...
vonbrocklin
11-25-2003, 11:09 AM
For some reason, when I run the spider it is kicking back duplicate documents that are not in fact duplicates.
It indexes this:
mambo104/index.php?option=com_weblinks&Itemid=4
But then kicks this back as a duplicate:
mambo104/index.php?option=com_weblinks&Itemid=1&catid=2
The first is actually the top level intro page leading into the second page. Both of them should be indexed because they contain different content. Is this due to a problem with the querystring somehow? Just exactly how does phpdig determine what constitutes a duplicate?
Charter
11-25-2003, 12:19 PM
Hi. It is the robot_functions.php file that determines whether a page is a duplicate, specifically the phpdigTestDouble function.
In this function, it is the following query that determines a duplicate:
$query_double = "SELECT spider_id FROM ".PHPDIG_DB_PREFIX."spider WHERE site_id='$site_id' AND md5 = '$md5'";
As you are crawling the same site, it is the $md5 variable that is producing the duplicate results. The $md5 variable is as follows:
$md5 = md5($titre_resume.$page_desc['content'].$text[$max_chunk]).'_'.$tempfilesize;
Briefly, the components of $md5 are as follows:
$titre_resume // page title
$page_desc['content'] // meta tag description
$text[$max_chunk] // last chunk of page text
$tempfilesize // temp file size
As both pages are creating the same $md5 variable, they are seen as duplicates.
Making the page title dynamic, depending on the query string, or changing CHUNK_SIZE in the config file (which changes what ends up in the last chunk of text) would be a couple of ways to avoid the duplicates.
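To illustrate why two different pages can collide, here is a minimal sketch of how such a fingerprint behaves. Only the md5(...) formula is taken from the line quoted above. The function name page_fingerprint and all sample values are made up for illustration and are not part of phpDig.

```php
<?php
// Hypothetical helper, NOT part of phpDig: it only mirrors the $md5 formula quoted above.
function page_fingerprint(string $title, string $description, string $lastChunk, int $tempFileSize): string
{
    return md5($title . $description . $lastChunk) . '_' . $tempFileSize;
}

// Made-up sample values: same title, meta description, last chunk and temp file size.
$a = page_fingerprint('Mambo - Web Links', 'Site description', 'footer text', 1024);
$b = page_fingerprint('Mambo - Web Links', 'Site description', 'footer text', 1024);

// Same as above, except the last chunk of text differs.
$c = page_fingerprint('Mambo - Web Links', 'Site description', 'category listing footer text', 1024);

var_dump($a === $b); // bool(true): all four inputs match, so the pages look like duplicates
var_dump($a === $c); // bool(false): a different last chunk yields a different fingerprint
```

The URL is not one of the inputs, so two pages only need to agree on these four values to be flagged, whatever their query strings are.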
vonbrocklin
11-25-2003, 12:49 PM
Hmmm. I'll have to look into tempfilesize. Could there be some type of bug in there?
document 1 = 3.22KB
document 2 = 3.4KB
I'm thinking the temp file size should be the same as the actual file size, no? And if so, I would think the different file sizes would prevent them from being tagged as dupes.
Thanks for your other suggestion regarding titles. Unfortunately I am building this as a plug-in component for Mambo OS, and their titles are not dynamic out of the box. So, I need to come up with a better solution that works with the stock install of Mambo. Any more info would be appreciated... maybe I can modify the function so that it bases duplicates on the actual URL.
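A rough sketch of what basing the check on the URL could look like, assuming the path and query string are simply folded into the hashed string. The function name page_fingerprint_with_url is hypothetical and this is not how phpDig's phpdigTestDouble is written. It only illustrates the idea.

```php
<?php
// Hypothetical helper, NOT part of phpDig: adds the URL's path and query string
// to the values that are hashed, so pages with different query strings never collide.
function page_fingerprint_with_url(string $url, string $title, string $description, string $lastChunk, int $tempFileSize): string
{
    // parse_url() returns false on seriously malformed URLs; fall back to an empty array.
    $parts = parse_url($url) ?: [];
    $path  = $parts['path'] ?? '';
    $query = $parts['query'] ?? '';

    return md5($path . '?' . $query . $title . $description . $lastChunk) . '_' . $tempFileSize;
}

// The two URLs from the first post, with otherwise identical (made-up) page values.
$intro = page_fingerprint_with_url('mambo104/index.php?option=com_weblinks&Itemid=4', 'Mambo', 'desc', 'footer', 1024);
$cat   = page_fingerprint_with_url('mambo104/index.php?option=com_weblinks&Itemid=1&catid=2', 'Mambo', 'desc', 'footer', 1024);

var_dump($intro === $cat); // bool(false)
```

The trade-off is that identical content reachable under several URLs (for example, with a session ID in the query string) would no longer be caught as a duplicate.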
Charter
11-25-2003, 01:16 PM
Hi. The $tempfilesize variable is created in the phpdigTempFile function in the robot_functions.php file and is set to the filesize of the temporary file. Do those two pages still show as duplicates if you increase the CHUNK_SIZE or add some amount of random text to the end of one of the pages?