PhpDig.net


vonbrocklin 11-25-2003 11:09 AM

Duplicate Documents Problem...
 
For some reason, when I run the spider it is kicking back duplicate documents that are not in fact duplicates.

It indexes this:
Code:

mambo104/index.php?option=com_weblinks&Itemid=4
But then kicks this back as a duplicate:

Code:

mambo104/index.php?option=com_weblinks&Itemid=1&catid=2
The first is actually the top level intro page leading into the second page. Both of them should be indexed because they contain different content. Is this due to a problem with the querystring somehow? Just exactly how does phpdig determine what constitutes a duplicate?

Charter 11-25-2003 12:19 PM

Hi. It is the robot_functions.php file that determines whether a page is a duplicate, specifically the phpdigTestDouble function.

In this function, it is the following query that determines a duplicate:
PHP Code:

$query_double = "SELECT spider_id FROM ".PHPDIG_DB_PREFIX."spider WHERE site_id='$site_id' AND md5 = '$md5'";

As you are crawling the same site, it is the $md5 variable that is producing the duplicate results. The $md5 variable is as follows:
PHP Code:

$md5 = md5($titre_resume.$page_desc['content'].$text[$max_chunk]).'_'.$tempfilesize;

Briefly, the variables in the $md5 variable are as follows:
  1. $titre_resume // page title
  2. $page_desc['content'] // meta tag description
  3. $text[$max_chunk] // last chunk of page text
  4. $tempfilesize // temp file size
As both pages are creating the same $md5 variable, they are seen as duplicates.
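
As a rough illustration of why that happens, here is a minimal sketch that mirrors only the $md5 expression quoted above. The function name and the sample values are hypothetical and are not taken from PhpDig or from the Mambo pages in question.

```php
<?php
// Illustrative only: mirrors the $md5 expression quoted above, not PhpDig's full code.
// The function name and all sample values are hypothetical.
function sketchFingerprint(string $title, string $description, string $lastChunk, int $tempFileSize): string
{
    return md5($title . $description . $lastChunk) . '_' . $tempFileSize;
}

// Two different URLs whose title, meta description, last text chunk,
// and temp file size all happen to match (values invented for the example).
$introPage    = sketchFingerprint('Mambo Site', 'Site description', 'Main menu Footer', 1200);
$categoryPage = sketchFingerprint('Mambo Site', 'Site description', 'Main menu Footer', 1200);

// Same four inputs, so the fingerprints are equal and the second page
// would match the duplicate query.
var_dump($introPage === $categoryPage);

// Changing any one input, for example the title, breaks the match.
$retitled = sketchFingerprint('Mambo Site - Web Links', 'Site description', 'Main menu Footer', 1200);
var_dump($introPage === $retitled);
```

Note that the URL and query string are not among the inputs, so two distinct URLs can share a fingerprint.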

Making the page title dynamic (dependent on the query string) or changing CHUNK_SIZE in the config file would be a couple of ways to avoid the duplicates.
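
To show why the chunk size can matter, here is a hedged sketch of one possible chunking scheme: lines are appended to the current chunk until it exceeds the limit. This scheme is an assumption for illustration only and may not match how PhpDig actually splits text; the function names and page contents are hypothetical.

```php
<?php
// Assumed chunking scheme (NOT necessarily PhpDig's): append lines to the
// current chunk and start a new chunk once it grows past $chunkSize.
function sketchChunks(array $lines, int $chunkSize): array
{
    $chunks = [''];
    $n = 0;
    foreach ($lines as $line) {
        $chunks[$n] .= $line . "\n";
        if (strlen($chunks[$n]) > $chunkSize) {
            $n++;
            $chunks[$n] = '';
        }
    }
    // Drop a trailing empty chunk left over after the last line.
    if ($n > 0 && $chunks[$n] === '') {
        array_pop($chunks);
    }
    return $chunks;
}

// Hypothetical helper: the last element of the chunk list.
function sketchLastChunk(array $chunks): string
{
    return $chunks[count($chunks) - 1];
}

// Invented page texts: different body, identical menu and footer.
$pageA = ['Web Links intro text', 'Main menu', 'Footer'];
$pageB = ['Category two listing', 'Main menu', 'Footer'];

// Small chunks: the last chunk holds only the shared menu and footer.
var_dump(sketchLastChunk(sketchChunks($pageA, 10)) === sketchLastChunk(sketchChunks($pageB, 10)));

// Large chunks: the whole page is one chunk, so the differing body is included.
var_dump(sketchLastChunk(sketchChunks($pageA, 1000)) === sketchLastChunk(sketchChunks($pageB, 1000)));
```

Under this assumed scheme, a larger chunk size makes it more likely that the last chunk contains page-specific text rather than only shared trailing content.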

vonbrocklin 11-25-2003 12:49 PM

Hmmm. I'll have to look into tempfilesize. Could there be some type of bug in there?

document 1 = 3.22KB

document 2 = 3.4KB

I'm thinking the temp file size should be same as the actual file size, no? And if so I would think the different file sizes would prevent them from being tagged as dupes.

Thanks for your other suggestion regarding titles. Unfortunately I am building this as a plug-in component for Mambo OS, and their titles are not dynamic out of the box. So, I need to come up with a better solution that works with the stock install of Mambo. Any more info would be appreciated... maybe I can modify the function so that it bases duplicates on the actual URL.
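
One way the URL-based idea might look is sketched below: the path and query string are mixed into the hash alongside the existing inputs. This is a hypothetical variant, not a tested patch to phpdigTestDouble; the function name, parameter order, and sample content values are assumptions, and only the two URLs come from this thread.

```php
<?php
// Hypothetical variant of the fingerprint that also hashes the URL's
// path and query string. Not PhpDig code; names are invented.
function sketchUrlAwareFingerprint(
    string $url,
    string $title,
    string $description,
    string $lastChunk,
    int $tempFileSize
): string {
    // parse_url() returns false on seriously malformed input; fall back to an empty array.
    $parts = parse_url($url) ?: [];
    $path  = $parts['path'] ?? '';
    $query = $parts['query'] ?? '';

    return md5($path . '?' . $query . $title . $description . $lastChunk) . '_' . $tempFileSize;
}

// The two URLs from the original post, with identical (invented) content inputs.
$first = sketchUrlAwareFingerprint(
    'mambo104/index.php?option=com_weblinks&Itemid=4',
    'Mambo Site', 'Site description', 'Main menu Footer', 1200
);
$second = sketchUrlAwareFingerprint(
    'mambo104/index.php?option=com_weblinks&Itemid=1&catid=2',
    'Mambo Site', 'Site description', 'Main menu Footer', 1200
);

// The query strings differ, so the fingerprints differ even though
// the other four inputs are the same.
var_dump($first === $second);
```

The trade-off is that truly identical pages reachable under different query strings (session IDs, for example) would no longer be caught as duplicates.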

Charter 11-25-2003 01:16 PM

Hi. The $tempfilesize variable is created in the phpdigTempFile function in the robot_functions.php file and is set to the filesize of the temporary file. Do those two pages still show as duplicates if you increase the CHUNK_SIZE or add some amount of random text to the end of one of the pages?
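
To narrow down which inputs actually match, one option is to record the four values for each page and compare them. The helper below is a hypothetical debugging sketch, not part of PhpDig; the array keys and sample values are invented for illustration.

```php
<?php
// Hypothetical debugging helper: lists which fingerprint inputs differ
// between two pages. An empty result means all four inputs match.
function sketchDifferingInputs(array $pageA, array $pageB): array
{
    $differing = [];
    foreach (['title', 'description', 'last_chunk', 'temp_file_size'] as $key) {
        if ($pageA[$key] !== $pageB[$key]) {
            $differing[] = $key;
        }
    }
    return $differing;
}

// Invented values standing in for what the spider computed for each page.
$intro = [
    'title'          => 'Mambo Site',
    'description'    => 'Site description',
    'last_chunk'     => 'Main menu Footer',
    'temp_file_size' => 1200,
];
$category = $intro; // every input identical: would be flagged as a duplicate

print_r(sketchDifferingInputs($intro, $category));

// If the temp file sizes differed, the fingerprints could not match.
$category['temp_file_size'] = 1350;
print_r(sketchDifferingInputs($intro, $category));
```

Since the size is appended to the hash, pages flagged as duplicates must have produced the same temp file size, whatever their sizes on disk.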


All times are GMT -8.
