PhpDig.net

Old 11-25-2003, 11:09 AM   #1
vonbrocklin
Green Mole
 
Join Date: Sep 2003
Posts: 5
Duplicate Documents Problem...

For some reason, when I run the spider it is kicking back duplicate documents that are not in fact duplicates.

It indexes this:
Code:
mambo104/index.php?option=com_weblinks&Itemid=4
But then kicks this back as a duplicate:

Code:
mambo104/index.php?option=com_weblinks&Itemid=1&catid=2
The first is actually the top-level intro page leading into the second page. Both of them should be indexed because they contain different content. Is this due to a problem with the query string somehow? Exactly how does PhpDig determine what constitutes a duplicate?
Old 11-25-2003, 12:19 PM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. It is the robot_functions.php file that determines whether a page is a duplicate, specifically the phpdigTestDouble function.

In this function, it is the following query that determines a duplicate:
PHP Code:
$query_double = "SELECT spider_id FROM ".PHPDIG_DB_PREFIX."spider WHERE site_id='$site_id' AND md5 = '$md5'";
As both pages are on the same site (same $site_id), it is the $md5 value that is producing the duplicate match. The $md5 variable is built as follows:
PHP Code:
$md5 = md5($titre_resume.$page_desc['content'].$text[$max_chunk]).'_'.$tempfilesize;
Briefly, the components that make up $md5 are as follows:
  1. $titre_resume // page title
  2. $page_desc['content'] // meta tag description
  3. $text[$max_chunk] // last chunk of page text
  4. $tempfilesize // temp file size
As both pages are creating the same $md5 variable, they are seen as duplicates.
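
To make the collision concrete, here is a minimal sketch of the same expression wrapped in a helper. The function name sketch_fingerprint and all sample values are made up for illustration and are not part of PhpDig; only the md5(...).'_'.size shape comes from the line quoted above.

```php
<?php
// Hypothetical helper that mirrors the fingerprint expression quoted above.
// The name and the sample values are illustrative, not PhpDig code.
function sketch_fingerprint($title, $description, $lastChunk, $tempFileSize)
{
    return md5($title . $description . $lastChunk) . '_' . $tempFileSize;
}

// Two different URLs whose title, meta description, last text chunk and
// temp file size all happen to match (made-up values).
$pageA = sketch_fingerprint('Mambo Site', '', 'footer text', 3300);
$pageB = sketch_fingerprint('Mambo Site', '', 'footer text', 3300);

// Same inputs except for the temp file size.
$pageC = sketch_fingerprint('Mambo Site', '', 'footer text', 3480);

$collides  = ($pageA === $pageB); // true: the second page would be flagged
$sizeSaves = ($pageA !== $pageC); // true: a different size changes the key
```

Under this reading, a page is flagged only when all four inputs match, so a difference in any single one of them should be enough to keep both pages.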

Making the page title dynamic, depending on the query string, or changing CHUNK_SIZE in the config file would be a couple ways to avoid the duplicates.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 11-25-2003, 12:49 PM   #3
vonbrocklin
Green Mole
 
Join Date: Sep 2003
Posts: 5
Hmmm. I'll have to look into tempfilesize. Could there be some type of bug in there?

document 1 = 3.22KB

document 2 = 3.4KB

I'm thinking the temp file size should be the same as the actual file size, no? And if so, I would think the different file sizes would prevent them from being tagged as dupes.

Thanks for your other suggestion regarding titles. Unfortunately I am building this as a plug-in component for Mambo OS, and their titles are not dynamic out of the box. So, I need to come up with a better solution that works with the stock install of Mambo. Any more info would be appreciated... maybe I can modify the function so that it bases duplicates on the actual URL.
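
A rough sketch of what the URL-based idea might look like, assuming the URL is simply folded into the hashed string. Both function names are hypothetical and nothing here is taken from robot_functions.php; parse_url, parse_str, ksort and http_build_query are standard PHP functions.

```php
<?php
// Hypothetical sketch: derive a key from the URL so that pages with
// different query strings never share a fingerprint. Not PhpDig code.
function sketch_url_key($url)
{
    $parts = parse_url($url);
    $path  = isset($parts['path']) ? $parts['path'] : '';
    $args  = array();
    if (isset($parts['query'])) {
        parse_str($parts['query'], $args);
        ksort($args); // parameter order should not change the key
    }
    return md5($path . '?' . http_build_query($args));
}

// Same shape as the existing fingerprint, with the URL key mixed in.
function sketch_fingerprint_with_url($url, $title, $description, $lastChunk, $tempFileSize)
{
    return md5(sketch_url_key($url) . $title . $description . $lastChunk) . '_' . $tempFileSize;
}

$first  = sketch_url_key('mambo104/index.php?option=com_weblinks&Itemid=4');
$second = sketch_url_key('mambo104/index.php?option=com_weblinks&Itemid=1&catid=2');
```

The trade-off is that identical content reachable under two different URLs would then be indexed twice, which is the case the duplicate check seems meant to catch.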
Old 11-25-2003, 01:16 PM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. The $tempfilesize variable is created in the phpdigTempFile function in the robot_functions.php file and is set to the file size of the temporary file. Do those two pages still show as duplicates if you increase the CHUNK_SIZE or add some amount of random text to the end of one of the pages?
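
As a rough illustration of why those two experiments could matter, the sketch below assumes the page text is cut into fixed-size pieces and only the last piece feeds the fingerprint. That chunking model, the helper name sketch_last_chunk and the sample pages are all assumptions; PhpDig's actual chunking may work differently.

```php
<?php
// Assumption: text is split into fixed-size pieces and only the last piece
// is hashed. This is an illustrative model, not PhpDig's real chunking.
function sketch_last_chunk($text, $chunkSize)
{
    $chunks = str_split($text, $chunkSize);
    return $chunks[count($chunks) - 1];
}

// Two made-up pages: different 30-character bodies, same 10-character footer.
$footer = str_pad(' | footer', 10, '.');
$pageA  = str_pad('Intro page', 30, '.') . $footer;
$pageB  = str_pad('Category page', 30, '.') . $footer;

// Small chunks: the last chunk is just the shared footer.
$smallSame = (sketch_last_chunk($pageA, 10) === sketch_last_chunk($pageB, 10));

// Chunk larger than the page: the last chunk is the whole text.
$largeDiffers = (sketch_last_chunk($pageA, 100) !== sketch_last_chunk($pageB, 100));

// Extra text appended to one page changes its last chunk as well.
$tailDiffers = (sketch_last_chunk($pageB . ' x7q', 10) !== sketch_last_chunk($pageA, 10));
```

In this model all three flags come out true, which would be consistent with either a larger CHUNK_SIZE or some trailing text separating the two pages.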
All times are GMT -8.