11-25-2003, 12:19 PM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Whether a page counts as a duplicate is decided in the robot_functions.php file, specifically in the phpdigTestDouble function.

In this function, it is the following query that determines a duplicate:
PHP Code:
$query_double = "SELECT spider_id FROM ".PHPDIG_DB_PREFIX."spider WHERE site_id='$site_id' AND md5 = '$md5'";
As you are crawling the same site, the site_id matches in both cases, so it is the $md5 variable that is producing the duplicate results. The $md5 variable is built as follows:
PHP Code:
$md5 = md5($titre_resume.$page_desc['content'].$text[$max_chunk]).'_'.$tempfilesize;
Briefly, the components that make up $md5 are as follows:
  1. $titre_resume // page title
  2. $page_desc['content'] // meta tag description
  3. $text[$max_chunk] // last chunk of page text
  4. $tempfilesize // temp file size
Because both pages produce the same $md5 value, they are seen as duplicates.
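
To illustrate how two different pages can collide, here is a minimal sketch that mirrors the $md5 expression quoted above. The function name page_fingerprint, its parameters, and the sample page data are all made up for the example and are not part of PhpDig.

```php
<?php
// Hypothetical helper that mirrors the $md5 expression quoted above.
// It is an illustration only, not PhpDig code.
function page_fingerprint(string $title, string $description, array $chunks, int $tempFileSize): string
{
    // Only the last chunk of the page text takes part in the hash.
    $lastChunk = $chunks[count($chunks) - 1];

    return md5($title . $description . $lastChunk) . '_' . $tempFileSize;
}

// Two made-up pages from the same site. The first chunk differs, but the
// title, meta description, last chunk, and temp file size are identical.
$pageA = page_fingerprint('Catalog', 'Our products', ['Item 1 details', 'Footer text'], 2048);
$pageB = page_fingerprint('Catalog', 'Our products', ['Item 2 details', 'Footer text'], 2048);

// The two fingerprints are equal, so the second page would match the
// duplicate query even though its content differs.
var_dump($pageA === $pageB);
```

Changing any one of the four inputs, for example the title, is enough to make the two values differ.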

Making the page title dynamic, depending on the query string, or changing CHUNK_SIZE in the config file would be a couple of ways to avoid the duplicates.
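
As a rough sketch of the first suggestion, the crawled pages themselves could build their title from the query string. The helper name dynamic_title and the base title 'Catalog' are assumptions for the example; adapt them to however your pages are generated.

```php
<?php
// Hypothetical helper for the crawled site's own pages (not PhpDig code):
// it appends the query-string parameters to a base title so that each
// URL variant ends up with a distinct page title.
function dynamic_title(string $baseTitle, array $query): string
{
    if (count($query) === 0) {
        return $baseTitle;
    }

    // Sort by parameter name so the same parameters always give the same title.
    ksort($query);

    return $baseTitle . ' - ' . http_build_query($query);
}

// In the page template, $_GET holds the parsed query string.
echo '<title>' . htmlspecialchars(dynamic_title('Catalog', $_GET)) . '</title>';
```

Since the page title is one of the inputs to $md5, pages that differ only by query string would then no longer produce the same value.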