Hi. Duplicate detection happens in the phpdigTestDouble function in the robot_functions.php file.
Within that function, the following query decides whether a page is a duplicate:
PHP Code:
$query_double = "SELECT spider_id FROM ".PHPDIG_DB_PREFIX."spider WHERE site_id='$site_id' AND md5 = '$md5'";
As you are crawling the same site, site_id always matches, so it is the $md5 value that is producing the duplicate results. The $md5 variable is built as follows:
PHP Code:
$md5 = md5($titre_resume.$page_desc['content'].$text[$max_chunk]).'_'.$tempfilesize;
Briefly, the components of the $md5 variable are as follows:
- $titre_resume // page title
- $page_desc['content'] // meta tag description
- $text[$max_chunk] // last chunk of page text
- $tempfilesize // temp file size
As both pages produce the same $md5 value, they are seen as duplicates.
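To illustrate, here is a minimal sketch of how two different pages can end up with the same value. The helper name sketch_fingerprint and the sample titles, descriptions, and sizes are made up for this example; only the md5(...).'_'.size expression is taken from the line quoted above.

```php
<?php
// Hypothetical helper (not part of PhpDig) that mirrors the $md5 expression
// quoted above: md5(title . description . last chunk) . '_' . file size.
function sketch_fingerprint($title, $description, $last_chunk, $file_size)
{
    return md5($title . $description . $last_chunk) . '_' . $file_size;
}

// Two made-up pages reached through different query strings. Their middle
// content may differ, but title, meta description, last text chunk, and
// temp file size happen to be identical.
$page_a = sketch_fingerprint('Product list', 'Our products', 'Footer text', 2048);
$page_b = sketch_fingerprint('Product list', 'Our products', 'Footer text', 2048);

// A third made-up page whose title differs.
$page_c = sketch_fingerprint('Product list, page 2', 'Our products', 'Footer text', 2048);

var_dump($page_a === $page_b); // bool(true): treated as a duplicate
var_dump($page_a === $page_c); // bool(false): treated as a distinct page
```

Changing any one of the four inputs is enough to change the result, which is why a different title or a different last chunk would stop the pages from matching.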
Making the page title dynamic (for example, varying it with the query string) or changing CHUNK_SIZE in the config file would be a couple of ways to avoid the duplicates.
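For the dynamic title option, a rough sketch of the idea on the crawled site's side might look like the following. The build_title helper and the parameter names are hypothetical, and how your own pages generate their title will differ; in a real page you would probably pass $_GET instead of the sample arrays.

```php
<?php
// Hypothetical helper (an assumption for this example, not PhpDig code):
// append the query-string values to a base title so that each URL variant
// gets its own page title.
function build_title($base_title, array $params)
{
    // Sort by key so the same parameters always give the same title.
    ksort($params);
    $parts = array();
    foreach ($params as $key => $value) {
        // Skip nested arrays such as ?a[]=1 to keep the sketch simple.
        if (is_scalar($value)) {
            $parts[] = $key . ' ' . $value;
        }
    }
    if (empty($parts)) {
        return $base_title;
    }
    return $base_title . ' - ' . implode(', ', $parts);
}

// Sample query strings: ?page=2&cat=shoes and no query string at all.
$title_with_query = build_title('Product list', array('page' => '2', 'cat' => 'shoes'));
$title_plain      = build_title('Product list', array());

// Escape before printing, since query-string values come from the visitor.
echo '<title>' . htmlspecialchars($title_with_query, ENT_QUOTES, 'UTF-8') . "</title>\n";
// prints: <title>Product list - cat shoes, page 2</title>
```

Since the title is one of the inputs to the $md5 value, pages with different query strings would then no longer produce the same value.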