Crawler speed improvement (although affects limit)


marco
03-23-2007, 07:14 AM
I had a problem where phpdigExplore() returned too many duplicate links. This caused the spider to check hundreds of duplicate URLs, which slowed the crawl down and made it hit the 1000-page limit quite fast.

In the end, I added the following code at the end of phpdigExplore():

// Start the session-wide list of already-seen links if it does not exist yet.
if (!isset($_SESSION["links"])) $_SESSION["links"] = array();
$resultlinks = array();
foreach ($links as $link) {
    // in_array() with strict comparison avoids the array_search() pitfall
    // where a match at index 0 is treated as "not found".
    if (!in_array($link, $_SESSION["links"], true)) {
        $_SESSION["links"][] = $link;
        $resultlinks[] = $link;
    }
}
return $resultlinks;


I don't know whether this modification is useful or whether it harms other components, but for the moment it works.
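
For comparison, here is a minimal sketch of the same idea that stores each URL (hashed with md5()) as an array key in the session, so every duplicate check is a constant-time isset() instead of a linear scan over all links seen so far. The $_SESSION["seen_links"] name is only an example, and this variant is untested against the rest of PhpDig:

// Alternative sketch (assumptions: $links is a flat array of URL strings,
// and $_SESSION["seen_links"] is a name not used elsewhere in PhpDig).
if (!isset($_SESSION["seen_links"])) {
    $_SESSION["seen_links"] = array();
}
$resultlinks = array();
foreach ($links as $link) {
    // Hashing keeps the session keys short and uniform in length.
    $key = md5($link);
    if (!isset($_SESSION["seen_links"][$key])) {
        $_SESSION["seen_links"][$key] = true;
        $resultlinks[] = $link;
    }
}
return $resultlinks;

With either version the session entry keeps growing for the whole crawl, so on very large sites it may be worth clearing it between spider runs.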