Old 03-30-2004, 11:32 PM   #1
cybercox
Green Mole
 
Join Date: Jan 2004
Location: Italy
Posts: 11
Bug when spidering subdomains

Hi charter!
I have found the following bug:

1) I spider a site (for example http://www.jobnetwork.it/foodir/foo.htm) that contains a link to a subdomain. The link must consist of the hostname only. In the example page you will find a link to http://piemonte.jobnetwork.it

2) The spider finds the link and, since I have define('PHPDIG_IN_DOMAIN',true); set, adds it to the tempspider and sites tables. The site itself is added correctly, but the tempspider entry gets the path of the current page appended. In this case it adds http://piemonte.jobnetwork.it/foodir/

3) I did some digging and found this in the phpdigExplore function in robot_functions.php:


if (substr($regs[8],0,1) == "/") {
    $links[$index] = phpdigRewriteUrl($regs[8]);
}
else {
    $links[$index] = phpdigRewriteUrl($path.$regs[8]);
}

The "else" branch is also executed when the link has no path or filename at all, so if I link to http://subdomain.jobnetwork.it the current path is appended to the link!
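
To make the problem concrete, here is a minimal standalone sketch of that two-branch logic. It is not the PhpDig code itself: resolveLinkOld() is a hypothetical name, $linkPath stands in for $regs[8], $currentPath stands in for $path (assumed here to look like "foodir/"), and the call to phpdigRewriteUrl() is left out so the raw string is returned.

```php
<?php
// Hypothetical stand-in for the two-branch logic quoted above.
// $linkPath ~ $regs[8] (path part of the link), $currentPath ~ $path.
// Assumption: $path has the form "foodir/"; phpdigRewriteUrl() is omitted.
function resolveLinkOld(string $linkPath, string $currentPath): string
{
    if (substr($linkPath, 0, 1) == "/") {
        // Link starts with "/": taken as-is
        return $linkPath;
    }
    // Every other case, including an empty path, gets the current path prepended
    return $currentPath . $linkPath;
}

// A link to http://piemonte.jobnetwork.it has no path part, so $linkPath is "".
var_export(resolveLinkOld("", "foodir/"));          // 'foodir/'  <- unwanted
echo "\n";
var_export(resolveLinkOld("bar.htm", "foodir/"));   // 'foodir/bar.htm'
echo "\n";
var_export(resolveLinkOld("/bar.htm", "foodir/"));  // '/bar.htm'
echo "\n";
```

The first call is the case from step 2: the empty path falls into the "else" branch and picks up "foodir/".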

My solution is the following:

if (substr($regs[8],0,1) == "/") {
    $links[$index] = phpdigRewriteUrl($regs[8]);
}
elseif ($regs[5] == "" or $url == 'http://'.$regs[5].'/') {
    // we are in the same host or the host information is not provided
    $links[$index] = phpdigRewriteUrl($path.$regs[8]);
}
elseif ($regs[5] != "" && $url != 'http://'.$regs[5].'/') {
    // host information is provided but we are not in the same host
    $links[$index] = phpdigRewriteUrl($regs[8]);
}
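
The same decision can be tried outside the spider with a small sketch like the one below. All names are hypothetical: $linkHost stands in for $regs[5], $linkPath for $regs[8], $siteUrl for $url (assumed to hold the site root in the form 'http://host/', as the comparison above implies) and $currentPath for $path. phpdigRewriteUrl() is again left out.

```php
<?php
// Sketch of the three-way decision proposed above, with hypothetical names.
// Assumptions: $siteUrl looks like 'http://host/', $currentPath like 'foodir/'.
function resolveLinkPath(
    string $linkHost,
    string $linkPath,
    string $siteUrl,
    string $currentPath
): string {
    if (substr($linkPath, 0, 1) == "/") {
        // Path starting with "/": taken as-is
        return $linkPath;
    }
    if ($linkHost == "" || $siteUrl == 'http://' . $linkHost . '/') {
        // No host given, or same host: relative to the current page
        return $currentPath . $linkPath;
    }
    // Host given and different from the current one: do not prepend the path
    return $linkPath;
}

$site = 'http://www.jobnetwork.it/';

// Host-only link to another subdomain: nothing is prepended any more
var_export(resolveLinkPath('piemonte.jobnetwork.it', '', $site, 'foodir/'));       // ''
echo "\n";
// Relative link without a host: unchanged behaviour
var_export(resolveLinkPath('', 'bar.htm', $site, 'foodir/'));                      // 'foodir/bar.htm'
echo "\n";
// Full URL on the same host with a relative path: unchanged behaviour
var_export(resolveLinkPath('www.jobnetwork.it', 'bar.htm', $site, 'foodir/'));     // 'foodir/bar.htm'
echo "\n";
```

Since the condition of the last "elseif" is the exact negation of the second one, the sketch uses a plain fall-through for it; the outcome is the same for every input.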


Charter, what do you think? I don't know whether the solution is good, or whether it leaves the handling of all other links unchanged.
Regards
Simone Capra

capra__nospam__@erweb.it
E.R.WEB - s.r.l.
http://www.erweb.it