Hi Charter!
I have found the following bug:
1) I spider a site (example:
http://www.jobnetwork.it/foodir/foo.htm) that contains a link to a subdomain. The link must consist of the hostname only. In the example you will find a link to
http://piemonte.jobnetwork.it
2) The spider finds the link and, since I have define('PHPDIG_IN_DOMAIN',true);, adds it to the tempspider and sites tables. It adds the site correctly, but in tempspider it adds the path of the current page. In this case it adds
http://piemonte.jobnetwork.it/foodir/
3) I dug into it and found this code in the phpdigExplore function in robot_functions.php:
if (substr($regs[8],0,1) == "/") {
    $links[$index] = phpdigRewriteUrl($regs[8]);
}
else {
    $links[$index] = phpdigRewriteUrl($path.$regs[8]);
}
The "else" branch is also executed when the link has no path or filename at all, so if I link to
http://subdomain.jobnetwork.it the current path is prepended to the link!
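The behaviour can be reproduced outside PhpDig with a small standalone sketch. Everything here is an assumption made for illustration: originalBranch is a hypothetical stand-in for the branch above, the phpdigRewriteUrl call is left out, and "foodir/" is an assumed value for $path.

```php
<?php
// Hypothetical stand-in for the branch in phpdigExplore; this is not PhpDig code.
// $path     : directory of the page being spidered (assumed value: "foodir/")
// $linkPath : path part of the link, i.e. what the code above calls $regs[8]
function originalBranch(string $path, string $linkPath): string
{
    if (substr($linkPath, 0, 1) == "/") {
        return $linkPath;        // absolute path: used as is
    }
    return $path . $linkPath;    // everything else is treated as relative
}

// A link to http://piemonte.jobnetwork.it has an empty path part.
echo originalBranch("foodir/", ""), "\n";          // prints "foodir/"
echo originalBranch("foodir/", "bar.htm"), "\n";   // prints "foodir/bar.htm"
echo originalBranch("foodir/", "/bar.htm"), "\n";  // prints "/bar.htm"
```

The first call is the problem case: an empty path is handled like a relative link, so the current directory ends up attached to the other host.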
My solution is the following:
if (substr($regs[8],0,1) == "/") {
    $links[$index] = phpdigRewriteUrl($regs[8]);
}
elseif ($regs[5] == "" or $url == 'http://'.$regs[5].'/') {
    // we are in the same host or the host information is not provided
    $links[$index] = phpdigRewriteUrl($path.$regs[8]);
}
elseif ($regs[5] != "" && $url != 'http://'.$regs[5].'/') {
    // host information is provided but we are not in the same host
    $links[$index] = phpdigRewriteUrl($regs[8]);
}
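To see how this treats other kinds of links, the three branches can be run against a few cases in isolation. This is only a sketch under assumptions: proposedBranch is a hypothetical stand-in, the phpdigRewriteUrl call is omitted, and the values of $url and $path are guesses at what PhpDig would hold while spidering http://www.jobnetwork.it/foodir/foo.htm.

```php
<?php
// Hypothetical stand-in for the proposed branches; this is not PhpDig code.
// $url      : root URL of the site being spidered (assumed format)
// $path     : directory of the current page (assumed value: "foodir/")
// $linkHost : host part of the link ($regs[5] above), "" if none
// $linkPath : path part of the link ($regs[8] above)
function proposedBranch(string $url, string $path, string $linkHost, string $linkPath): string
{
    if (substr($linkPath, 0, 1) == "/") {
        return $linkPath;                 // absolute path: used as is
    }
    if ($linkHost == "" || $url == 'http://' . $linkHost . '/') {
        return $path . $linkPath;         // same host or no host: relative link
    }
    return $linkPath;                     // different host: keep the path untouched
}

$url  = 'http://www.jobnetwork.it/';
$path = 'foodir/';

$cases = [
    'relative link, no host' => ['', 'bar.htm'],
    'absolute path, no host' => ['', '/bar.htm'],
    'subdomain, no path'     => ['piemonte.jobnetwork.it', ''],
    'same host, no path'     => ['www.jobnetwork.it', ''],
];

foreach ($cases as $label => [$host, $linkPath]) {
    printf("%-24s => '%s'\n", $label, proposedBranch($url, $path, $host, $linkPath));
}
```

In this simplified model the subdomain case now yields an empty path instead of "foodir/", and the two links without a host behave as before. A bare link to the current host still gets "foodir/", so that case may deserve a separate look.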
Charter, what do you think? I don't know whether this solution is sound, or whether it leaves the handling of other links unchanged.
Regards
Simone Capra
capra__nospam__@erweb.it
E.R.WEB - s.r.l.
http://www.erweb.it