Bug when spidering subdomains


cybercox
03-30-2004, 11:32 PM
Hi charter!
I have found the following bug:

1) I spider a site (example: http://www.jobnetwork.it/foodir/foo.htm) that contains a link to a subdomain. The link must consist of the hostname only, with no path. In the example you will find a link to http://piemonte.jobnetwork.it

2) The spider finds the link and, since I have define('PHPDIG_IN_DOMAIN',true); set, adds it to the tempspider and sites tables. It adds the site correctly, but in tempspider it adds the path of the current page. In this case it adds http://piemonte.jobnetwork.it/foodir/

3) I did some digging and found this in the phpdigExplore function in robot_functions.php:


if (substr($regs[8],0,1) == "/") {
    $links[$index] = phpdigRewriteUrl($regs[8]);
}
else {
    $links[$index] = phpdigRewriteUrl($path.$regs[8]);
}

The "else" branch is also executed when the link has no path or filename at all, so if I link to http://subdomain.jobnetwork.it the current path is added to the link!
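To make the failure concrete, here is a minimal standalone sketch of that branch logic. It is not phpDig code: resolve_original() is a hypothetical stand-in, and I am assuming that $regs[5] holds the host and $regs[8] the path part of the link, and that the current path looks like "foodir/".

```php
<?php
// Hypothetical stand-in for the original branch logic; not part of phpDig.
// $host and $linkPath play the assumed roles of $regs[5] and $regs[8].
function resolve_original(string $currentPath, string $host, string $linkPath): string
{
    if (substr($linkPath, 0, 1) == "/") {
        return $linkPath;              // root-relative link: used as-is
    }
    // Everything else gets the current path prefixed. $host is never
    // consulted, which is exactly the problem for host-only links.
    return $currentPath . $linkPath;
}

// Link "http://piemonte.jobnetwork.it" found on a page under "foodir/":
// the host is set and the path part is empty.
$path = resolve_original("foodir/", "piemonte.jobnetwork.it", "");
echo "http://piemonte.jobnetwork.it/" . $path, "\n";
// prints http://piemonte.jobnetwork.it/foodir/
```

The empty link path falls through to the second return, so the path of the page being spidered ends up attached to the other host.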

My solution is the following:

if (substr($regs[8],0,1) == "/") {
    $links[$index] = phpdigRewriteUrl($regs[8]);
}
elseif ($regs[5] == "" || $url == 'http://'.$regs[5].'/') {
    // we are in the same host or the host information is not provided
    $links[$index] = phpdigRewriteUrl($path.$regs[8]);
}
elseif ($regs[5] != "" && $url != 'http://'.$regs[5].'/') {
    // host information is provided but we are not in the same host
    $links[$index] = phpdigRewriteUrl($regs[8]);
}


Charter, what do you think? I don't know whether the solution is good, or whether it is safe for the other kinds of links...
Regards
Simone Capra

capra__nospam__@erweb.it
E.R.WEB - s.r.l.
http://www.erweb.it

Charter
03-31-2004, 11:48 PM
Hi. Good eye! Yes, I see the problem when a link like http://sub.domain.com is encountered without an ending slash. Untested, but an alternative solution might be the following:

if (($regs[5] != "") && ($regs[8] == "")) {
    // host given but no path or file: point at the root of that host
    $links[$index] = array("path" => "", "file" => "");
}
elseif (substr($regs[8],0,1) == "/") {
    $links[$index] = phpdigRewriteUrl($regs[8]);
}
else {
    $links[$index] = phpdigRewriteUrl($path.$regs[8]);
}
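
For anyone who wants to check the branch order outside of phpDig, here is a minimal standalone sketch of the same idea. It is only an illustration: resolve_with_host_check() is a hypothetical helper, it returns a plain string instead of the path/file array, and it assumes $regs[5] is the host and $regs[8] the path part of the link.

```php
<?php
// Hypothetical helper mirroring the branch order above; not part of phpDig.
// $host and $linkPath play the assumed roles of $regs[5] and $regs[8].
function resolve_with_host_check(string $currentPath, string $host, string $linkPath): string
{
    if ($host != "" && $linkPath == "") {
        return "";                     // host-only link: root of that host
    }
    if (substr($linkPath, 0, 1) == "/") {
        return $linkPath;              // root-relative link: used as-is
    }
    return $currentPath . $linkPath;   // relative link: prefix the current path
}

// The host-only link from the report no longer picks up "foodir/".
var_dump(resolve_with_host_check("foodir/", "piemonte.jobnetwork.it", ""));
// Relative and root-relative links behave as before.
var_dump(resolve_with_host_check("foodir/", "", "bar.htm"));
var_dump(resolve_with_host_check("foodir/", "", "/bar.htm"));
```

The host-only test has to come first, because an empty path would otherwise fall through to the final branch and get the current path prefixed again.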