PhpDig.net


cybercox 03-30-2004 11:32 PM

Bug when spidering subdomains
 
Hi charter!
I have found the following bug:

1) I spider a site (example: http://www.jobnetwork.it/foodir/foo.htm) that contains a link to a subdomain. The link must consist of the hostname only. In the example you will find a link to http://piemonte.jobnetwork.it

2) The spider finds the link and, since I have define('PHPDIG_IN_DOMAIN',true); set, adds it to the tempspider and sites tables. It adds the site correctly, but in tempspider it also adds the path of the current page. In this case it adds http://piemonte.jobnetwork.it/foodir/

3) I did some digging and found this in the phpdigExplore function in robot_functions.php:


if (substr($regs[8],0,1) == "/") {
    $links[$index] = phpdigRewriteUrl($regs[8]);
}
else {
    $links[$index] = phpdigRewriteUrl($path.$regs[8]);
}

The "else" branch is also executed when the link has no path or filename at all, so if I link to http://subdomain.jobnetwork.it the current path is prepended to the (empty) link path!
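
To make the failure concrete, here is a minimal standalone sketch of that branch. It is not PhpDig's actual code: resolveLinkPath() is a hypothetical stand-in, $linkPath plays the role of $regs[8], $currentPath plays the role of $path, and the value "foodir/" for the current path is an assumption based on the example above.

```php
<?php
// Hypothetical stand-in for the if/else quoted above (not PhpDig's real code).
// $linkPath    ~ $regs[8]: the path part of the matched link
// $currentPath ~ $path:    the directory of the page being spidered (assumed "foodir/")
function resolveLinkPath(string $linkPath, string $currentPath): string
{
    if (substr($linkPath, 0, 1) == "/") {
        return $linkPath;             // absolute path: used as-is
    }
    return $currentPath . $linkPath;  // anything else is treated as relative
}

// A link to http://piemonte.jobnetwork.it has an empty path part,
// so it falls into the "relative" branch and inherits the current path.
var_export(resolveLinkPath("", "foodir/"));   // 'foodir/'
```

The empty path does not start with "/", so it is handled like a relative link, which matches the http://piemonte.jobnetwork.it/foodir/ entry seen in tempspider.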

My solution is the following:

if (substr($regs[8],0,1) == "/") {
    $links[$index] = phpdigRewriteUrl($regs[8]);
}
elseif ($regs[5] == "" || $url == 'http://'.$regs[5].'/') {
    // we are in the same host or the host information is not provided
    $links[$index] = phpdigRewriteUrl($path.$regs[8]);
}
elseif ($regs[5] != "" && $url != 'http://'.$regs[5].'/') {
    // host information is provided but we are not in the same host
    $links[$index] = phpdigRewriteUrl($regs[8]);
}


Charter, what do you think? I don't know if the solution is good, or whether it is safe for the other kinds of links...
Regards
Simone Capra

capra__nospam__@erweb.it
E.R.WEB - s.r.l.
http://www.erweb.it

Charter 03-31-2004 11:48 PM

Hi. Good eye! Yes, I see the problem when a link like http://sub.domain.com is encountered without an ending slash. Untested, but an alternative solution might be the following:
PHP Code:

if (($regs[5] != "") && ($regs[8] == "")) {
    // host given but no path: point at the site root
    $links[$index] = array("path" => "", "file" => "");
}
elseif (substr($regs[8],0,1) == "/") {
    $links[$index] = phpdigRewriteUrl($regs[8]);
}
else {
    $links[$index] = phpdigRewriteUrl($path.$regs[8]);
}

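The three cases in this suggestion can be tried outside PhpDig with a small sketch like the one below. Everything in it is an assumption for illustration: resolveLink() is a hypothetical helper, it returns a plain path string rather than whatever phpdigRewriteUrl() really returns, $host stands for $regs[5], $linkPath for $regs[8], and $currentPath for $path.

```php
<?php
// Hypothetical helper mirroring the three branches above (not PhpDig's real code).
// $host        ~ $regs[5]: host part of the link, "" if the link has none
// $linkPath    ~ $regs[8]: path part of the link
// $currentPath ~ $path:    directory of the page being spidered
function resolveLink(string $host, string $linkPath, string $currentPath): string
{
    if ($host != "" && $linkPath == "") {
        return "";                        // bare host such as http://sub.domain.com: site root
    } elseif (substr($linkPath, 0, 1) == "/") {
        return $linkPath;                 // absolute path: used as-is
    } else {
        return $currentPath . $linkPath;  // relative link: resolved against the current path
    }
}

var_export(resolveLink("piemonte.jobnetwork.it", "", "foodir/"));  // ''
echo "\n";
var_export(resolveLink("", "/jobs/list.htm", "foodir/"));          // '/jobs/list.htm'
echo "\n";
var_export(resolveLink("", "bar.htm", "foodir/"));                 // 'foodir/bar.htm'
echo "\n";
```

Checking the bare-host case first means the current path is never prepended to a link that carries a host but no path, while the other two kinds of links behave as before.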


