Bug when spidering subdomains
Hi Charter!
I have found the following bug:

1) I spider a site (example: http://www.jobnetwork.it/foodir/foo.htm) that contains a link to a subdomain. The link consists of a hostname only; in the example you will find a link to http://piemonte.jobnetwork.it

2) The spider finds the link and, since I have define('PHPDIG_IN_DOMAIN',true);, adds it to the tempspider and sites tables. It adds the site correctly, but in tempspider it adds the path of the current page. In this case it adds http://piemonte.jobnetwork.it/foodir/

3) I did some tweaking and traced it to robot_functions.php, in the phpdigExplore function:

```php
if (substr($regs[8],0,1) == "/") {
    $links[$index] = phpdigRewriteUrl($regs[8]);
} else {
    $links[$index] = phpdigRewriteUrl($path.$regs[8]);
}
```

The "else" branch is executed when the link carries no path/filename, so if I link to http://subdomain.jobnetwork.it the current path is appended to the link! My solution is the following:

```php
if (substr($regs[8],0,1) == "/") {
    $links[$index] = phpdigRewriteUrl($regs[8]);
} elseif ($regs[5] == "" or $url == 'http://'.$regs[5].'/') {
    // we are in the same host, or no host information is provided
    $links[$index] = phpdigRewriteUrl($path.$regs[8]);
} elseif ($regs[5] != "" && $url != 'http://'.$regs[5].'/') {
    // host information is provided but we are not in the same host
    $links[$index] = phpdigRewriteUrl($regs[8]);
}
```

Charter, what do you think? I don't know whether the solution is sound, or whether it is conservative with respect to the other kinds of links...

Regards,
Simone Capra
capra__nospam__@erweb.it
E.R.WEB - s.r.l.
http://www.erweb.it
Hi. Good eye! Yes, I see the problem when a link like http://sub.domain.com is encountered without a trailing slash. Untested, but an alternative solution might be the following:
PHP Code:
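The snippet originally posted here did not survive the page export. A minimal sketch of what such an alternative could look like, assuming the same `$regs[5]` (host) and `$regs[8]` (path) captures and the same `$path`/`$url` variables from phpdigExplore; the idea is simply to prepend the current path only when the link does not name a different host:

```php
// Hedged reconstruction, not the snippet Charter actually posted.
if (substr($regs[8],0,1) == "/") {
    // absolute path: keep it as-is
    $links[$index] = phpdigRewriteUrl($regs[8]);
} elseif ($regs[5] != "" && $url != 'http://'.$regs[5].'/') {
    // the link names a host other than the current one:
    // do not inherit the current page's path
    $links[$index] = phpdigRewriteUrl($regs[8]);
} else {
    // relative link on the current host
    $links[$index] = phpdigRewriteUrl($path.$regs[8]);
}
```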