spidering in domain 2 problems
Hi All,
Spidering a domain using PHPDIG_IN_DOMAIN, I have noticed 2 problems (in phpdig-1.8.7):
1) Spidering a domain (in my case a large university domain) from the main institutional site results in some sites not being recognised as on the domain. For instance, if the search starts at www.uct.ac.za, then web.uct.ac.za is recognised as part of the domain while www.ched.uct.ac.za is not (i.e. to check domains it seems to strip the first part of the host name rather than check the end of the domain).
2) When it encounters a new site, it is recorded in the temp file as at / rather than at the page that was linked, so sites that are not reachable from the root folder don't get indexed.
I'll have a look in the code and see what I can find...
David
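To make problem 1 concrete, here is a minimal sketch of the behaviour described above. It is not the PhpDig source: `stripFirstLabel` is a hypothetical helper, and it assumes the check simply drops the first label of each host name and compares what is left.

```php
<?php
// Hypothetical helper (not part of PhpDig): drop the first label of a host name.
function stripFirstLabel(string $host): string {
    $pos = strpos($host, '.');
    return $pos === false ? $host : substr($host, $pos + 1);
}

$start = 'www.uct.ac.za';
foreach (['web.uct.ac.za', 'www.ched.uct.ac.za'] as $host) {
    // Assumed comparison: the remainders must be identical.
    $same = stripFirstLabel($start) === stripFirstLabel($host);
    echo $host, ' => ', $same ? 'same domain' : 'different domain', "\n";
}
// web.uct.ac.za => same domain
// www.ched.uct.ac.za => different domain
```

With this kind of comparison, www.ched.uct.ac.za reduces to ched.uct.ac.za, which is not equal to uct.ac.za, so the host is rejected even though it ends with the same domain.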
patch
Here is an updated phpdigCompareDomains that seems to fix the problem (don't know if it breaks anything else!)

//=================================================
//Find if an url is same domain than another
function phpdigCompareDomains($url1,$url2) {
    $url1 = parse_url($url1);
    $url2 = parse_url($url2);
    print $url1['host']."\n";
    print $url2['host']."\n";
    if (isset($url1['host']) && isset($url2['host']) &&
        eregi('^([a-z0-9_-]+)\.(.+)',$url1['host'],$from_url) &&
        eregi('^([a-z0-9_-]+)\.(.+)',$url2['host'],$to_url) &&
        (strpos($url2['host'],$from_url[2])!==false &&
        (strpos($url2['host'],$from_url[2])+strlen($from_url[2])==strlen($url2['host'])))) {
        return true;
    }
    else {
        return false;
        // be careful setting this to true as indexing
        // could take a very, VeRy, VERY looooong time
        // return true;
    }
}
oops
got the terms back to front :-)
//=================================================
//Find if an url is same domain than another
function phpdigCompareDomains($url1,$url2) {
    $url1 = parse_url($url1);
    $url2 = parse_url($url2);
    print $url1['host']."\n";
    print $url2['host']."\n";
    if (isset($url1['host']) && isset($url2['host']) &&
        eregi('^([a-z0-9_-]+)\.(.+)',$url1['host'],$from_url) &&
        eregi('^([a-z0-9_-]+)\.(.+)',$url2['host'],$to_url) &&
        (strpos($url1['host'],$to_url[2])!==false &&
        (strpos($url1['host'],$to_url[2])+strlen($to_url[2])==strlen($url1['host'])))) {
        return true;
    }
    else {
        return false;
        // be careful setting this to true as indexing
        // could take a very, VeRy, VERY looooong time
        // return true;
    }
}
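One thing that may be worth checking in the patch: the strpos test looks for the domain as a plain string suffix, so a host such as www.notuct.ac.za would also appear to pass when the domain is uct.ac.za. A minimal sketch of a suffix check with a dot boundary follows. `hostInDomain` is a hypothetical helper, not part of PhpDig, and it assumes the domain of the start site (here uct.ac.za) is already known. It also avoids eregi, which was removed in PHP 7.

```php
<?php
// Hypothetical helper, not part of PhpDig: true when $host is $domain itself
// or a subdomain of it (i.e. ends with "." followed by $domain).
function hostInDomain(string $host, string $domain): bool {
    $host = strtolower($host);
    $domain = strtolower($domain);
    if ($host === $domain) {
        return true;
    }
    $suffix = '.' . $domain;
    // Require at least one character before the dot-prefixed suffix.
    return strlen($host) > strlen($suffix)
        && substr($host, -strlen($suffix)) === $suffix;
}

// Assumed start domain, taken from the example in this thread.
$domain = 'uct.ac.za';
var_dump(hostInDomain('web.uct.ac.za', $domain));      // bool(true)
var_dump(hostInDomain('www.ched.uct.ac.za', $domain)); // bool(true)
var_dump(hostInDomain('www.notuct.ac.za', $domain));   // bool(false)
```

Whether this could be dropped into phpdigCompareDomains depends on which argument is the start URL and on how the domain part is derived, so treat it as an illustration only.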