PhpDig.net


dhorwitz 05-18-2005 01:11 AM

spidering in domain 2 problems
 
Hi All,

Spidering a domain using PHPDIG_IN_DOMAIN I have noticed 2 problems (in phpdig-1.8.7):

1) Spidering a domain (in my case a large university domain) from the main institutional site results in some sites not being recognised as on the domain. For instance, if the search starts at www.uct.ac.za, then web.uct.ac.za is recognised as part of the domain while www.ched.uct.ac.za is not (i.e. to check domains it seems to strip the first part of the host name rather than check the end of the domain).

2) When it encounters a new site, it is recorded in the temp file as at / rather than at the page linked. So sites that can't be spidered from the root folder don't get indexed.
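
To make problem 1 concrete, here is a minimal sketch of a "strip the first part and compare the rest" check applied to the three host names above. The helper names stripFirstLabel and sameDomainByStripping are hypothetical and are not taken from the PhpDig source; they only mimic the behaviour described.

```php
<?php
// Hypothetical helper (not PhpDig code): drop the first label of a host name,
// e.g. "www.uct.ac.za" becomes "uct.ac.za".
function stripFirstLabel(string $host): string {
    $pos = strpos($host, '.');
    return $pos === false ? $host : substr($host, $pos + 1);
}

// Hypothetical check: two hosts count as the same domain only if what is left
// after stripping the first label is identical.
function sameDomainByStripping(string $hostA, string $hostB): bool {
    return stripFirstLabel($hostA) === stripFirstLabel($hostB);
}

// "uct.ac.za" === "uct.ac.za"
var_dump(sameDomainByStripping('www.uct.ac.za', 'web.uct.ac.za'));      // bool(true)
// "uct.ac.za" !== "ched.uct.ac.za"
var_dump(sameDomainByStripping('www.uct.ac.za', 'www.ched.uct.ac.za')); // bool(false)
```

The second call fails because www.ched.uct.ac.za has one more label than the start host, so the two remainders can never be equal even though both hosts end in uct.ac.za.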

I'll have a look in the code and see what I can find...

David

dhorwitz 05-18-2005 02:18 AM

patch
 
Here is an updated phpdigCompareDomains that seems to fix the problem (I don't know if it breaks anything else!):


//=================================================
//Find if an url is same domain than another
function phpdigCompareDomains($url1,$url2) {
    $url1 = parse_url($url1);
    $url2 = parse_url($url2);
    print $url1['host']."\n";
    print $url2['host']."\n";

    if (isset($url1['host']) && isset($url2['host'])
        && eregi('^([a-z0-9_-]+)\.(.+)',$url1['host'],$from_url)
        && eregi('^([a-z0-9_-]+)\.(.+)',$url2['host'],$to_url)
        && (strpos($url2['host'],$from_url[2])!==false
            && (strpos($url2['host'],$from_url[2])+strlen($from_url[2])==strlen($url2['host'])))) {
        return true;
    }
    else {
        return false;
        // be careful setting this to true as indexing
        // could take a very, VeRy, VERY looooong time
        // return true;
    }
}

dhorwitz 05-18-2005 04:01 AM

oops
 
got the terms back to front :-)

//=================================================
// Check whether one URL is in the same domain as another
function phpdigCompareDomains($url1,$url2) {
    $url1 = parse_url($url1);
    $url2 = parse_url($url2);
    // debugging output: prints both host names
    print $url1['host']."\n";
    print $url2['host']."\n";

    // split each host into its first label and the remainder, then
    // require that $url1's host ends with the remainder of $url2's host
    if (isset($url1['host']) && isset($url2['host'])
        && eregi('^([a-z0-9_-]+)\.(.+)',$url1['host'],$from_url)
        && eregi('^([a-z0-9_-]+)\.(.+)',$url2['host'],$to_url)
        && (strpos($url1['host'],$to_url[2])!==false
            && (strpos($url1['host'],$to_url[2])+strlen($to_url[2])==strlen($url1['host'])))) {
        return true;
    }
    else {
        return false;
        // be careful setting this to true as indexing
        // could take a very, VeRy, VERY looooong time
        // return true;
    }
}
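
For anyone reading this on a newer PHP: eregi() was removed in PHP 7.0, so the patch above will not run there as posted. Also, as far as I can tell, the strpos()-based test does not require a dot before the matched suffix, so a host such as www.notuct.ac.za would pass a check against uct.ac.za. Below is a rough sketch of the same "does the host end with the domain" idea with a label boundary added. The function name isInBaseDomain and its parameter names and order are assumptions for illustration only; this is not the PhpDig function, and it makes no claim about which argument PhpDig passes first.

```php
<?php
// Sketch only: hypothetical stand-in, not the PhpDig 1.8.7 function.
// $baseUrl is the URL spidering started from, $candidateUrl is a link found later.
function isInBaseDomain(string $baseUrl, string $candidateUrl): bool {
    $baseHost = parse_url($baseUrl, PHP_URL_HOST);
    $candHost = parse_url($candidateUrl, PHP_URL_HOST);
    if (!is_string($baseHost) || !is_string($candHost)) {
        return false; // one of the URLs has no usable host part
    }
    $baseHost = strtolower($baseHost);
    $candHost = strtolower($candHost);

    // Treat the base host minus its first label as the domain,
    // e.g. www.uct.ac.za gives uct.ac.za.
    $dot = strpos($baseHost, '.');
    if ($dot === false) {
        return $baseHost === $candHost;
    }
    $domain = substr($baseHost, $dot + 1);

    // Accept the bare domain, or any host that ends in ".domain".
    $suffix = '.' . $domain;
    return $candHost === $domain
        || substr($candHost, -strlen($suffix)) === $suffix;
}

var_dump(isInBaseDomain('http://www.uct.ac.za/', 'http://www.ched.uct.ac.za/page.html'));
var_dump(isInBaseDomain('http://www.uct.ac.za/', 'http://www.notuct.ac.za/'));
```

One caveat with this sketch: it always strips the first label of the base host, so a start URL without a www-style prefix (for example http://uct.ac.za/) would be reduced to ac.za and match far too much. A real fix would need some rule for where the domain actually starts.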


All times are GMT -8.