Spidering in domain: 2 problems


dhorwitz
05-18-2005, 02:11 AM
Hi All,

While spidering a domain with PHPDIG_IN_DOMAIN, I have noticed 2 problems (in phpdig-1.8.7):

1) Spidering a domain (in my case a large university domain) from the main institutional site results in some sites not being recognised as being in the domain. For instance, if the spider starts at:
www.uct.ac.za then web.uct.ac.za is recognised as part of the domain while www.ched.uct.ac.za is not (i.e. to compare domains it seems to strip the first part of each host name rather than check the end of the domain).

2) When it encounters a new site, the site is recorded in the temp file as / rather than as the page that was linked. So sites that cannot be spidered from the root folder don't get indexed.
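
The first problem can be illustrated with a small sketch. It assumes, as the description above suggests, that the comparison drops the first label of each host name and compares what is left; stripFirstLabel is a hypothetical helper written for this illustration, not a PhpDig function.

```php
<?php
// Hypothetical helper (not part of PhpDig): drop the first label of a
// host name, e.g. "www.uct.ac.za" -> "uct.ac.za".
function stripFirstLabel(string $host): string {
    $pos = strpos($host, '.');
    return $pos === false ? $host : substr($host, $pos + 1);
}

$start = 'www.uct.ac.za';
foreach (['web.uct.ac.za', 'www.ched.uct.ac.za'] as $candidate) {
    // Comparing the stripped remainders only matches hosts that sit at
    // the same depth as the start host.
    $same = stripFirstLabel($start) === stripFirstLabel($candidate);
    echo $candidate, ': ', $same ? 'in domain' : 'not in domain', "\n";
}
```

Under that assumption, web.uct.ac.za is reported as in domain, while www.ched.uct.ac.za is not, because its remainder is ched.uct.ac.za rather than uct.ac.za.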

I'll have a look in the code and see what I can find...

David

dhorwitz
05-18-2005, 03:18 AM
Here is an updated phpdigCompareDomains that seems to fix the problem (I don't know if it breaks anything else!):


//=================================================
// Check whether one URL is in the same domain as another
function phpdigCompareDomains($url1,$url2) {
    $url1 = parse_url($url1);
    $url2 = parse_url($url2);
    // debug output
    print $url1['host']."\n";
    print $url2['host']."\n";

    // $from_url[2] is the host of $url1 without its first label;
    // the last condition checks that the host of $url2 ends with it
    if (isset($url1['host']) && isset($url2['host'])
        && eregi('^([a-z0-9_-]+)\.(.+)',$url1['host'],$from_url)
        && eregi('^([a-z0-9_-]+)\.(.+)',$url2['host'],$to_url)
        && (strpos($url2['host'],$from_url[2])!==false
            && (strpos($url2['host'],$from_url[2])+strlen($from_url[2])==strlen($url2['host'])))) {
        return true;
    }
    else {
        return false;
        // be careful setting this to true as indexing
        // could take a very, VeRy, VERY looooong time
        // return true;
    }
}

dhorwitz
05-18-2005, 05:01 AM
Got the terms back to front :-) Here is the corrected version:

//=================================================
// Check whether one URL is in the same domain as another
function phpdigCompareDomains($url1,$url2) {
    $url1 = parse_url($url1);
    $url2 = parse_url($url2);
    // debug output
    print $url1['host']."\n";
    print $url2['host']."\n";

    // $to_url[2] is the host of $url2 without its first label;
    // the last condition checks that the host of $url1 ends with it
    if (isset($url1['host']) && isset($url2['host'])
        && eregi('^([a-z0-9_-]+)\.(.+)',$url1['host'],$from_url)
        && eregi('^([a-z0-9_-]+)\.(.+)',$url2['host'],$to_url)
        && (strpos($url1['host'],$to_url[2])!==false
            && (strpos($url1['host'],$to_url[2])+strlen($to_url[2])==strlen($url1['host'])))) {
        return true;
    }
    else {
        return false;
        // be careful setting this to true as indexing
        // could take a very, VeRy, VERY looooong time
        // return true;
    }
}
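
For anyone trying this on a newer PHP: eregi() was removed in PHP 7.0, so the function above will not run there as written. Below is a minimal sketch of the same "ends with the parent domain" idea using only parse_url and string functions. The name inSameDomain and the argument order (start site first, linked URL second) are assumptions made for this illustration and are not PhpDig's API. The sketch also requires a dot in front of the matched suffix, which the strpos version above does not, so a host such as www.notuct.ac.za is not treated as part of uct.ac.za.

```php
<?php
// Hypothetical helper, not part of PhpDig: true when the host of
// $candidateUrl lies under the parent domain of the host of $baseUrl
// (the parent domain being everything after the first label).
function inSameDomain(string $baseUrl, string $candidateUrl): bool {
    $baseHost = parse_url($baseUrl, PHP_URL_HOST);
    $candHost = parse_url($candidateUrl, PHP_URL_HOST);
    if (!is_string($baseHost) || !is_string($candHost)) {
        return false; // at least one URL has no host part
    }
    $baseHost = strtolower($baseHost);
    $candHost = strtolower($candHost);

    // Parent domain of the base host: "www.uct.ac.za" -> "uct.ac.za".
    $dot = strpos($baseHost, '.');
    if ($dot === false) {
        return $baseHost === $candHost; // single-label host, e.g. "localhost"
    }
    $parent = substr($baseHost, $dot + 1);

    // Accept the parent itself, or any host ending in "." . $parent.
    $suffix = '.' . $parent;
    return $candHost === $parent
        || substr($candHost, -strlen($suffix)) === $suffix;
}
```

With the start site http://www.uct.ac.za/, this accepts both http://web.uct.ac.za/ and http://www.ched.uct.ac.za/page.html. Like the function above, it strips the first label of the start host unconditionally, so a start URL of http://uct.ac.za/ would match everything under ac.za; that case would need separate handling.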