Spidering sub-directories as the root


bloodjelly
07-08-2004, 07:19 PM
I'm interested in getting the spider function, not just the search function, to treat subdirectories of URLs as the root.

For example, someone might want to spider http://www.geocities.com/website as its own site, without scanning the true root (www.geocities.com).

So far I changed this bit of code in robot_functions.php:
$url = $pu['scheme']."://".$pu['host']."/";
to this:
$url = $pu['scheme']."://".$pu['host'];
if (isset($pu['path'])) {
    $url .= $pu['path']."/";
} else {
    $url .= "/";
}
and this:
$subpu = phpdigRewriteUrl($pu['path']."?".$pu['query']);
to this:
if (isset($pu['path'])) {
    $subpu = phpdigRewriteUrl("?".$pu['query']);
} else {
    $subpu = phpdigRewriteUrl($pu['path']."?".$pu['query']);
}
which made the end directory store correctly in the table, but I get a "0 links found" message. Has anyone tried to do this yet? I'm not sure if I'm on the right track. Thanks.
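The parse_url() behaviour these changes rely on can be sketched as follows (a standalone illustration, not PhpDig code — the URLs are just examples):

```php
<?php
// When the URL has a path and query, parse_url() exposes them as
// separate components, which is what the modified code branches on.
$pu = parse_url("http://www.geocities.com/website/index.html?id=1");
// $pu['scheme'] => "http", $pu['host'] => "www.geocities.com",
// $pu['path']   => "/website/index.html", $pu['query'] => "id=1"

// For a bare host, the 'path' key is absent entirely, so
// isset($pu['path']) is the right test for the fallback branch.
$root = parse_url("http://www.geocities.com");

var_dump(isset($pu['path']));   // bool(true)
var_dump(isset($root['path'])); // bool(false)
```

Note that isset() is needed here rather than an equality check, because parse_url() omits the key instead of returning an empty string.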

caco3
07-10-2004, 02:01 PM
Hello bloodjelly,

I have the same problem, and I solved it by adding this code:


/// Modification 2004 by George Ruinelli //////////////////////////
if ($link['url'] == "http://www.domain.ch/") {
    $pos1 = strpos("_".$link['path'], "subdir/");
    $pos2 = strpos("_".$link['file'], "subdir/");
    //if ($pos != 1 AND $pos != 2) {
    if ($pos1 == false AND $pos2 == false) { // text not found
        $link['ok'] = 0;
    }
}


in the file robot_functions.php, at the end of the function phpdigDetectDir, but before this block:

if (!$link['ok'] && isset($status)) {
    $link['status']  = $status['status'];
    $link['host']    = $status['host'];
    $link['path']    = $status['path'];
    $link['cookies'] = $status['cookies'];
}

My code prevents PhpDig from adding a link that isn't in this subdirectory to its list.
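The strpos() idiom in the snippet above is worth noting (a standalone sketch, using example paths rather than PhpDig data): prefixing the haystack with "_" shifts any real match to position 1 or later, so a plain `== false` comparison cannot confuse "found at position 0" (which also compares equal to false) with "not found".

```php
<?php
// With the "_" prefix, a path that starts with "subdir/" matches at
// index 1, never 0, so comparing against false with == is safe here.
var_dump(strpos("_"."subdir/page.html", "subdir/")); // int(1)  -> found
var_dump(strpos("_"."other/x.html",     "subdir/")); // bool(false) -> not found
```

Without the prefix, the usual PHP advice applies: compare strpos() results with `=== false` to avoid the position-zero pitfall.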

bloodjelly
07-12-2004, 01:39 PM
Thanks for the help, caco, but what I need is a mod that adds links to the database exactly as entered, either with a subdirectory or not. In other words, if I wanted to spider "http://www.mysite.com/directory" as a root, I could do it, and if I wanted to spider "http://www.mysite.com" as a root I could do that too.

Charter
07-12-2004, 06:27 PM
Hi. Perhaps upgrade to PhpDig version 1.8.2... :D

bloodjelly
07-12-2004, 06:39 PM
You are awesome.

Charter
07-14-2004, 09:04 PM
FYI: version 1.8.3 released to allow for the 'limit to directory' option to be consistent across other control panel options, among other changes.

bloodjelly
07-15-2004, 07:54 PM
Hi charter -

I'm not sure if I'm using the limit to directory feature correctly (I have it set to "true") but when I enter a website (www.geocities.com/psychology_x/main.html for example) it spiders correctly, but the listing in the "sites" table is only for geocities. Is there a way to make each separate directory treated as its own site? Or am I missing something? Thanks.

Charter
07-15-2004, 09:05 PM
Hi. The issue is that foo.com/bar/ is not a separate domain from foo.com/ but rather a subdirectory of that domain. Spidering can now be limited to subdirectories, but the domain is still the domain. On the other hand, the bar.foo.com/ subdomain, while it can point to the foo.com/bar/ subdirectory, is a third-level domain and can be treated as a separate site on a separate server. The database storage scheme is domain based, and that is why subdirectories are not stored separately but subdomains are.
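The distinction can be seen directly in what parse_url() reports as the host (a standalone illustration using the placeholder names from the explanation above):

```php
<?php
// A subdirectory lives under the same host, so a host-keyed storage
// scheme sees it as the same site...
$subdir = parse_url("http://foo.com/bar/");
var_dump($subdir['host']); // string(7) "foo.com"

// ...whereas a subdomain is a different host entirely, even if the
// server maps it to the same files, so it gets its own "sites" entry.
$subdomain = parse_url("http://bar.foo.com/");
var_dump($subdomain['host']); // string(11) "bar.foo.com"
```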

bloodjelly
07-15-2004, 09:12 PM
Got it. Thanks for the explanation.