PhpDig.net

Spidering sub-directories as the root (http://www.phpdig.net/forum/showthread.php?t=1049)

bloodjelly 07-08-2004 06:19 PM

Spidering sub-directories as the root
 
I'm interested in getting the spider function, not just the search function, to treat subdirectories of URLs as the root.

For example, someone might want to spider http://www.geocities.com/website as its own site, without scanning the true root (www.geocities.com).

So far I changed this bit of code in robot_functions.php:
PHP Code:

$url = $pu['scheme']."://".$pu['host']."/";

to this:
PHP Code:

    $url = $pu['scheme']."://".$pu['host'];
    if (isset($pu['path'])) {
        $url .= $pu['path']."/";
    }
    else {
        $url .= "/";
    }

and this:
PHP Code:

$subpu = phpdigRewriteUrl($pu['path']."?".$pu['query']);

to this:
PHP Code:

    if (isset($pu['path'])) {
        $subpu = phpdigRewriteUrl("?".$pu['query']);
    }
    else {
        $subpu = phpdigRewriteUrl($pu['path']."?".$pu['query']);
    }

which made the end directory store correctly in the table, but I get a 0 links found message. Has anyone tried to do this yet? I'm not sure if I'm on the right track. Thanks.
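
For reference, here is a minimal standalone sketch of what the first change is meant to do: build a start URL that keeps the sub-directory instead of cutting back to the host. It only relies on PHP's parse_url(). The function name buildRootUrl() is hypothetical and not part of PhpDig, and the sketch assumes the entered URL points at a directory rather than a file.

```php
<?php
// Hypothetical helper, not part of PhpDig: builds the URL a spider
// would treat as its root, keeping any sub-directory in the path.
function buildRootUrl($entered)
{
    $pu = parse_url($entered);
    $url = $pu['scheme']."://".$pu['host'];
    if (isset($pu['path'])) {
        // Assumes the path is a directory; rtrim() avoids a double
        // slash when the entered URL already ends in "/".
        $url .= rtrim($pu['path'], "/")."/";
    }
    else {
        $url .= "/";
    }
    return $url;
}

echo buildRootUrl("http://www.geocities.com/website"), "\n";
echo buildRootUrl("http://www.geocities.com"), "\n";
```

One thing that may be worth checking in the second change above: its else branch only runs when $pu['path'] is not set, so the path it passes to phpdigRewriteUrl() is empty at that point.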

caco3 07-10-2004 01:01 PM

Hello bloodjelly,

I had the same problem, and I solved it by adding this code:

PHP Code:

// Modification 2004 by George Ruinelli
if ($link['url'] == "http://www.domain.ch/") {
  // The "_" prefix keeps strpos() from returning 0 for a match at the start
  $pos1 = strpos("_".$link['path'], "subdir/");
  $pos2 = strpos("_".$link['file'], "subdir/");
  //if($pos!=1 AND $pos!=2){
  if ($pos1 == false AND $pos2 == false) { // text not found
    $link['ok'] = 0;
  }
}


It goes in the file robot_functions.php, at the end of the function phpdigDetectDir but before this block:
PHP Code:

if (!$link['ok'] && isset($status)) {
    $link['status'] = $status['status'];
    $link['host'] = $status['host'];
    $link['path'] = $status['path'];
    $link['cookies'] = $status['cookies'];


My code prevents PhpDig from adding a link that isn't in this subdir to its list.
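
The same check can be sketched as a standalone function, which makes it easier to try out away from the spider. The name linkIsInSubdir() and its parameters are hypothetical, and the sample values for 'path' and 'file' are assumptions about what the spider stores. It compares strpos() against false with !== instead of relying on the "_" prefix, and it is only an illustration of the idea, not PhpDig code.

```php
<?php
// Hypothetical helper: true when the sub-directory name appears in
// either the path or the file part of a link.
function linkIsInSubdir($path, $file, $subdir)
{
    // strpos() returns 0 for a match at the start of the string,
    // so compare against false with !== rather than ==.
    return strpos($path, $subdir) !== false
        || strpos($file, $subdir) !== false;
}

// Assumed sample link; the array keys follow the snippet above.
$link = array('path' => 'subdir/docs/', 'file' => 'index.html', 'ok' => 1);
if (!linkIsInSubdir($link['path'], $link['file'], 'subdir/')) {
    $link['ok'] = 0; // link is outside the sub-directory, skip it
}
echo $link['ok'], "\n";
```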

bloodjelly 07-12-2004 12:39 PM

Thanks for the help, caco, but what I need is a mod that adds links to the database exactly as entered, either with a subdirectory or not. In other words, if I wanted to spider "http://www.mysite.com/directory" as a root, I could do it, and if I wanted to spider "http://www.mysite.com" as a root I could do that too.

Charter 07-12-2004 05:27 PM

Hi. Perhaps upgrade to PhpDig version 1.8.2... :D

bloodjelly 07-12-2004 05:39 PM

You are awesome.

Charter 07-14-2004 08:04 PM

FYI: version 1.8.3 released to allow for the 'limit to directory' option to be consistent across other control panel options, among other changes.

bloodjelly 07-15-2004 06:54 PM

Hi charter -

I'm not sure if I'm using the 'limit to directory' feature correctly (I have it set to "true"). When I enter a website (www.geocities.com/psychology_x/main.html, for example) it spiders correctly, but the listing in the "sites" table is only for geocities. Is there a way to have each separate directory treated as its own site? Or am I missing something? Thanks.

Charter 07-15-2004 08:05 PM

Hi. The issue is that foo.com/bar/ is not a separate domain from foo.com/ but rather a subdirectory of that domain. Spidering can now be limited to subdirectories, but the domain is still the domain. The bar.foo.com/ subdomain, on the other hand, can point to the foo.com/bar/ subdirectory, but it is a third-level domain and can also be treated as a separate site on a separate server. The database storage scheme is domain based, which is why subdomains are stored separately but subdirectories are not.
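
To illustrate the point, here is a rough sketch of a domain-based key. The function siteKey() is hypothetical and is not how PhpDig is actually implemented; it only shows that parse_url() reports the same host for a domain and its sub-directory, and a different host for a subdomain.

```php
<?php
// Hypothetical: derive a per-site key from scheme and host only,
// ignoring the path, as a domain-based storage scheme would.
function siteKey($url)
{
    $pu = parse_url($url);
    return $pu['scheme']."://".$pu['host']."/";
}

echo siteKey("http://foo.com/"), "\n";      // http://foo.com/
echo siteKey("http://foo.com/bar/"), "\n";  // http://foo.com/
echo siteKey("http://bar.foo.com/"), "\n";  // http://bar.foo.com/
```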

bloodjelly 07-15-2004 08:12 PM

Got it. Thanks for the explanation.


All times are GMT -8.