PhpDig.net


td234 01-17-2005 03:38 PM

Subdirectories but not higher directories
 
I am spidering a site that I have no control over. I would like to spider a single directory on that site (say /dir1). I discovered that I must set LIMIT_TO_DIRECTORY to false because under /dir1 the content is in subdirectories. The problem I am having is that by setting LIMIT_TO_DIRECTORY to false, it also climbs out of the current directory (/dir1). Is there a method to start in one directory and to traverse it thoroughly (including subdirectories) but not climb out?

Thanks for a great tool!

T

vinyl-junkie 01-17-2005 04:59 PM

Have a look at this thread:
http://www.phpdig.net/forum/showthread.php?t=363

:)

td234 01-17-2005 08:01 PM

Yes, I read that when I was researching an answer to my question. Since I do not have control over the site I am spidering, I cannot use the robots.txt suggestion, and the balance of that thread was about getting search results that only contain a certain directory. I am trying to limit my crawl to a specified directory (and its subdirectories). I guess I am a little surprised that LIMIT_TO_DIRECTORY allows the crawler to crawl out of a directory. Doesn't this defeat adding a path to the URL to crawl?

Charter 01-18-2005 03:41 AM

You need to set LIMIT_TO_DIRECTORY back to true and then enter links in the textbox like so:
Code:

http://www.domain.com/dir1/
http://www.domain.com/dir1/sub1/
http://www.domain.com/dir1/sub2/

With LIMIT_TO_DIRECTORY set to true, PhpDig indexes that (sub)directory only; no (sub)directories of (sub)directories are indexed.

td234 01-18-2005 05:34 AM

Thanks Charter. My dilemma with your suggestion is that I do not know all the sub-directories, so I cannot manually enter them. They could also change, go away, or be added, so I really need the crawling feature. I think my only solution is to modify the code and only crawl links that contain the original path. That way I will only crawl down, not up. This should be a simple comparison.

Is it only me, or does anyone else feel that crawling up after being given a path is unexpected behavior?

T

Charter 01-18-2005 05:40 AM

It only crawls up if LIMIT_TO_DIRECTORY is set to false. Look at FORBIDDEN_EXTENSIONS in section 4.3 of the documentation and limit that way instead.

td234 01-18-2005 06:20 AM

Correct me if I am wrong (which is quite possible), but it does not crawl down into subdirectories unless LIMIT_TO_DIRECTORY is set to false. I have changed it back to true and it does not find any results because the links go to sub-directories.

Are you suggesting adding "\.\." to FORBIDDEN_EXTENSIONS? I am not sure that would work on full URLs.

Thanks for taking the time to respond to my posts!

T

Charter 01-18-2005 08:59 AM

LIMIT_TO_DIRECTORY does just that: limit to that directory, no up, no down, just within that directory. Set it to true to stay within a directory, set it to false to traverse up and down. You can use FORBIDDEN_EXTENSIONS by setting a regex to exclude links, depending on the regex, but a double dot is not sufficient. If the link matches what's in FORBIDDEN_EXTENSIONS, then it won't be indexed. You can also choose a site, click the update button, and then click the no way symbol to delete a directory and exclude it from future indexing.
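
To illustrate why a double dot is not enough, here is a minimal sketch. It assumes PCRE-style patterns applied with preg_match(); PhpDig itself may apply FORBIDDEN_EXTENSIONS differently. The helper is_excluded_link(), the "other" directory, and both patterns are hypothetical and made up for this example. Once a link has been resolved to an absolute URL it no longer contains "..", so only a pattern that names the unwanted directory excludes anything.

```php
<?php
// Hypothetical helper, not part of PhpDig: returns true when a link
// matches an exclusion regex and should therefore be skipped.
function is_excluded_link(string $url, string $excludePattern): bool
{
    return preg_match($excludePattern, $url) === 1;
}

// Assumed patterns for illustration only.
$doubleDot = '#\.\.#';                             // matches a literal ".."
$namedDir  = '#^http://www\.domain\.com/other/#i'; // matches one named directory

$link = 'http://www.domain.com/other/page.html';   // already an absolute URL

var_dump(is_excluded_link($link, $doubleDot)); // bool(false): no ".." left to match
var_dump(is_excluded_link($link, $namedDir));  // bool(true): directory is named explicitly
var_dump(is_excluded_link('http://www.domain.com/dir1/sub1/a.html', $namedDir)); // bool(false)
```

The drawback, as the thread goes on to discuss, is that every unwanted directory has to be known and listed in advance.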

td234 01-18-2005 10:37 AM

Man, I cannot tell you how much I appreciate that you take the time to reply. I know you are busy. Thanks a ton.

I have been working on my problem all morning. My problem is that I want to get all the travel articles from some really huge sites, and I cannot afford to spider the whole site and then exclude the rest with the "no way" symbol. I really need to try to make this work so that it only traverses "down" if a path is given.

I have been traversing your code all morning. I am figuring that it will be easiest to set a variable (say $parent_path) when the first (level==0) entry is made in the tempspider table. Then in spider.php right before indexing on line 513 I could do this test if level>0.

if($level==0) {
    $parent_path = $temp_path;
}
if(isset($parent_path)) {
    $parent_path = preg_replace("/\//","\/",$parent_path);
    $parent_match = "/^$parent_path/";
    if(!preg_match($parent_match, $temp_path)) { // if parent path not in current path, do not index
        $ok_for_index=0;
    }
}

You know your code more intimately than I do. Can you give me some feedback on my proposed solution and whether this is a good way to do this?

Thanks.

T

td234 01-18-2005 11:40 AM

The above does not seem to work. Entries still get put in tempspider that are above the original directory. Will continue working and post again.

Charter 01-18-2005 11:46 AM

Nah, it's easier than that...

In robot_functions.php find:
PHP Code:

if($link['path'] != $add_include['in_path']) {
    $link['ok'] = 0;


And replace with:
PHP Code:

if(!eregi("^".$add_include['in_path'],$link['path'])) {
    $link['ok'] = 0;
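
For readers on current PHP: eregi() was deprecated in PHP 5.3 and removed in PHP 7.0, so the replacement line above will not run there. The following is a minimal sketch of an equivalent prefix test, not PhpDig's own code. The function path_is_under() and its parameter names are hypothetical stand-ins for $link['path'] and $add_include['in_path'], and the sample paths are assumed formats.

```php
<?php
// Hypothetical standalone helper; not part of PhpDig.
// Returns true when $linkPath starts with $startPath (case-insensitive).
function path_is_under(string $linkPath, string $startPath): bool
{
    // preg_quote() escapes regex metacharacters so the start path is
    // matched literally; the "i" flag keeps eregi()'s case-insensitivity.
    $pattern = '#^' . preg_quote($startPath, '#') . '#i';
    return preg_match($pattern, $linkPath) === 1;
}

var_dump(path_is_under('dir1/sub1/', 'dir1/')); // bool(true): subdirectory of the start path
var_dump(path_is_under('dir1/', 'dir1/'));      // bool(true): the start path itself
var_dump(path_is_under('dir2/', 'dir1/'));      // bool(false): sibling directory
var_dump(path_is_under('', 'dir1/'));           // bool(false): above the start path
```

Unlike the eregi() version, this treats the start path as literal text, so a "." in a directory name matches only a dot. Inside the patch, the condition would read if(!path_is_under($link['path'], $add_include['in_path'])).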



td234 01-18-2005 12:55 PM

YES! That is exactly what I needed.

Just so this thread has all the information needed for future reference, I think it should be noted that LIMIT_TO_DIRECTORY needs to be set to true for this to work correctly.

I know you probably cannot make a change like that to the code after so many people are using it, but I think that is how most would expect it to function if passed a URL with a directory. Big improvement I think, but I am biased because that is what I needed. ;>)

Thanks again for a great program.

T

chris614164 02-08-2005 02:31 PM

why has this tweak not been included in the phpdig package, it's really great but now every time i want to update phpdig i have to remember to apply the fix, so i will visit this page until the end of my life, hmmm... :deer:

td234 02-08-2005 02:37 PM

Sub-directories fixed
 
Actually, this fix has been added and is included in 1.8.8-rc1. Notice a new variable in the config.php called ALLOW_SUBDIRECTORIES. Set it to TRUE.
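
As a rough sketch of what that might look like in config.php (the constant names come from this thread; keeping LIMIT_TO_DIRECTORY at true alongside it is an assumption carried over from the manual patch above, so check the comments in your own config.php):

```php
// config.php, PhpDig 1.8.8-rc1 or later (values assumed from this thread)
define('LIMIT_TO_DIRECTORY', true);   // do not climb out of the directory in the start URL
define('ALLOW_SUBDIRECTORIES', true); // but do follow links down into its subdirectories
```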

chris614164 02-08-2005 03:09 PM

Quote:

Originally Posted by td234
Actually, this fix has been added and is included in 1.8.8-rc1. Notice a new variable in the config.php called ALLOW_SUBDIRECTORIES. Set it to TRUE.

ok great sorry i have not downloaded the new one, but i will as soon as it is stable cause this tool really rocks :):)


All times are GMT -8.
