Subdirectories but not higher directories
td234
01-17-2005, 03:38 PM
I am spidering a site that I have no control over. I would like to spider a single directory on that site (say, /dir1). I discovered that I must set LIMIT_TO_DIRECTORY to false because under /dir1 the content is in subdirectories. The problem is that with LIMIT_TO_DIRECTORY set to false, the spider also climbs out of the starting directory (/dir1). Is there a way to start in one directory and traverse it thoroughly (including subdirectories) without climbing out?
Thanks for a great tool!
T
vinyl-junkie
01-17-2005, 04:59 PM
Have a look at this thread:
http://www.phpdig.net/forum/showthread.php?t=363
:)
td234
01-17-2005, 08:01 PM
Yes, I read that when I was researching an answer to my question. Since I do not have control over the site I am spidering, I cannot use the robots.txt suggestion, and the rest of that thread was about getting search results that only contain a certain directory. I am trying to limit my crawl to a specified directory (and its subdirectories). I guess I am a little surprised that setting LIMIT_TO_DIRECTORY to false allows the crawler to crawl out of a directory. Doesn't this defeat the purpose of adding a path to the URL to crawl?
Charter
01-18-2005, 03:41 AM
You need to set LIMIT_TO_DIRECTORY back to true and then enter links in the textbox like so:
http://www.domain.com/dir1/
http://www.domain.com/dir1/sub1/
http://www.domain.com/dir1/sub2/
With LIMIT_TO_DIRECTORY set to true, PhpDig indexes that (sub)directory only; no (sub)directories of (sub)directories are indexed.
td234
01-18-2005, 05:34 AM
Thanks Charter. My dilemma with your suggestion is that I do not know all the subdirectories, so I cannot enter them manually. They could also change, go away, or be added, so I really need the crawling feature. I think my only solution is to modify the code to crawl only links that contain the original path. That way I will only crawl down, not up. This should be a simple comparison.
Is it only me, or does anyone else feel that crawling up after being given a path is unexpected behavior?
T
Charter
01-18-2005, 05:40 AM
It only crawls up if LIMIT_TO_DIRECTORY is set to false. Look at FORBIDDEN_EXTENSIONS in section 4.3 of the documentation (http://www.phpdig.net/navigation.php?action=doc#toc4) and limit that way instead.
td234
01-18-2005, 06:20 AM
Correct me if I am wrong (which is quite possible), but it does not crawl down into subdirectories unless LIMIT_TO_DIRECTORY is set to false. I have changed it back to true and it does not find any results because the links go to subdirectories.
Are you suggesting adding "\.\." to FORBIDDEN_EXTENSIONS? I am not sure that would work on full URLs.
Thanks for taking the time to respond to my posts!
T
Charter
01-18-2005, 08:59 AM
LIMIT_TO_DIRECTORY does just that: it limits the crawl to that directory, no up, no down, just within that directory. Set it to true to stay within a directory; set it to false to traverse up and down. You can use FORBIDDEN_EXTENSIONS to exclude links by setting it to a suitable regex, but a double dot is not sufficient. If a link matches what's in FORBIDDEN_EXTENSIONS, it won't be indexed. You can also choose a site, click the update button, and then click the no way symbol to delete a directory and exclude it from future indexing.
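To illustrate why a double-dot pattern falls short on full URLs, here is a minimal sketch. The helper is_excluded() and the sample URLs are hypothetical and not part of PhpDig. It uses PCRE via preg_match with a negative lookahead; whether PhpDig's own matching of FORBIDDEN_EXTENSIONS accepts that syntax is not confirmed in this thread, so treat it as an illustration of the idea only.

```php
<?php
// Hypothetical helper, not PhpDig code: true when $url matches the exclusion pattern.
function is_excluded(string $pattern, string $url): bool
{
    return preg_match($pattern, $url) === 1;
}

// Sample absolute links, as a crawler would typically see them.
$links = [
    'http://www.domain.com/dir1/sub1/page.html',
    'http://www.domain.com/other/page.html',
    'http://www.domain.com/index.html',
];

// Matches only links that literally contain "..".
$doubleDot = '#\.\.#';

// Matches any link on the host whose path does not start with /dir1/.
$outsideDir1 = '#^http://www\.domain\.com/(?!dir1/)#';

foreach ($links as $link) {
    printf(
        "%s  double-dot: %s  outside-dir1: %s\n",
        $link,
        is_excluded($doubleDot, $link) ? 'excluded' : 'kept',
        is_excluded($outsideDir1, $link) ? 'excluded' : 'kept'
    );
}
```

None of the absolute URLs contains "..", so the double-dot pattern keeps all three, while the path-based pattern keeps only the link under /dir1/.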
td234
01-18-2005, 10:37 AM
Man, I can not tell you how much I appreciate that you take the time to reply. I know you are busy. Thanks a ton.
I have been working on my problem all morning. I want to get all the travel articles from some really huge sites, and I cannot afford to spider each whole site and then exclude the unwanted directories with the "no way" symbol. I really need to make this work so that it only traverses "down" when a path is given.
I have been traversing your code all morning. I am figuring that it will be easiest to set a variable (say $parent_path) when the first (level==0) entry is made in the tempspider table. Then in spider.php right before indexing on line 513 I could do this test if level>0.
if($level==0) {
    $parent_path = $temp_path;
}
if(isset($parent_path)) {
    $parent_path = preg_replace("/\//","\/",$parent_path);
    $parent_match = "/^$parent_path/";
    if(!preg_match($parent_match, $temp_path)) { // if parent path not in current path, do not index
        $ok_for_index=0;
    }
}
You know your code more intimately than I do. Can you give me some feedback on my proposed solution and tell me whether this is a good way to do it?
Thanks.
T
td234
01-18-2005, 11:40 AM
The above does not seem to work. Entries still get put in tempspider that are above the original directory. Will continue working and post again.
Charter
01-18-2005, 11:46 AM
Nah, it's easier than that...
In robot_functions.php find:
if($link['path'] != $add_include['in_path']) {
    $link['ok'] = 0;
}
And replace with:
if(!eregi("^".$add_include['in_path'],$link['path'])) {
    $link['ok'] = 0;
}
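The same "starts with the base path" test can be sketched as a standalone function. This is an illustration under assumptions, not PhpDig code: path_is_within() is a hypothetical name, and the sample paths assume directories are stored with a trailing slash. It uses a plain case-insensitive prefix comparison instead of eregi, which was removed in PHP 7.0, and so avoids treating characters such as "." in a directory name as regex syntax.

```php
<?php
// Hypothetical helper: true when $path equals $basePath or lies beneath it.
// Case-insensitive, to mirror the behaviour of eregi in the snippet above.
function path_is_within(string $basePath, string $path): bool
{
    // Compare only the first strlen($basePath) characters of both strings.
    return strncasecmp($path, $basePath, strlen($basePath)) === 0;
}

// Assumed path format: no leading slash, trailing slash on directories.
$base = 'dir1/';

var_dump(path_is_within($base, 'dir1/'));            // bool(true)
var_dump(path_is_within($base, 'dir1/sub1/'));       // bool(true)
var_dump(path_is_within($base, 'dir1/sub1/deep/'));  // bool(true)
var_dump(path_is_within($base, 'dir2/'));            // bool(false)
var_dump(path_is_within($base, 'dir10/'));           // bool(false)
```

The trailing slash on the base path matters: without it, 'dir1' would also be a prefix of 'dir10/'.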
td234
01-18-2005, 12:55 PM
YES! That is exactly what I needed.
Just so this thread has all the information needed for future reference, I think it should be noted that LIMIT_TO_DIRECTORY needs to be set to true for this to work correctly.
I know you probably cannot make a change like that to the code when so many people are already using it, but I think that is how most would expect it to function when passed a URL with a directory. Big improvement I think, but I am biased because that is what I needed. ;>)
Thanks again for a great program.
T
chris614164
02-08-2005, 02:31 PM
why has this tweak not been included in the phpdig package, it's really great but now every time i want to update phpdig i have to remember to apply the fix, so i will visit this page until the end of my life, hmmm... :deer:
td234
02-08-2005, 02:37 PM
Actually, this fix has been added and is included in 1.8.8-rc1. Notice a new variable in the config.php called ALLOW_SUBDIRECTORIES. Set it to TRUE.
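As a rough sketch, the relevant lines in config.php might look like the following. The define() form is an assumption based on how these settings are referred to as constants in this thread; check the config.php shipped with your version for the exact lines and comments. Whether LIMIT_TO_DIRECTORY must also stay true in 1.8.8-rc1 is carried over from the manual fix above and not verified against that release.

```php
<?php
// Assumed form of the settings in config.php (1.8.8-rc1 or later).
define('LIMIT_TO_DIRECTORY', true);   // stay within the given directory
define('ALLOW_SUBDIRECTORIES', true); // also follow links into its subdirectories
```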
chris614164
02-08-2005, 03:09 PM
ok great sorry i have not downloaded the new one, but i will as soon as it is stable cause this tool really rocks :):)