Subdirectories but not higher directories
I am spidering a site that I have no control over. I would like to spider a single directory on that site (say, /dir1). I discovered that I must set LIMIT_TO_DIRECTORY to false because the content under /dir1 is in subdirectories. The problem is that setting LIMIT_TO_DIRECTORY to false also lets the spider climb out of the starting directory (/dir1). Is there a way to start in one directory and traverse it thoroughly (including subdirectories) without climbing out of it?
Thanks for a great tool! T |
|
Yes, I read that when I was researching an answer to my question. Since I do not have control over the site I am spidering, I cannot use the robots.txt suggestion, and the rest of that thread was about getting search results limited to a certain directory. I am trying to limit the crawl itself to a specified directory (and its subdirectories). I am a little surprised that LIMIT_TO_DIRECTORY allows the crawler to crawl out of a directory. Doesn't that defeat the purpose of adding a path to the URL to crawl?
|
You need to set LIMIT_TO_DIRECTORY back to true and then enter links in the textbox like so:
Code:
http://www.domain.com/dir1/ |
Thanks Charter. My dilemma with your suggestion is that I do not know all the subdirectories, so I cannot enter them manually. They could also change, go away, or be added, so I really need the crawling feature. I think my only solution is to modify the code so it only crawls links that contain the original path. That way it will only crawl down, not up. This should be a simple comparison.
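Something like this prefix check is what I have in mind (just a sketch; the variable names here are made up, not actual PhpDig variables):
PHP Code:
// Hypothetical example: keep a link only if it starts with the start URL.
$start_url  = 'http://www.domain.com/dir1/';
$found_link = 'http://www.domain.com/dir1/europe/rome.html';

if (strpos($found_link, $start_url) === 0) {
    // the link is at or below /dir1/, so crawl it
} else {
    // the link climbed out of /dir1/, so skip it
}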
Is it only me, or does anyone else feel that crawling up after being given a path is unexpected behavior? T |
It only crawls up if LIMIT_TO_DIRECTORY is set to false. Look at FORBIDDEN_EXTENSIONS in section 4.3 of the documentation and limit the crawl that way instead.
|
Correct me if I am wrong (which is quite possible), but it does not crawl down into subdirectories unless LIMIT_TO_DIRECTORY is set to false. I changed it back to true and it does not find any results, because the links go to subdirectories.
Are you suggesting adding "\.\." to FORBIDDEN_EXTENSIONS? I am not sure that would work on full URLs. Thanks for taking the time to respond to my posts! T |
LIMIT_TO_DIRECTORY does just that: it limits the crawl to that directory, no up, no down, just within that directory. Set it to true to stay within a directory; set it to false to traverse up and down. You can use FORBIDDEN_EXTENSIONS to exclude links by setting a regex, depending on the regex, but a double dot is not sufficient. If a link matches what's in FORBIDDEN_EXTENSIONS, it won't be indexed. You can also choose a site, click the update button, and then click the "no way" symbol to delete a directory and exclude it from future indexing.
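For example, an exclusion test of this general shape (illustrative only; the pattern and variable names are made up, and PhpDig's actual matching code may differ):
PHP Code:
// Hypothetical regex that excludes one directory and all .pdf links.
$forbidden_regex = '/(\/private\/|\.pdf$)/i';
$link = 'http://www.domain.com/private/page.html';

if (preg_match($forbidden_regex, $link)) {
    $ok_for_index = 0; // link matches the exclusion pattern, do not index
}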
|
Man, I cannot tell you how much I appreciate that you take the time to reply. I know you are busy. Thanks a ton.
I have been working on my problem all morning. I want to get all the travel articles from some really huge sites, and I cannot afford to spider a whole site and then exclude directories with the "no way" symbol. I really need to make this work so that it only traverses down when a path is given.
I have been going through your code all morning. I figure it will be easiest to set a variable (say, $parent_path) when the first (level == 0) entry is made in the tempspider table. Then in spider.php, right before indexing on line 513, I could run this test when level > 0:
PHP Code:
if ($level == 0) {
    $parent_path = $temp_path;
}
if (isset($parent_path)) {
    $parent_path  = preg_replace("/\//", "\/", $parent_path); // escape slashes for the regex
    $parent_match = "/^$parent_path/";
    if (!preg_match($parent_match, $temp_path)) {
        // if parent path not in current path, do not index
        $ok_for_index = 0;
    }
}
You know your code more intimately than I do. Can you give me some feedback on my proposed solution and whether this is a good way to do it? Thanks. T |
The above does not seem to work: entries above the original directory still get put into tempspider. I will continue working on it and post again.
|
Nah, it's easier than that...
In robot_functions.php, find:
PHP Code:
[snippet not preserved]
and replace it with:
PHP Code:
[snippet not preserved]
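The general idea of the change (a sketch under the assumption that the fix swaps an exact directory match for a prefix test; PhpDig's real code and variable names differ) looks like this:
PHP Code:
// Sketch of the idea only, not PhpDig's actual robot_functions.php code.
$base_path = '/dir1/';                 // directory of the URL you entered
$link_path = '/dir1/europe/rome.html'; // path of a link found while spidering

// true for /dir1/ itself and for anything beneath it
$ok_to_crawl = (strncmp($link_path, $base_path, strlen($base_path)) === 0);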
|
YES! That is exactly what I needed.
Just so this thread has all the information needed for future reference, it should be noted that LIMIT_TO_DIRECTORY needs to be set to true for this to work correctly. I know you probably cannot make a change like that to the code after so many people are using it, but I think that is how most people would expect it to function when passed a URL with a directory. A big improvement, I think, but I am biased because it is what I needed. ;>) Thanks again for a great program. T |
Why has this tweak not been included in the phpdig package? It's really great, but now every time I want to update phpdig I have to remember to apply the fix, so I will be visiting this page until the end of my life, hmmm... :deer:
|
Sub-directories fixed
Actually, this fix has been added and is included in 1.8.8-rc1. Note the new variable in config.php called ALLOW_SUBDIRECTORIES; set it to TRUE.
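In other words, the combination would look something like this in config.php (a sketch; check your copy for the exact define syntax):
PHP Code:
// Sketch of the relevant config.php settings in 1.8.8-rc1.
define('LIMIT_TO_DIRECTORY', true);   // stay within the start directory...
define('ALLOW_SUBDIRECTORIES', true); // ...but also descend into its subdirectories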
|