|
01-17-2005, 03:38 PM | #1 |
Green Mole
Join Date: Jan 2005
Posts: 9
|
Subdirectories but not higher directories
I am spidering a site that I have no control over. I would like to spider a single directory on that site (say /dir1). I discovered that I must set LIMIT_TO_DIRECTORY to false because the content under /dir1 is in subdirectories. The problem I am having is that setting LIMIT_TO_DIRECTORY to false also lets the crawler climb out of the starting directory (/dir1). Is there a way to start in one directory and traverse it thoroughly (including subdirectories) but not climb out?
Thanks for a great tool! T |
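For future readers, the behavior requested here (follow links only while the URL stays under the starting directory) boils down to a path-prefix test. This is a minimal standalone sketch, not actual phpdig code; the helper name and URLs are illustrative:

```php
<?php
// stays_under() is a hypothetical helper, not part of phpdig:
// keep a link only when its path begins with the crawl's start path.
function stays_under($start_url, $link_url) {
    $start_path = parse_url($start_url, PHP_URL_PATH); // e.g. "/dir1/"
    $link_path  = parse_url($link_url, PHP_URL_PATH);
    return strpos($link_path, $start_path) === 0;      // prefix match
}

var_dump(stays_under('http://www.domain.com/dir1/', 'http://www.domain.com/dir1/sub1/page.html')); // bool(true)
var_dump(stays_under('http://www.domain.com/dir1/', 'http://www.domain.com/dir2/page.html'));      // bool(false)
```

Note that the trailing slash on the start URL matters: without it, a sibling path such as /dir1extra would also pass the prefix test.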
01-17-2005, 04:59 PM | #2 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
|
01-17-2005, 08:01 PM | #3 |
Green Mole
Join Date: Jan 2005
Posts: 9
|
Yes, I read that when I was researching an answer to my question. Since I do not control the site I am spidering, I cannot use the robots.txt suggestion, and the rest of that thread concerned getting search results that only contain a certain directory. I am trying to limit my crawl to a specified directory (and its subdirectories). I am a little surprised that LIMIT_TO_DIRECTORY allows the crawler to crawl out of a directory. Doesn't this defeat the purpose of adding a path to the URL to crawl?
|
01-18-2005, 03:41 AM | #4 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
You need to set LIMIT_TO_DIRECTORY back to true and then enter links in the textbox like so:
Code:
http://www.domain.com/dir1/
http://www.domain.com/dir1/sub1/
http://www.domain.com/dir1/sub2/
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
01-18-2005, 05:34 AM | #5 |
Green Mole
Join Date: Jan 2005
Posts: 9
|
Thanks Charter. My dilemma with your suggestion is that I do not know all the subdirectories, so I cannot enter them manually. They could also change, go away, or be added, so I really need the crawling feature. I think my only solution is to modify the code to crawl only links that contain the original path. That way I will only crawl down, not up. This should be a simple comparison.
Is it only me, or does anyone else feel that crawling up after being given a path is unexpected behavior? T |
01-18-2005, 05:40 AM | #6 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
It only crawls up if LIMIT_TO_DIRECTORY is set to false. Look at FORBIDDEN_EXTENSIONS in section 4.3 of the documentation and limit that way instead.
|
01-18-2005, 06:20 AM | #7 |
Green Mole
Join Date: Jan 2005
Posts: 9
|
Correct me if I am wrong (which is quite possible), but it does not crawl down into subdirectories unless LIMIT_TO_DIRECTORY is set to false. I have changed it back to true and it does not find any results, because the links go to subdirectories.
Are you suggesting adding "\.\." to FORBIDDEN_EXTENSIONS? I am not sure that would work on full URLs. Thanks for taking the time to respond to my posts! T |
01-18-2005, 08:59 AM | #8 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
LIMIT_TO_DIRECTORY does just that: limits the crawl to that directory, no up, no down, just within that directory. Set it to true to stay within a directory; set it to false to traverse up and down. You can use FORBIDDEN_EXTENSIONS by setting a regex to exclude links, depending on the regex, but a double dot is not sufficient. If a link matches what's in FORBIDDEN_EXTENSIONS, it won't be indexed. You can also choose a site, click the update button, and then click the "no way" symbol to delete a directory and exclude it from future indexing.
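To make the regex idea concrete, the pattern below is illustrative only (the exact FORBIDDEN_EXTENSIONS syntax depends on your phpdig version, so check the documentation): it matches any link on the site that is not under /dir1/, and a matching link would therefore be excluded.

```php
<?php
// Illustrative pattern, not a tested phpdig config value: match (and
// thereby exclude) any link on this host that is NOT under /dir1/.
$forbidden = '/^http:\/\/www\.domain\.com\/(?!dir1\/)/';

var_dump((bool) preg_match($forbidden, 'http://www.domain.com/dir2/page.html')); // bool(true), excluded
var_dump((bool) preg_match($forbidden, 'http://www.domain.com/dir1/sub1/'));     // bool(false), kept
```

The `(?!dir1\/)` negative lookahead is what inverts the match: the regex succeeds only when the path segment after the host is something other than dir1/.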
|
01-18-2005, 10:37 AM | #9 |
Green Mole
Join Date: Jan 2005
Posts: 9
|
Man, I cannot tell you how much I appreciate that you take the time to reply. I know you are busy. Thanks a ton.
I have been working on my problem all morning. The problem is that I want to get all the travel articles from some really huge sites, and I cannot afford to spider the whole site and then exclude directories with the "no way" symbol. I really need to make this work so that it only traverses "down" when a path is given. I have been going through your code all morning. I figure it will be easiest to set a variable (say $parent_path) when the first (level == 0) entry is made in the tempspider table. Then in spider.php, right before indexing on line 513, I could run this test when level > 0:
PHP Code:
if ($level == 0) {
    $parent_path = $temp_path;
}
if (isset($parent_path)) {
    $parent_path = preg_replace("/\//", "\/", $parent_path);
    $parent_match = "/^$parent_path/";
    if (!preg_match($parent_match, $temp_path)) {
        // if parent path not in current path, do not index
        $ok_for_index = 0;
    }
}
You know your code more intimately than I do. Can you give me some feedback on my proposed solution and whether this is a good way to do it? Thanks. T Last edited by td234; 01-18-2005 at 11:28 AM. |
01-18-2005, 11:40 AM | #10 |
Green Mole
Join Date: Jan 2005
Posts: 9
|
The above does not seem to work. Entries above the original directory still get put in the tempspider table. I will continue working and post again.
Last edited by td234; 01-18-2005 at 11:42 AM. |
01-18-2005, 11:46 AM | #11 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Nah, it's easier than that...
In robot_functions.php find: PHP Code:
PHP Code:
|
01-18-2005, 12:55 PM | #12 |
Green Mole
Join Date: Jan 2005
Posts: 9
|
YES! That is exactly what I needed.
Just so this thread has all the information needed for future reference, it should be noted that LIMIT_TO_DIRECTORY needs to be set to true for this to work correctly. I know you probably cannot make a change like that to the code after so many people are using it, but I think that is how most would expect it to function when passed a URL with a directory. A big improvement I think, but I am biased because that is what I needed. ;>) Thanks again for a great program. T |
02-08-2005, 02:31 PM | #13 |
Green Mole
Join Date: Feb 2005
Posts: 5
|
Why hasn't this tweak been included in the phpdig package? It's really great, but now every time I want to update phpdig I have to remember to apply the fix, so I will be visiting this page until the end of my life, hmmm...
|
02-08-2005, 02:37 PM | #14 |
Green Mole
Join Date: Jan 2005
Posts: 9
|
Sub-directories fixed
Actually, this fix has been added and is included in 1.8.8-rc1. Note the new variable in config.php called ALLOW_SUBDIRECTORIES. Set it to TRUE.
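phpdig's config.php defines its settings as PHP constants, so, assuming the new option follows the same convention (verify against your own copy of config.php), the relevant lines would look something like this config fragment:

```php
// In config.php: assumed syntax, mirroring how existing options are defined.
define('LIMIT_TO_DIRECTORY', true);    // never climb up out of the start directory
define('ALLOW_SUBDIRECTORIES', true);  // but do descend into its subdirectories
```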
|
02-08-2005, 03:09 PM | #15 | |
Green Mole
Join Date: Feb 2005
Posts: 5
|
Quote:
|