PhpDig.net

01-17-2005, 03:38 PM   #1
td234
Subdirectories but not higher directories

I am spidering a site that I have no control over. I would like to spider a single directory on that site (say, /dir1). I discovered that I must set LIMIT_TO_DIRECTORY to false because under /dir1 the content is in subdirectories. The problem is that with LIMIT_TO_DIRECTORY set to false, the spider also climbs out of the current directory (/dir1). Is there a way to start in one directory and traverse it thoroughly (including subdirectories) without climbing out?

Thanks for a great tool!

T

01-17-2005, 04:59 PM   #2
vinyl-junkie
Have a look at this thread:
http://www.phpdig.net/forum/showthread.php?t=363

01-17-2005, 08:01 PM   #3
td234
Yes, I read that when I was researching an answer to my question. Since I do not have control over the site I am spidering, I cannot use the robots.txt suggestion, and the rest of that thread was about getting search results that only contain a certain directory. I am trying to limit my crawl to a specified directory (and its subdirectories). I guess I am a little surprised that LIMIT_TO_DIRECTORY allows the crawler to crawl out of a directory. Doesn't this defeat adding a path to the URL to crawl?

01-18-2005, 03:41 AM   #4
Charter
You need to set LIMIT_TO_DIRECTORY back to true and then enter links in the textbox like so:
Code:
http://www.domain.com/dir1/
http://www.domain.com/dir1/sub1/
http://www.domain.com/dir1/sub2/
With LIMIT_TO_DIRECTORY set to true, PhpDig indexes that (sub)directory only; no (sub)directories of (sub)directories are indexed.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.

01-18-2005, 05:34 AM   #5
td234
Thanks Charter. My dilemma with your suggestion is that I do not know all the sub-directories, so I cannot enter them manually. They could also change, go away, or be added, so I really need the crawling feature. I think my only solution is to modify the code to crawl only links that contain the original path. That way I will only crawl down, not up. This should be a simple comparison.

Is it only me, or does anyone else feel that crawling up after being given a path is unexpected behavior?

T

01-18-2005, 05:40 AM   #6
Charter
It only crawls up if LIMIT_TO_DIRECTORY is set to false. Look at FORBIDDEN_EXTENSIONS in section 4.3 of the documentation and limit that way instead.

01-18-2005, 06:20 AM   #7
td234
Correct me if I am wrong (which is quite possible), but it does not crawl down into subdirectories unless LIMIT_TO_DIRECTORY is set to false. I have changed it back to true and it does not find any results because the links go to sub-directories.

Are you suggesting adding "\.\." to FORBIDDEN_EXTENSIONS? I am not sure that would work on full URLs.

Thanks for taking the time to respond to my posts!

T

01-18-2005, 08:59 AM   #8
Charter
LIMIT_TO_DIRECTORY does just that: it limits to that directory, no up, no down, just within that directory. Set it to true to stay within a directory, set it to false to traverse up and down. You can use FORBIDDEN_EXTENSIONS by setting a regex to exclude links, depending on the regex, but a double dot is not sufficient. If the link matches what's in FORBIDDEN_EXTENSIONS, then it won't be indexed. You can also choose a site, click the update button, and then click the no way symbol to delete a directory and exclude it from future indexing.
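
To illustrate the regex-exclusion idea, here is a minimal sketch in plain PHP rather than PhpDig's own code. The helper name is_excluded, the two patterns, and the URLs are all made up for the example; how PhpDig applies FORBIDDEN_EXTENSIONS internally should be checked against the documentation. It shows why a double dot alone is not enough when links are full URLs.

```php
<?php
// Hypothetical helper, not PhpDig code: true when $url matches the exclusion regex.
function is_excluded(string $url, string $pattern): bool
{
    return preg_match($pattern, $url) === 1;
}

// Assumed example patterns and URLs, for illustration only.
$doubleDot = '#\.\.#';                              // a literal ".."
$otherDir  = '#^http://www\.domain\.com/other/#i';  // one named directory

$urls = [
    'http://www.domain.com/dir1/sub1/page.html',
    'http://www.domain.com/other/page.html',
];

foreach ($urls as $url) {
    printf(
        "%s  double-dot: %s  other-dir: %s\n",
        $url,
        is_excluded($url, $doubleDot) ? 'excluded' : 'kept',
        is_excluded($url, $otherDir) ? 'excluded' : 'kept'
    );
}
```

The double-dot pattern matches neither URL, because a fully resolved link that points above /dir1/ contains no "..". A pattern naming the unwanted directory does match, but that only helps when the directories to exclude are known in advance.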

01-18-2005, 10:37 AM   #9
td234
Man, I can not tell you how much I appreciate that you take the time to reply. I know you are busy. Thanks a ton.

I have been working on my problem all morning. I want to get all the travel articles from some really huge sites, and I cannot afford to spider each whole site and then exclude directories with the "no way" symbol. I really need to make this work so that it only traverses "down" if a path is given.

I have been traversing your code all morning. I am figuring that it will be easiest to set a variable (say $parent_path) when the first (level==0) entry is made in the tempspider table. Then in spider.php right before indexing on line 513 I could do this test if level>0.

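if ($level == 0) {
    // Remember the starting path from the first (level 0) entry.
    $parent_path = $temp_path;
}
if (isset($parent_path)) {
    // Escape regex metacharacters (including the "/" delimiter) without changing $parent_path itself.
    $parent_match = '/^' . preg_quote($parent_path, '/') . '/';
    if (!preg_match($parent_match, $temp_path)) { // parent path is not a prefix of the current path: do not index
        $ok_for_index = 0;
    }
}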

You know your code more intimately than I do. Can you give me some feedback on my proposed solution and whether this is a good way to do it?

Thanks.

T

Last edited by td234; 01-18-2005 at 11:28 AM.

01-18-2005, 11:40 AM   #10
td234
The above does not seem to work. Entries still get put in tempspider that are above the original directory. Will continue working and post again.

Last edited by td234; 01-18-2005 at 11:42 AM.

01-18-2005, 11:46 AM   #11
Charter
Nah, it's easier than that...

In robot_functions.php find:
PHP Code:
if($link['path'] != $add_include['in_path']) {
    $link['ok'] = 0;

And replace with:
PHP Code:
if(!eregi("^".$add_include['in_path'],$link['path'])) {
    $link['ok'] = 0;
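
A note for anyone applying this on a newer PHP: eregi() was deprecated in PHP 5.3 and removed in PHP 7, so the replacement line above only runs on older versions. The sketch below shows an equivalent case-insensitive prefix test in plain PHP. The helper name path_is_within and the example paths are made up for illustration and are not part of PhpDig. It also treats the start path literally instead of as a regular expression, which differs slightly from the eregi() version.

```php
<?php
// Hypothetical helper, not part of PhpDig: true when $linkPath starts with
// $startPath, compared case-insensitively like eregi("^" . $startPath, ...).
function path_is_within(string $startPath, string $linkPath): bool
{
    return strncasecmp($linkPath, $startPath, strlen($startPath)) === 0;
}

// Assumed example paths; the exact path format PhpDig stores may differ.
var_dump(path_is_within('dir1/', 'dir1/sub1/')); // subdirectory of the start path
var_dump(path_is_within('dir1/', 'dir2/'));      // sibling directory
var_dump(path_is_within('dir1/', ''));           // site root, above the start path
```

In the patched condition this would read if(!path_is_within($add_include['in_path'], $link['path'])). Keeping the trailing slash on the start path matters: without it, dir1 would also match dir10/.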

01-18-2005, 12:55 PM   #12
td234
YES! That is exactly what I needed.

Just so this thread has all the information needed for future reference, I think it should be noted that LIMIT_TO_DIRECTORY needs to be set to true for this to work correctly.

I know you probably cannot make a change like that to the code after so many people are using it, but I think that is how most would expect it to function if passed a URL with a directory. Big improvement I think, but I am biased because that is what I needed. ;>)

Thanks again for a great program.

T

02-08-2005, 02:31 PM   #13
chris614164

Why has this tweak not been included in the PhpDig package? It's really great, but now every time I want to update PhpDig I have to remember to apply the fix, so I will visit this page until the end of my life, hmmm...

02-08-2005, 02:37 PM   #14
td234
Sub-directories fixed

Actually, this fix has been added and is included in 1.8.8-rc1. Notice a new variable in the config.php called ALLOW_SUBDIRECTORIES. Set it to TRUE.
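
For reference, the settings discussed in this thread might look like this in config.php. This is only a sketch: it assumes both constants are set with PHP's define(), and that LIMIT_TO_DIRECTORY stays true alongside the new option, as it did for the manual patch earlier in the thread. Check the comments in your own config.php for the exact semantics in your version.

```php
<?php
// Assumed config.php fragment (PhpDig 1.8.8-rc1 or later); verify against your copy.
define('LIMIT_TO_DIRECTORY', true);   // stay within the directory of the URL entered
define('ALLOW_SUBDIRECTORIES', true); // also follow links into its subdirectories
```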

02-08-2005, 03:09 PM   #15
chris614164

Quote:
Originally Posted by td234
Actually, this fix has been added and is included in 1.8.8-rc1. Notice a new variable in the config.php called ALLOW_SUBDIRECTORIES. Set it to TRUE.
OK great, sorry, I have not downloaded the new one, but I will as soon as it is stable, because this tool really rocks.