Keeping the spider in the search directory and its subdirs


ciaran@clissman
05-24-2004, 05:59 AM
Hi,
I typically want to spider just a subdirectory and its subdirs, so I don't want the spider to go up into the parent directory of the URL that I specify.

e.g. I want to index all of www.myplace.com/searchme

The starting point is www.myplace.com/searchme/index.html
I want all the other stuff in /searchme to be indexed, but I don't want www.myplace.com/donttouch, EVEN THOUGH there is a link from /searchme/index.html to /donttouch/index.html.

Is there a way to tell PHPdig not to 'go up' in the directory hierarchy?

Thanks a lot !:bang:

vinyl-junkie
05-24-2004, 06:17 AM
Welcome to the forum, ciaran@clissman. :D

In the includes/config.php file, find the following statement:
define('PHPDIG_IN_DOMAIN',false);
and replace it with this:
define('PHPDIG_IN_DOMAIN',true);

ciaran@clissman
05-24-2004, 06:42 AM
Thanks, Pat,

but it's not doing what I expect.

I ask it to index http://www.waterfordcity.ie/library/ballybricken.htm with a search depth of 3 and with
define('PHPDIG_IN_DOMAIN',true);


and the first few results are

SITE : http://www.waterfordcity.ie/
Exclude paths :
- @NONE@
1:http://www.waterfordcity.ie/library/ballybricken.htm
(time : 00:00:07)
+ + + + + +
level 1...
2:http://www.waterfordcity.ie/library/
(time : 00:00:20)

3:http://www.waterfordcity.ie/gallery.htm
(time : 00:00:28)
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + +
4:http://www.waterfordcity.ie/environment/index.htm
(time : 00:00:37)
+
5:http://www.waterfordcity.ie/planning/index.htm
(time : 00:00:46)
+ +

What I really want is for everything in http://www.waterfordcity.ie/library to be indexed, and nothing else.

Any thoughts ?

Thanks again !

Ciaran

vinyl-junkie
05-24-2004, 07:22 AM
Sorry, I misunderstood what you were asking. You would like for phpdig to stay in a specific directory when spidering, right? In that case, this thread (http://www.phpdig.net/showthread.php?s=&threadid=363&highlight=directory) has what you need.

ciaran@clissman
05-24-2004, 07:28 AM
Hmm, we're not there yet.

The sites I'm crawling aren't mine, so I can't put robots.txt files on them.

Is there not a function someplace that says
'if the directory of the page you are thinking about indexing is the parent directory of the page you started at, leave it alone (or not, depending on a config variable)'?

thanks again

Ciaran

vinyl-junkie
05-24-2004, 07:40 AM
To my knowledge, the method that I gave you is the only way you can have phpdig stay within the directory you specify. I'm not sure what new features may end up in version 1.8.1, but I know Charter is working on that right now. Perhaps he'll consider adding this as a feature. I know it's a subject that comes up fairly often around here.
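
For anyone who wants to see what the check asked about above might look like, here is a rough sketch of the logic in Python. It is not PhpDig code and does not use any PhpDig function or setting; the names start_directory and in_start_directory are made up for illustration. It assumes the rule is "same host, and the path sits under the directory of the start URL".

```python
from urllib.parse import urlsplit
import posixpath


def start_directory(start_url):
    """Return the directory part of the start URL's path, with a trailing slash."""
    path = urlsplit(start_url).path or "/"
    if path.endswith("/"):
        return path
    # Drop the file name, e.g. /library/ballybricken.htm -> /library/
    return posixpath.dirname(path).rstrip("/") + "/"


def in_start_directory(start_url, candidate_url):
    """True if candidate_url is on the same host and under the start directory."""
    start = urlsplit(start_url)
    cand = urlsplit(candidate_url)
    if cand.netloc.lower() != start.netloc.lower():
        return False
    # normpath resolves ../ segments so a link cannot climb out unnoticed
    cand_path = posixpath.normpath(cand.path or "/")
    # Appending a slash lets the start directory itself match as well
    return (cand_path + "/").startswith(start_directory(start_url))


if __name__ == "__main__":
    start = "http://www.waterfordcity.ie/library/ballybricken.htm"
    for url in (
        "http://www.waterfordcity.ie/library/",
        "http://www.waterfordcity.ie/gallery.htm",
        "http://www.waterfordcity.ie/planning/index.htm",
    ):
        print("keep" if in_start_directory(start, url) else "skip", url)
```

A spider would run a test like this on each link before queueing it. Comparing against the directory with its trailing slash means a sibling such as /librarystuff/ does not get mistaken for /library/.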

ciaran@clissman
05-24-2004, 07:44 AM
Good enough. Thanks for the tips !

Ciaran [sunny Dublin, Ireland, quarter to five in the afternoon]