
Exclude paths for spider


ASHM
03-06-2006, 04:40 PM
Hi, I just want to exclude certain paths from the spider and couldn't find anything in the documentation.

I came to the forum and signed up hoping the answer would be here, but the only search results I found were full of replies hidden by admins.

Surely this is a basic function of PHPDig, and the method shouldn't be withheld from those whose organisations are too cheap to pay the $5 registration fee.

I'd hoped it might be as simple as adding a - before the URLs but this doesn't seem to work. I know I managed to do this before but for the life of me can't rediscover how to do it through the admin or config file.

For example I want to spider domain.org.au but exclude the following paths and URL matches:

- ashm.org.au/admin
- ashm.org.au/module.php?id=

Could someone who's not an admin please reply to this or send me an email?

Thanks in advance!

jimurl@montanai
03-06-2006, 05:19 PM
That's a good question. I had the same question, and the same trouble finding an answer to it. I did run across this thread:

http://www.phpdig.net/forum/showthread.php?t=691

which included a link to yet another thread... but that link was broken.

But I also found a link to this thread:
http://www.phpdig.net/forum/showthread.php?t=1416

which, to cut to the chase, says: if you make a "robots.txt" file and put this in it:

User-agent: PhpDig
Disallow: /path/file1.php
Disallow: /path/file2.html

Then, when you index, it will skip those files. You put the robots.txt at the root level of your site.

I haven't yet actually re-indexed using this robots.txt file in place, but I bet it'll work... the guy at the link above says it will.
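Applied to the paths from the first post, a robots.txt at the root of the site might look like this (just a sketch: robots.txt Disallow rules are plain prefix matches, so `Disallow: /module.php` should also catch module.php?id=... URLs, but I haven't verified exactly how PhpDig interprets them):

```
User-agent: PhpDig
Disallow: /admin
Disallow: /module.php
```

Note that blocking /module.php by prefix also blocks the bare page with no query string; if you only want to exclude the parameterised URLs, robots.txt alone may not be fine-grained enough.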

I already had the pages I wanted to exclude (or rather, get rid of altogether) in my index. I played around and discovered a pair of MySQL DELETE statements that would do that. They go more or less like this:

select * from digengine where spider_id in (select spider_id from digspider where file like '%article.php?article_id=%' and path ='press/');

and

select spider_id from digspider where file like '%article.php' and path ='press/'

where "press/article.php" was the page that I wanted to remove from the search index. Also, you have to replace the "select..." with the appropriate "delete", but I would recommend playing with select first, to make sure you are getting rid of the right stuff. You have to use both statements, in that order, or you can can really screw things up. But, I was ready to blow away my database and start over with re-indexing, had I really FUBARed things.
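Spelling that out, the DELETE versions of the two statements would look something like this (a sketch only: it keeps Jim's table and column names as given, and uses the same LIKE pattern in both statements so the two deletes match the same set of pages — run the SELECT versions first to check, and delete from digengine before digspider so you don't strand rows):

```sql
-- 1. Remove the index entries that reference the matching pages.
delete from digengine
where spider_id in (
    select spider_id from digspider
    where file like '%article.php?article_id=%' and path = 'press/'
);

-- 2. Then remove the page records themselves.
delete from digspider
where file like '%article.php?article_id=%' and path = 'press/';
```

The IN-subquery form needs MySQL 4.1 or later; on older versions you'd have to collect the spider_id values from the SELECT by hand and plug them into the first DELETE.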

I hope this helps.

ASHM
03-06-2006, 05:32 PM
That's brilliant, Jim.

Many thanks. I was hoping there'd be a simple way to exclude them via the admin and I'm almost certain I found a thread somewhere on the Net which stated how it was done. But I'll be buggered if I can find it now.

Thanks for your help.