PhpDig.net

03-06-2006, 04:40 PM   #1
ASHM
Exclude paths for spider

Hi, I just want to exclude certain paths from the spider and couldn't find anything in the documentation.

I came to the forum and signed up in the hope that the answer would appear but the only results were filled with hidden replies by admins.

Surely this is a basic function of PhpDig, and the method shouldn't be withheld from those whose organisations are too cheap to pay the $5 registration fee.

I'd hoped it might be as simple as adding a - before the URLs but this doesn't seem to work. I know I managed to do this before but for the life of me can't rediscover how to do it through the admin or config file.

For example I want to spider domain.org.au but exclude the following paths and URL matches:

- ashm.org.au/admin
- ashm.org.au/module.php?id=

Could someone who's not an admin please reply to this or send me an email?

Thanks in advance!
03-06-2006, 05:19 PM   #2
jimurl@montanai
excluding pages

That's a good question. I had the same question, and the same problem finding a response to it. I did run across this thread:

http://www.phpdig.net/forum/showthread.php?t=691

which included a link to yet another thread... but that link was broken.

But I also found a link to this thread:
http://www.phpdig.net/forum/showthread.php?t=1416

which, to cut to the chase, says to make a "robots.txt" file and put this in it:

User-agent: PhpDig
Disallow: /path/file1.php
Disallow: /path/file2.html

Then, when you index, it will skip those files. You put the robots.txt at the root level of your site.
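
If you want to sanity-check the rules before re-indexing, here is a minimal sketch using the robots.txt parser from Python's standard library. It only shows how a standards-following parser reads the file above; PhpDig's own robots.txt handling may differ in the details. The host example.org and the helper name is_allowed are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from the post above.
ROBOTS_TXT = """\
User-agent: PhpDig
Disallow: /path/file1.php
Disallow: /path/file2.html
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url.

    Hypothetical helper: this is Python's parser, not PhpDig's.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.org is a placeholder host, not a real PhpDig site.
for url in (
    "http://example.org/path/file1.php",
    "http://example.org/path/file2.html",
    "http://example.org/path/other.php",
):
    verdict = "allowed" if is_allowed(ROBOTS_TXT, "PhpDig", url) else "blocked"
    print(verdict, url)
```

Because the rules sit under "User-agent: PhpDig", other crawlers are not affected by them.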

I haven't yet actually re-indexed using this robots.txt file in place, but I bet it'll work... the guy at the link above says it will.

I already had the pages in my index which I wanted to exclude, or actually, just get rid of altogether. I played around and discovered a pair of MySQL DELETE statements that would do that. They go more or less like this:

select * from digengine where spider_id in (select spider_id from digspider where file like '%article.php?article_id=%' and path ='press/');

and

select spider_id from digspider where file like '%article.php?article_id=%' and path ='press/'

where "press/article.php" was the page that I wanted to remove from the search index. Also, you have to replace the "select..." with the appropriate "delete", but I would recommend playing with select first, to make sure you are getting rid of the right stuff. You have to use both statements, in that order, or you can really screw things up: the first one looks up its spider_id values in digspider, so those rows have to still be there when it runs. But I was ready to blow away my database and start over with re-indexing, had I really FUBARed things.
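
To show why the order matters, here is a rough sketch of the same two-step delete run against a throwaway in-memory SQLite database from Python. The table names digengine and digspider and the columns spider_id, file and path come from the statements above; everything else (the key_id column, the sample rows, the remove_page helper, and using SQLite instead of MySQL) is assumed for illustration and is not the real PhpDig schema.

```python
import sqlite3


def remove_page(conn, path, file_pattern):
    """Delete index rows for pages matching path and file_pattern.

    Hypothetical helper. Returns (engine_rows_deleted, spider_rows_deleted).
    """
    where = "file LIKE ? AND path = ?"
    params = (file_pattern, path)
    with conn:  # both deletes commit together, or roll back on error
        # Step 1: digengine rows, found through spider_ids still in digspider.
        cur = conn.execute(
            "DELETE FROM digengine WHERE spider_id IN "
            f"(SELECT spider_id FROM digspider WHERE {where})",
            params,
        )
        engine_rows = cur.rowcount
        # Step 2: only now remove the digspider rows themselves.
        cur = conn.execute(f"DELETE FROM digspider WHERE {where}", params)
        spider_rows = cur.rowcount
    return engine_rows, spider_rows


# Toy schema and data, assumed for the demo only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE digspider (spider_id INTEGER PRIMARY KEY, path TEXT, file TEXT);
CREATE TABLE digengine (spider_id INTEGER, key_id INTEGER);
INSERT INTO digspider VALUES (1, 'press/', 'article.php?article_id=7');
INSERT INTO digspider VALUES (2, 'news/', 'index.php');
INSERT INTO digengine VALUES (1, 10);
INSERT INTO digengine VALUES (1, 11);
INSERT INTO digengine VALUES (2, 12);
""")

deleted = remove_page(conn, "press/", "%article.php%")
remaining_engine = conn.execute("SELECT COUNT(*) FROM digengine").fetchone()[0]
remaining_spider = conn.execute("SELECT COUNT(*) FROM digspider").fetchone()[0]
print(deleted, remaining_engine, remaining_spider)
```

If the two steps were swapped, the subquery in step 1 would find no spider_id values, and the matching digengine rows would be left behind with nothing pointing at them. Back up the database before trying anything like this on a live index.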

I hope this helps.
03-06-2006, 05:32 PM   #3
ASHM
That's brilliant, Jim.

Many thanks. I was hoping there'd be a simple way to exclude them via the admin and I'm almost certain I found a thread somewhere on the Net which stated how it was done. But I'll be buggered if I can find it now.

Thanks for your help.