#1
Green Mole
Join Date: Mar 2006
Posts: 2
Hi, I just want to exclude certain paths from the spider and couldn't find anything in the documentation.
I came to the forum and signed up in the hope that the answer would appear, but the only results I found were threads whose replies had been hidden by the admins. Surely this is a basic function of PHPDig, and the method shouldn't be withheld from those whose organisations are too cheap to pay the $5 registration fee. I'd hoped it might be as simple as adding a - before the URLs, but that doesn't seem to work. I know I managed to do this before, but for the life of me I can't rediscover how to do it through the admin or the config file.

For example, I want to spider domain.org.au but exclude the following paths and URL matches:

- ashm.org.au/admin
- ashm.org.au/module.php?id=

Could someone who's not an admin please reply to this or send me an email? Thanks in advance!
#2
Green Mole
Join Date: Feb 2006
Posts: 1
excluding pages
That's a good question. I had the same question, and the same problem finding an answer to it. I did run across this thread:
http://www.phpdig.net/forum/showthread.php?t=691 which included a link to yet another thread... but that link was broken. But I also found a link to this thread: http://www.phpdig.net/forum/showthread.php?t=1416 which, to cut to the chase, says that if you make a robots.txt file and put this in it:

```
User-agent: PhpDig
Disallow: /path/file1.php
Disallow: /path/file2.html
```

then, when you index, it will skip those files. You put the robots.txt at the root level of your site. I haven't yet actually re-indexed with this robots.txt file in place, but I bet it'll work; the guy at the link above says it will.

I already had in my index the pages I wanted to exclude, or rather, to get rid of altogether. I played around and discovered a pair of MySQL DELETE statements that would do that. It goes more or less like this:

```sql
select * from digengine where spider_id in
  (select spider_id from digspider
   where file like '%article.php?article_id=%' and path = 'press/');
```

and

```sql
select spider_id from digspider
where file like '%article.php' and path = 'press/';
```

where "press/article.php" was the page that I wanted to remove from the search index. You have to replace the "select ..." with the appropriate "delete ...", but I would recommend playing with the selects first, to make sure you are getting rid of the right stuff. You have to use both statements, in that order, or you can really screw things up. But I was ready to blow away my database and start over with re-indexing, had I really FUBARed things. I hope this helps.
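To map that back onto the paths in the first post: this is just a sketch, and it assumes PhpDig follows the standard prefix matching of the robots exclusion protocol. I haven't tested whether it honours query strings in Disallow lines, so disallowing the script path itself (which prefix-matches every ?id= variant) seems the safer bet. A robots.txt at the root of ashm.org.au would look something like:

```
User-agent: PhpDig
Disallow: /admin
Disallow: /module.php
```

And for the database route, here are the delete forms of the two statements above, in the order described: the digengine rows that point at the spider entries go first, then the digspider rows themselves. Table and column names are taken straight from the selects above, so adjust them if your install names its tables differently. Note the two like patterns differ slightly, exactly as in the original statements; you may want to make them match so both statements hit the same pages.

```sql
-- 1. remove the engine rows referencing the pages to be dropped
delete from digengine where spider_id in
  (select spider_id from digspider
   where file like '%article.php?article_id=%' and path = 'press/');

-- 2. then remove the spider entries themselves
delete from digspider
where file like '%article.php' and path = 'press/';
```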
#3
Green Mole
Join Date: Mar 2006
Posts: 2
That's brilliant, Jim.
Many thanks. I was hoping there'd be a simple way to exclude them via the admin, and I'm almost certain I once found a thread somewhere on the net that explained how it was done, but I'll be buggered if I can find it now. Thanks for your help.
| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| exclude filenames | felyx | Troubleshooting | 0 | 11-20-2006 09:29 PM |
| Can't exclude few pages | mleray | Troubleshooting | 2 | 11-19-2004 12:25 AM |
| Exclude paths : -'*' -@NONE@ | BootsWalker | Troubleshooting | 2 | 10-20-2004 06:12 PM |
| exclude metatags | tomas | How-to Forum | 5 | 08-15-2004 03:22 PM |
| exclude after spidering | baskamer | Troubleshooting | 2 | 03-01-2004 02:17 AM |