PhpDig.net - View Single Post - Exclude filenames with certain attributes?

the_hut2 · 12-30-2004, 07:32 AM

All installed and spidering nicely. However, the directory in which the content resides contains a mail archive of 8000 messages, each of which has its own .html file. It also contains an index for every 15 messages, and each index has its own .html file too.

If the files are in the form:

http://www.mydomain.com/message001.html
http://www.mydomain.com/message002.html
http://www.mydomain.com/message003.html
etc

and

http://www.mydomain.com/index001.html
http://www.mydomain.com/index002.html

is it possible to restrict the spider so that any file with the characters "index" in its filename is ignored or, alternatively, so that it only spiders files whose names start with the characters "message"?
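To make the rule I am after concrete, here is a minimal sketch in Python of the filter I have in mind. It is not PhpDig code: the function name `should_index` and both patterns are hypothetical, based only on the example filenames above.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns derived from the example filenames in this post.
INCLUDE = re.compile(r"^message")   # filenames that start with "message"
EXCLUDE = re.compile(r"index")      # filenames that contain "index"


def should_index(url: str) -> bool:
    """Return True if the file behind `url` should be spidered."""
    # Take the last path segment, e.g. "message001.html".
    filename = urlsplit(url).path.rsplit("/", 1)[-1]
    if EXCLUDE.search(filename):
        return False
    return INCLUDE.match(filename) is not None


urls = [
    "http://www.mydomain.com/message001.html",
    "http://www.mydomain.com/message002.html",
    "http://www.mydomain.com/index001.html",
]
kept = [u for u in urls if should_index(u)]
```

With the three sample URLs, `kept` holds only the two message files. Either test would do on its own here; the sketch applies both just to show the two alternatives side by side.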

I had hoped that robots.txt would have been the answer, but you cannot use wildcards to specify exclusions. Listing the path of every file beginning with "index" is not an option (there are over 150 of them), and in any case the content is updated every day.
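One thing that may be worth checking: in the original robots.txt convention, a Disallow value is matched as a path prefix, so a single rule might cover every index file without wildcards. The sketch below uses Python's standard `urllib.robotparser` to show that behaviour; the rule itself is an assumption based on the filenames above, and whether the PhpDig spider treats robots.txt the same way is not something I have confirmed.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: one prefix rule instead of 150+ explicit paths.
ROBOTS_TXT = """\
User-agent: *
Disallow: /index
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str) -> bool:
    """Return True if a generic crawler may fetch `url` under ROBOTS_TXT."""
    return parser.can_fetch("*", url)


index_ok = allowed("http://www.mydomain.com/index001.html")
message_ok = allowed("http://www.mydomain.com/message001.html")
```

Under this parser, `index_ok` is False and `message_ok` is True. Note that the same prefix would also block /index.html, so the rule only fits if nothing else under that prefix needs to be spidered.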

Any help MUCH appreciated
