Old 10-11-2003, 09:21 AM   #1
alivin70
Orange Mole
Join Date: Sep 2003
Posts: 40
New feature proposal: targeted indexing

JyGius and I are working on a new feature.
I would like some feedback from other developers before we start.

We want to re-index a big website very often, so we want to introduce a trick to dramatically reduce crawling.
It's not possible to use the modification date of the files to select the modified pages, because most of the pages are dynamic.

For example, I can have news pages generated from an ID:
news.php?nid=10001
news.php?nid=10002
news.php?nid=10003
news.php?nid=10004
news.php?nid=10005
.....
news.php?nid=20000

but only the last 4 have been modified since the last visit.

How can the crawler know that?

Our idea is to add a directive to robots.txt containing the URL of a text file that lists the modified/created pages with their timestamps.
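In robots.txt it might look something like this (the directive name and the file name are only suggestions, not an existing standard):

Change-list: http://www.mysite.com/phpdig_changes.txt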
The change file itself would contain lines like:
1056987466 news.php?nid=20001
1056987853 news.php?nid=20002
1056988465 news.php?nid=20003
1056995765 news.php?nid=20004

So PhpDig reads that directive, loads the text file, parses it and digs only the pages modified after the last visit, without following links.
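To make the idea concrete, here is a rough sketch (not the real PhpDig code, just an illustration in plain PHP) of how the crawler could read the change file and keep only the entries newer than its last visit:

<?php
// Sketch only: fetch the change file and return the URLs of the pages
// modified after $last_visit (a Unix timestamp stored by the crawler).
function read_change_list($list_url, $last_visit)
{
    $modified = array();
    $lines = @file($list_url);          // needs allow_url_fopen
    if (!is_array($lines)) {
        return $modified;               // no list: fall back to a normal crawl
    }
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line == '') {
            continue;
        }
        // each line is "<timestamp> <relative url>"
        $parts = explode(' ', $line, 2);
        if (count($parts) != 2) {
            continue;                   // skip malformed lines
        }
        if ((int) $parts[0] > $last_visit) {
            $modified[] = $parts[1];
        }
    }
    return $modified;
}
?>

PhpDig would then index only the returned URLs instead of spidering the whole site.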

The text file must be created and maintained by the website software. Obviously this applies to portals that are totally database driven.
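On the website side this could be as simple as appending one line whenever the portal saves a page. A minimal sketch (the file path and function name are made up):

<?php
// Sketch only: call this from the portal whenever a news item is
// created or modified, so the change file stays up to date.
function log_page_change($changes_file, $relative_url)
{
    $fp = fopen($changes_file, 'a');
    if ($fp) {
        fwrite($fp, time() . ' ' . $relative_url . "\n");
        fclose($fp);
    }
}

// e.g. after saving news item 20004:
log_page_change('/var/www/html/phpdig_changes.txt', 'news.php?nid=20004');
?>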

If robots.txt doesn't contain that directive, PhpDig can crawl the site as usual.



If you have any ideas, please post them here.

Alivin70