Typical run times...


rayvd
10-09-2003, 07:19 AM
Yes, I know message boards have search features built into them... :)

Nevertheless, we have been setting up phpdig and, as a test, have had it spider several message boards hosted on our server. One such board has about 9000 posts and, I realize, probably a lot of links that loop round and round... I set recursion at 2 and let phpdig go. 16 hours later it was still at it!

Is this typical? What are some runtimes some of you have experienced, and on what size of a site? Not necessarily looking for other message board crawling times, just anything in general that I can compare against.

Since some of these sites are "our" sites and on a local connection, it might be prudent to remove the sleep(2) call in spider.php to speed things up...
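A rough way to judge how much that sleep(2) could matter is a back-of-the-envelope estimate. The sketch below is in Python rather than PHP and assumes the crawler sleeps once per fetched page, which is a guess about spider.php and not something confirmed here. The function names are made up for illustration.

```python
def sleep_overhead_hours(pages, sleep_seconds=2.0):
    """Hours spent purely sleeping, ASSUMING one sleep per fetched page."""
    return pages * sleep_seconds / 3600.0


def pages_for_runtime(hours, sleep_seconds=2.0):
    """Upper bound on pages fetched in `hours` if sleeping were the only cost."""
    return int(hours * 3600 / sleep_seconds)


if __name__ == "__main__":
    # If each of 9000 posts were fetched as its own page (an assumption):
    print(sleep_overhead_hours(9000))
    # Pages that a 16-hour run could cover if it did nothing but sleep:
    print(pages_for_runtime(16))
```

Under that assumption, 9000 pages cost about 5 hours of sleeping alone, so the sleep would explain a good part of a long run but probably not all 16 hours.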

Charter
10-09-2003, 04:43 PM
Hi. You might try using the PhpDig include and exclude comments for the header and footer, and if not already done, try running PhpDig from shell rather than from a browser.

Another idea, off the top of my head, would be to write a quick script to dump the post URLs to a file, and then just crawl that file at level one.
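A minimal sketch of such a script, in Python for brevity. The URL pattern (showthread.php?t=ID), the example host, and the function names are all hypothetical; the thread IDs would have to come from the board's own database or sitemap, which is not shown.

```python
def build_url_list(base_url, thread_ids, pattern="showthread.php?t={id}"):
    """Build one URL per thread ID, dropping duplicates but keeping order.

    The default pattern is an ASSUMPTION about the board's URL scheme.
    """
    base = base_url.rstrip("/") + "/"
    seen = set()
    urls = []
    for tid in thread_ids:
        url = base + pattern.format(id=tid)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def write_url_file(urls, path):
    """Write one URL per line so the file can be handed to the crawler."""
    with open(path, "w", encoding="utf-8") as fh:
        for url in urls:
            fh.write(url + "\n")


if __name__ == "__main__":
    # Hypothetical host and IDs, for illustration only.
    urls = build_url_list("http://forum.example.com", range(1, 6))
    write_url_file(urls, "urls.txt")
```

Since every post URL is listed explicitly, the crawler never has to follow the looping navigation links to find them.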

renehaentjens
11-19-2003, 02:17 PM
For those sites that take a looong time to index, it might be nice to have interruptible indexing (stop for a while, I'll tell you when to continue) - but that's a mod request and should be placed elsewhere I guess?

Anyway, if I can influence the design of the site to be indexed, what advice can I get from the gurus? What are the DOs and the DON'Ts for quick indexing? Where does PhpDig lose a lot of time when indexing sites? ...

rayvd
11-19-2003, 02:20 PM
What I ended up doing was breaking my list of URLs into 7 to 10 sublists and then starting a crawler for each of them.

Pseudo-threading!

Doesn't make any individual site crawl faster, but the whole gets completed quicker.
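A rough sketch of that approach, written in Python as a wrapper around the crawler. The command line in crawler_command (php -f spider.php with a URL list file as its argument) is an assumption about how the spider is started from shell, so adjust it to your install. The helper names are made up for illustration.

```python
import subprocess


def split_round_robin(urls, n):
    """Deal urls into at most n sublists of nearly equal size."""
    if n < 1:
        raise ValueError("n must be at least 1")
    n = min(n, len(urls)) or 1  # never create more sublists than URLs
    return [urls[i::n] for i in range(n)]


def crawler_command(list_file):
    # ASSUMPTION: the spider accepts a file of URLs as its argument.
    return ["php", "-f", "spider.php", list_file]


def start_crawlers(list_files):
    """Start one crawler process per list file and wait for all of them."""
    procs = [subprocess.Popen(crawler_command(f)) for f in list_files]
    for proc in procs:
        proc.wait()
    return [proc.returncode for proc in procs]


if __name__ == "__main__":
    with open("urls.txt", encoding="utf-8") as fh:
        all_urls = [line.strip() for line in fh if line.strip()]
    files = []
    for i, chunk in enumerate(split_round_robin(all_urls, 8)):
        name = "urls_%d.txt" % i
        with open(name, "w", encoding="utf-8") as out:
            out.write("\n".join(chunk) + "\n")
        files.append(name)
    start_crawlers(files)
```

Dealing the URLs out round-robin keeps the sublists close to the same size, so no single crawler is left with most of the work.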