06-11-2005, 06:52 AM   #1
Vertikal
Former Member
 
Join Date: Jun 2005
Posts: 6
Why aren't all links followed when indexing?

I have installed PhpDig on a site that has thousands of pages. Most articles span several pages, some of them dozens.

I have managed to get the spider to index most of the site by feeding it the root URL, and several times it has dutifully spent the better part of a couple of hours traversing the pages.

But... it does not get into all corners. There is no logic in the way that it omits pages. I have tried all the tricks I have found here:
- tweaking the config file
- selecting different depths and link numbers (and yes/no)
- using the update interface

But for some reason it still won't go into all corners.

If I then manually feed it one of the dead-end pages, it will index it and report "no links in temp table". Sometimes... At other times it will crawl the subpages. In the first case I then feed it one of the subpages for that article, and voila! It browses the subpage, goes to the level above, happily finds all the links and browses all the other subpages. Sometimes...
At other times it does not come beyond the subpage I enter.

The same happens when I feed it the sitemap, which should be an excellent place to start. The sitemap contains hundreds of links to publicly available pages, yet the spider crawls none of them.

There is no pattern in the behaviour of the spider - or not one I can see at least. The site is large and criss-cross connected with many links, and there should be ample opportunities for the spider to find branches to crawl.
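One way to look for a pattern would be to list the links on a page such as the sitemap and compare them with the URLs that actually ended up in the index. The following is only a minimal sketch in Python, not part of PhpDig: the names LinkCollector and missing_links, the example.com URLs and the sample data are all hypothetical, and it assumes the indexed URLs have already been exported to a plain set by some other means.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.add(absolute)


def missing_links(page_html, base_url, indexed_urls):
    """Return, sorted, the links on the page that are not in indexed_urls."""
    collector = LinkCollector(base_url)
    collector.feed(page_html)
    return sorted(collector.links - set(indexed_urls))


# Hypothetical sample data, for illustration only.
sample_html = '<a href="/a/1.html">one</a> <a href="/a/2.html#top">two</a>'
sample_indexed = {"http://example.com/a/1.html"}
print(missing_links(sample_html, "http://example.com/sitemap.html", sample_indexed))
```

If the omitted URLs share something (a directory, a query string, a particular file extension), that would at least hint at where to look.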

The spider always crawls nicely and ends with a proper result, updating all tables and letting me return to the previous page. I have no timeout problems or the like.

I would love to be able to just feed it the root and then have it crawl out into every branch of the site.

Why can't I get it to do that?

Eventually I will set up a cron job to do the task, and if it does not crawl everything, updates might not be caught.

The site is: Global FlyFisher and here is the search page.

Apart from this problem I love the product! Good job. I can of course enter all the URLs manually, which seems to work. But for an automated and very large site that is hell.

Regards

Martin

Last edited by Vertikal; 06-11-2005 at 06:55 AM.