
View Full Version : Why aren't all links followed when indexing?


Vertikal
06-11-2005, 06:52 AM
I have installed PhpDig on a site that has thousands of pages; many articles run to dozens of pages each.

I have managed to get the spider to index most of the site by feeding it the root URL, and on several occasions it has dutifully spent the better part of two hours traversing the pages.

But... it does not get into all corners. There is no logic in the way that it omits pages. I have tried all the tricks I have found here:
- tweaking the config file
- selecting different depths and link numbers (and yes/no)
- using the update interface

But for some reason it still won't go into all corners.

If I then manually feed it one of the dead-end pages, it will index it and report "no links in temp table". Sometimes... At other times it will crawl the subpages. In the first case I then feed it one of the subpages for that article, and voila! It browses the subpage, goes to the level above, happily finds all the links and browses all the other subpages. Sometimes...
At other times it does not come beyond the subpage I enter.

When I feed it the sitemap, which should be an excellent place to start, the spider still crawls none of its links, even though the sitemap contains hundreds of links to publicly available pages.

There is no pattern in the behaviour of the spider - or not one I can see at least. The site is large and criss-cross connected with many links, and there should be ample opportunities for the spider to find branches to crawl.

The spider always crawls cleanly and ends with a proper result, updating all tables and letting me return to the previous page. I have no timeout problems or the like.

I would love just to be able to feed it the root and then have it crawl out into every branch of the site.

Why can't I get it to do that?

Eventually I will set up a cron job to do the task, and if it does not crawl everything, updates might not be caught.

The site is: Global FlyFisher (http://globalflyfisher.com/) and here is the search page (http://globalflyfisher.com/find/).

Apart from this problem I love the product! Good job. I can of course enter all URLs manually, which seems to work, but for a very large, automatically maintained site that is hell.

Regards

Martin

Charter
06-12-2005, 03:17 AM
Try this combination: set LIMIT_TO_DIRECTORY to false, set PHPDIG_IN_DOMAIN to true, set "search depth" to a large number, set "links per" to zero, choose the "no" option. Note that PhpDig follows links when indexing, but that it does not follow links in heavy JavaScript. Also, you might find this (http://www.phpdig.net/forum/showthread.php?t=1139) thread useful.
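For reference, the two config.php settings named here would look something like this; a sketch assuming PhpDig's define-style config.php, while "search depth", "links per" and the "no" option are chosen in the admin form, not in the file:

```php
// Sketch of the config.php constants mentioned above; check your own
// config.php for the exact names and defaults in your PhpDig version.
define('LIMIT_TO_DIRECTORY', false); // don't restrict crawling to the start directory
define('PHPDIG_IN_DOMAIN', true);    // follow links elsewhere within the same domain
```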

Vertikal
06-13-2005, 10:17 AM
Charter,

Thanks for your prompt reply. I had already tried your advice based on other threads.

I just rechecked the settings in my config-file, and they were set as you recommended. I selected 20 as the depth and zero links per and set the default values option to no.

I tried to index the page http://globalflyfisher.com/patterns/

The result was the usual:
-------------------------------
1:http://globalflyfisher.com/patterns/
(time : 00:00:01)

No link in temporary table
links found : 1
http://globalflyfisher.com/patterns/
Optimizing tables...
Indexing complete ! [Back] to admin interface.
-------------------------------

And that page has plenty of links and no fancy JavaScript or anything else to trip up the crawler. My robots.txt does not exclude the page, and if I try a random subpage linked from the page above, like http://globalflyfisher.com/patterns/branchu/, the crawler willingly crawls the links and subpages on that page...!

-------------------------------
1:http://globalflyfisher.com/patterns/branchu/
(time : 00:00:01)
+ + + + +
level 1...
2:http://globalflyfisher.com/patterns/branchu/pic.php?id=1517&caller=index
(time : 00:00:02)

3:http://globalflyfisher.com/patterns/branchu/pic.php?id=1516&caller=index
(time : 00:00:03)

4:http://globalflyfisher.com/patterns/branchu/pic.php?id=1515&caller=index
(time : 00:00:03)

5:http://globalflyfisher.com/patterns/branchu/all.php
(time : 00:00:04)
+ + + +
6:http://globalflyfisher.com/patterns/branchu/index.php
(time : 00:00:06)

level 2...
7:http://globalflyfisher.com/patterns/branchu/pic.php?id=1516&caller=all
(time : 00:00:07)

8:http://globalflyfisher.com/patterns/branchu/pic.php?id=1517&caller=all
(time : 00:00:08)

9:http://globalflyfisher.com/patterns/branchu/pic.php?id=1515&caller=all
(time : 00:00:09)

10:http://globalflyfisher.com/patterns/branchu/pic.php?id=1514&caller=all
(time : 00:00:09)

No link in temporary table
links found : 10
http://globalflyfisher.com/patterns/branchu/
http://globalflyfisher.com/patterns/branchu/pic.php?id=1517&caller=index
http://globalflyfisher.com/patterns/branchu/pic.php?id=1516&caller=index
http://globalflyfisher.com/patterns/branchu/pic.php?id=1515&caller=index
http://globalflyfisher.com/patterns/branchu/all.php
http://globalflyfisher.com/patterns/branchu/index.php
http://globalflyfisher.com/patterns/branchu/pic.php?id=1516&caller=all
http://globalflyfisher.com/patterns/branchu/pic.php?id=1517&caller=all
http://globalflyfisher.com/patterns/branchu/pic.php?id=1515&caller=all
http://globalflyfisher.com/patterns/branchu/pic.php?id=1514&caller=all
Optimizing tables...
Indexing complete ! [Back] to admin interface.
-------------------------------

I can't see any pattern in this behaviour.

I use the phpdigExclude and phpdigInclude tags to exclude head and footer, and that seems to work OK.
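For readers unfamiliar with these markers: the exclude/include tags are HTML comments placed around content the spider should skip, roughly like this (a sketch; only the marker names are taken from the post above):

```html
<!-- phpdigExclude -->
<div>Header, navigation, footer, etc. - skipped by the spider</div>
<!-- phpdigInclude -->
<p>Article body - indexed as usual.</p>
```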

Would it help me to clean out the whole search index and retry the process from scratch? I hate such solutions, which really don't explain anything, but I will try it if I can obtain the desired effect: a complete site indexing in one go.

Again: thanks for a great product. It does a very good job with the pages it has indexed.

Martin

Charter
06-13-2005, 08:17 PM
Just looked at some of your HTML (wasn't able to get to your site the other day) and you have a bunch of ../ relative links that I suspect are getting wiped by the phpdigRewriteUrl function, so those links don't get indexed. I'll run some tests and see what I can mod for such relative links.

Vertikal
06-14-2005, 03:04 AM
Charter,

All my sites use relative links (like '../directory/file.htm'), rarely root-relative ones (like '/directory/directory/file.htm'), and never full http:// links when they point to pages within the same domain.

The reason is transportability. Sometimes a whole subdir moves location before going online. The sites also need to be able to run in different environments - Windows for development and Linux when running live - and need to be able to move between hosts with different names, be it local or online.

In PHP you can always build the fully qualified URL from a relative one using the path functions in combination with the $_SERVER globals.
Browsers don't care in the least; they can follow all links as long as the end result is valid. I would recommend that a crawler do the same thing, if that is the problem.
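The browser behaviour described here is standard relative-reference resolution, which can be sketched in Python (not PhpDig code, just an illustration; the page URLs are taken from this thread):

```python
from urllib.parse import urljoin

# Resolve each href against the URL of the page it appeared on,
# exactly as a browser does.
base = "http://globalflyfisher.com/patterns/branchu/index.php"

print(urljoin(base, "../"))             # http://globalflyfisher.com/patterns/
print(urljoin(base, "pic.php?id=1515")) # http://globalflyfisher.com/patterns/branchu/pic.php?id=1515
print(urljoin(base, "/find/"))          # http://globalflyfisher.com/find/
```

A crawler that resolves links this way before de-duplicating them never needs the page author to use absolute URLs.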

Again: thanks for the effort you put in this.

Martin

Vertikal
06-16-2005, 02:43 PM
Charter,

I don't know whether this will help, but during my manual indexing of a lot of single pages on the site in question, something different happened.

The site is usually referred to as http://globalflyfisher.com/ with no www. But on one page there was a reference to some subpage with the http://www.globalflyfisher.com/ domain. And lo and behold if the crawler didn't start crawling all links it met on every single page... and it's still crawling as I write this.

Again I cannot see any reason for this, but I thought it might hint where to look for the explanation. Might it be because the www. site was "new and untouched"?

Martin

Vertikal
06-26-2005, 02:46 AM
Yesterday, when running the crawler on yet another page that had not been reached through links on other pages, I noticed that the crawler followed external links (we have very few of those) and started crawling someone else's site. I halted the crawler, since I don't want that, and started fiddling with the config.php file.

I seemed to remember a setting for crawling external domains, and while browsing the file I noted two settings: PHPDIG_IN_DOMAIN and ABSOLUTE_SCRIPT_PATH. I changed the first one to false (re. my www. problem as sketched above), and just to set things up properly, I also entered the complete path for the script; that setting had been left at its default value and never changed before.
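The two changes described would look roughly like this in config.php (a sketch; the path value is a made-up placeholder, not the actual server path):

```php
// The two settings changed, per the post above; exact defaults may differ.
define('PHPDIG_IN_DOMAIN', false); // set to false because of the www. issue described earlier
define('ABSOLUTE_SCRIPT_PATH', '/home/example/public_html/phpdig'); // full path instead of the default
```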

I saved and uploaded the config file, ran the script on the page again, and wouldn't you know, it started traversing the whole site and following every single link!

It ran for more than 5 hours, raising the number of indexed pages from 2,500 to close to 7,000, which seems a lot closer to the real number of pages on the site.

I just let the crawler loose on another new page; it is currently passing its thousandth page and seems to be on a new spree across the whole site and all links. It hits many duplicate pages of course, but it turns up a new one now and then.
I have set the timeout in the script to zero, and it seems that I now get what I want: a complete and quick indexing process with all links followed.

I don't know whether the changes to config.php did this, but something has changed.

At least I'm closer to what I wanted in the first place.

PS: I never found that setting for the external links. I'll have to dig for that one.

Martin