Old 06-26-2005, 02:46 AM   #7
Vertikal
Former Member
 
Join Date: Jun 2005
Posts: 6
Problem (maybe) solved

Yesterday, while running the crawler on yet another page that had not been reached through the links on the site, I noticed that the crawler followed external links (we have very few of those) and started crawling someone else's site. I halted the crawler, since I don't want that, and started fiddling around with the config.php file.

I seemed to remember a setting for crawling external domains, but while browsing the file I noticed two things: the PHPDIG_IN_DOMAIN and ABSOLUTE_SCRIPT_PATH settings. I changed the first one to false (re. my www. problem as sketched above), and just to set things up properly, I also entered the complete path for the script. This had previously been left at the default value and never changed.
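For reference, both settings live in config.php as PHP defines. The change looked roughly like this (the path below is a placeholder, not my real path, and the comments reflect my reading of the settings, not official documentation):

```php
<?php
// config.php -- PhpDig configuration (excerpt)

// Restrict the spider to the exact host it started on,
// rather than treating related hosts as the same domain
// (my reading of this setting; see the comments in config.php).
define('PHPDIG_IN_DOMAIN', false);

// Full filesystem path to the PhpDig installation.
// '/home/user/public_html/phpdig' is a placeholder here --
// use the real absolute path on your server.
define('ABSOLUTE_SCRIPT_PATH', '/home/user/public_html/phpdig');
?>
```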

I saved and uploaded the config file, ran the script on the page again, and lo and behold: it started traversing the whole site and following every single link!

It ran for more than 5 hours, raising the number of indexed pages from 2,500 to close to 7,000 which seems a lot closer to the real number of pages on the site.

I just let the crawler loose on another new page; it's currently past its 1,000th page and seems to be on a new spree across the whole site and all its links. There are many duplicate pages, of course, but it finds a new one now and then.
I have set the timeout in the script to zero, and it seems I now get what I want: a complete and quick indexing process with all links followed.
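For what it's worth, if the timeout I zeroed is PHP's own execution limit rather than a PhpDig constant, the equivalent change near the top of the spider script would be something like this (placement and applicability are my assumption, not confirmed against the PhpDig source):

```php
<?php
// Let the crawl run indefinitely instead of being killed
// by PHP's default max_execution_time (usually 30 seconds).
// A value of 0 means "no limit".
set_time_limit(0);
?>
```

Note that set_time_limit() has no effect when PHP runs in safe mode, so on some shared hosts the config-file route may be the only option.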

I don't know whether the changes to config.php did this, but something has changed.

At least I'm closer to what I wanted in the first place.

PS: I never found that setting for the external links. I'll have to dig for that one.

Martin