Break the depth limit of 20?


WebSpider
02-07-2005, 01:10 AM
Is the depth limit of 20 a script limitation, a resource limitation, or some sort of loop avoidance?

I ask because I tried to spider a directory where each new page of results is considered a new level, and there are categories with more than 20 pages.

Can we break this limit somehow?

Thanks!

Charter
02-07-2005, 02:32 AM
Just change it in the config file:

define('SPIDER_MAX_LIMIT',20); // max (re)index search depth - used for shell and admin panel dropdown
define('RESPIDER_LIMIT',5); // max update search depth - only used for browser, not used for shell

define('LINKS_MAX_LIMIT',20); // max (re)index links per - used for shell and admin panel dropdown
define('RELINKS_LIMIT',5); // max update links per - only used for browser, not used for shell
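
As a minimal sketch of the change described above, raising the maximum depth could look like the fragment below. The value 60 is only an illustration, and the comment wording is an assumption rather than text from the shipped config file; the constant name is the one quoted above.

```php
<?php
// Sketch only: raise the maximum (re)index search depth from 20 to 60.
// Assumption: per the comment above, this value also caps the depth
// dropdown in the admin panel.
define('SPIDER_MAX_LIMIT', 60);
```

The other three limits can stay at their existing values unless the update depth or links-per limits also need to change.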

WebSpider
02-07-2005, 06:47 AM
Thanks a bunch, Charter!

On a side note, are you the only developer behind PhpDig? Do you take donations?

Charter
02-07-2005, 12:16 PM
Antoine was the previous developer and released everything from the initial version through v.1.6.2; I have been the developer since then. There have also been contributions posted in the forums and/or listed in the CREDITS, CHANGELOG, and README files. Some history about the change in developers can be found here (http://www.phpdig.net/navigation.php?action=news).

WebSpider
02-08-2005, 02:19 PM
Thanks.

I changed the depth limit to 60 and am now trying to rerun the spider over the same domain so that it adds the links beyond the initial 20 hops that were not spidered before. However, it spiders only the very first page and then stops.

Ideas?

Charter
02-08-2005, 06:06 PM
Check the values in the update sites table via the admin panel.

WebSpider
02-08-2005, 11:11 PM
They match what I set: depth 60 and links 0 (i.e. all).

Charter
02-09-2005, 11:30 AM
Some thoughts...

- Try using the textbox, 60, 0, no.
- View the robots.txt file for changes.
- Look for meta revisit-after/robots tags.
- Enter the site at a different location.

WebSpider
02-09-2005, 02:18 PM
- Used both the text box and the combo box.
- No robots.txt present.
- No revisit-after tags in the code.
- That's the only thing left for me to try. However, does it make sense to index both www.domain.com and domain.com when they are the same thing 99% of the time? Shouldn't this be handled in the digger's code, even if only as a switch?

Charter
02-09-2005, 02:21 PM
Set PHPDIG_IN_DOMAIN to true in the config file.
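
A minimal sketch of that setting, assuming it is declared with define() like the other constants in the config file:

```php
<?php
// Sketch only: enable the in-domain switch suggested above.
// Assumption: with this set to true, the spider may follow links between
// hosts of the same domain (e.g. www.domain.com and domain.com) instead of
// treating them as unrelated sites.
define('PHPDIG_IN_DOMAIN', true);
```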