Old 03-12-2004, 05:53 PM   #10
Charter
Head Mole
 
 
Join Date: May 2003
Posts: 2,539
Hi. I think I know what's going on. PhpDig wasn't originally written to crawl https, so the server's header redirect sends http to https, PhpDig does its thing and turns https back into http, the redirect sends that back to https, and so on. In other words, it's a loop.

If you search for http in the spider.php and robot_functions.php files, and maybe other files, you'll see that some code needs to change to account for https links. I'm not going to post a bunch of trial-and-error code now; instead I'll work on it for inclusion in another release.

In the meantime, try sending the PhpDig robot a 403 when https is encountered, based on user agent or something similar. Not tested, but if you want to crawl https sites one at a time via the command line, the following might be okay:
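For the 403 approach, here's a rough sketch of what I mean. It's untested, assumes you're running Apache with mod_rewrite enabled, and assumes the PhpDig robot sends "PhpDig" somewhere in its User-Agent string (check your access logs for the exact string before relying on this). It would go in the .htaccess of the https site:

Code:
# Return 403 Forbidden to the PhpDig crawler on the https side,
# so the loop dies instead of bouncing between http and https.
# Assumption: the robot's User-Agent contains "PhpDig".
RewriteEngine On
RewriteCond %{HTTPS} =on
RewriteCond %{HTTP_USER_AGENT} PhpDig [NC]
RewriteRule .* - [F]

On older Apache 1.3 setups where %{HTTPS} isn't available, matching %{SERVER_PORT} against 443 should do the same job.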

In spider.php change:
PHP Code:
if (ereg('^http://',$argv[1])) { 
to the following:
PHP Code:
if (ereg('^http[s]?://',$argv[1])) { 
and then use the following:
Code:
prompt> php -f spider.php https://www.domain.com
__________________
Responses are offered on a voluntary, as-time-allows basis; no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email. Thank you for your understanding.