View Full Version : Indexing cookie/session authenticated pages

01-08-2004, 05:34 PM

We have installed PhpDig 1.6.5 and are facing a problem indexing authenticated pages on our site (more than half of our pages use cookie-based authentication). The indexer only ends up accessing the publicly available pages.

We've looked through the code for spider.php and robot_functions.php and found many references to cookie related functions (such as phpDigMakeCookies()), but haven't been able to enable them.

Is there any documentation for this? Or could someone provide the steps for supplying cookie information?

Some background info: Users are authenticated on our site using only a username/password combination provided through a login form on the pages. No pages are .htaccess protected.


01-09-2004, 03:51 AM
Hi. The functions send HEAD and GET requests. You can see an example in this (http://www.phpdig.net/showthread.php?threadid=360&perpage=15&pagenumber=2) thread.

Basically the HEAD requests check status and the GET requests grab content. There is nothing in these functions to be turned on so PhpDig can crawl authenticated pages.

One thing you might try in the authenticated pages is adding a check for PhpDig. If PhpDig, show content from pages normally needing authentication, if not PhpDig, require user to authenticate.

If the authenticated pages are using PHP, you may find the list of reserved variables here (http://www.php.net/reserved.variables) useful.
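
One way to sketch that check in a protected page (a minimal sketch only; the helper name, session key, and login URL below are illustrative placeholders, not part of PhpDig itself):

```php
<?php
// Minimal sketch of the user-agent check suggested above. The function name
// and the 'authenticated' session key are placeholders, not a PhpDig API.
function isPhpDigCrawler()
{
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    // The crawler identifies itself with an agent string starting "PhpDig",
    // so anchor the match at position 0 with a strict comparison
    // (strpos() returns 0 here, which a loose test would treat as false).
    return strpos($agent, 'PhpDig') === 0;
}

// In a protected page the gate might then look like:
// session_start();
// if (!isPhpDigCrawler() && empty($_SESSION['authenticated'])) {
//     header('Location: /login.php'); // placeholder login URL
//     exit;
// }
```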

01-09-2004, 01:17 PM
Thanks for the tip Charter, your idea works!

Checking for the user agent is the way we have chosen to go.

06-26-2004, 12:33 AM
Hi Charter,

I am having the same problems - I want to test for the PhpDig spider in the USER_AGENT header, but I have no idea what agent it will be. Any ideas?

06-26-2004, 08:44 AM
Originally posted by bforsyth
Hi Charter,

I am having the same problems - I want to test for the PhpDig spider in the USER_AGENT header, but I have no idea what agent it will be. Any ideas?

The user agent name is PhpDig.

06-27-2004, 07:24 AM
Thanks Pat - do you think that this code will return true then:

if (strpos($_SERVER["HTTP_USER_AGENT"], "PhpDig") !== false) {
    //set session information
}
06-27-2004, 07:56 AM
Yes, something like that will work. One note: the spider's agent string includes a version and URL (PhpDig/x.x.x (+http://www.phpdig.net/robot.php)), so test against the "PhpDig" prefix rather than for exact equality:

if (strpos($_SERVER["HTTP_USER_AGENT"], "PhpDig") === 0) {
    //set session information
}

BTW, I forgot to welcome you to the forum. Thanks for joining us! :D

06-27-2004, 08:41 AM
Thanks for the welcome. Testing for the user agent doesn't seem to be working (although it is hard to tell, as I don't currently have access to the logs).

Is there any other way that I could tell if a script is being called by PhpDig?

Also, when the spider does a crawl, it seems to dismiss dynamically generated pages that differ only in their ids:

eg: ?page=articleView&articleId=250
giving "Duplicate of an existing document"

There are several thousand of these articles and it looks like none of them are being indexed....

06-27-2004, 09:18 AM
There are several threads here in the forum about problems similar to yours, like this one (http://www.phpdig.net/showthread.php?s=&threadid=712&highlight=numbers), for example. If that one doesn't provide some clues, just search the forum for any thread with the word "numbers" in it, and you'll probably find something that will help.

08-17-2004, 02:23 PM
I'm currently facing the same problem (indexing password-protected pages that don't use .htaccess protection) and found this thread very helpful.

I would like to point out that forging a user agent header is very easy, especially with browsers such as Opera. If you are going to use the user agent as an authentication check, you should edit spider.php, set the user agent to something else, and then test for that.

Look for the following lines in admin/spider.php and change to something a little harder to guess:

// set the User-Agent for the file() function
@ini_set('user_agent','PhpDig/'.PHPDIG_VERSION.' (+http://www.phpdig.net/robot.php)');
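
The ini_set() line above is where the agent string is built, so the check on the site side just needs to look for whatever token you substitute. A minimal sketch, assuming you change the agent to an invented secret ("MySecretSpider42" below is a placeholder; pick your own and keep it out of public pages):

```php
<?php
// Placeholder token; it must match whatever you put into the ini_set()
// call for 'user_agent' in the PhpDig admin files.
define('SPIDER_TOKEN', 'MySecretSpider42');

// True when the request's user agent contains the secret token.
function isOurSpider()
{
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    return strpos($agent, SPIDER_TOKEN) !== false;
}

// The edited ini_set() line might then read, for example:
// @ini_set('user_agent', SPIDER_TOKEN . '/' . PHPDIG_VERSION);
```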

08-18-2004, 09:57 AM
A follow-up to my earlier post: to change the spider's user agent, edit the file admin/robot_functions.php, not the spider.php file I mentioned earlier.

Another good idea would be to have your site's authentication mechanism also check that the IP address of the spider is what you expect it to be, just in case.
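
Such an IP check is not part of PhpDig; a minimal sketch, assuming the spider runs from a single known address (192.0.2.10 below is a placeholder from the reserved TEST-NET documentation range):

```php
<?php
// True when the request comes from an address the crawler is expected to
// use. The 192.0.2.x addresses are reserved for documentation examples;
// substitute the real host(s) you run the spider from.
function isTrustedSpiderIp()
{
    $trusted = array('192.0.2.10'); // placeholder crawler address(es)
    $remote = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
    // Strict in_array() so an empty string or 0 can never match.
    return in_array($remote, $trusted, true);
}
```

Combining this with the user-agent test means a forged header alone is no longer enough to see the protected content.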

- Ben