PhpDig.net

Old 04-15-2004, 05:52 PM   #1
Carl Mikkelsen
Green Mole
 
Join Date: Apr 2004
Location: Wayland, MA, USA
Posts: 8
robots.txt not fully honored

I have been deploying phpdig as a test on our intranet. Aside from encountering a php segmentation violation when parsing a cookie, the largest problem I've had is with processing of robots.txt.

Our intranet is almost entirely dynamic content -- much of what I want to index is delivered by TWiki, a collaboration tool distributed from www.twiki.org. In robot_functions.php, a test is made for each URL encountered to determine if it should be indexed. The logic for this test causes an HTTP HEAD request to be issued, I think to determine the content type. This HEAD request is issued without regard to the robots.txt file.

If the content type is appropriate, the robots.txt defined exclusions are tested.

Unfortunately, the HEAD request causes the content for a page to be generated. In some cases, that generation can be VERY lengthy. To all appearances, the indexing has stopped, while the HTTP server grinds to a halt computing page content that is never used.

The "fix" I'm testing is to move the check of robots.txt to the beginning of the function. Iff the file is not excluded, then the content type can be tested as before.
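For anyone who wants to see the ordering spelled out, here is a minimal sketch of the idea in Python. It is not the phpdig code and not the patch itself: `should_index`, `disallowed_prefixes`, and `fetch_content_type` are hypothetical names, and the fetcher is only a stand-in for whatever issues the HEAD request.

```python
from typing import Callable, Iterable


def should_index(path: str,
                 disallowed_prefixes: Iterable[str],
                 fetch_content_type: Callable[[str], str],
                 allowed_types: Iterable[str] = ("text/html", "text/plain")) -> bool:
    """Decide whether a path should be indexed.

    The robots.txt exclusions are consulted first, so an excluded path
    never triggers the (possibly expensive) content-type lookup.
    """
    # Cheap local check first: excluded paths cause no request at all.
    for prefix in disallowed_prefixes:
        if prefix and path.startswith(prefix):
            return False

    # Only paths that passed the exclusion check reach the server.
    # fetch_content_type is an assumed callable standing in for the HEAD request.
    content_type = fetch_content_type(path)
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in allowed_types
```

With a fetcher that records its calls, an excluded path should never show up in the record, which is the behavior I am after.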

If this is causing trouble for anyone else, and if the developers concur, I could post a patch.
__________________
Carl Mikkelsen
www.foxkid.net
Old 04-15-2004, 06:09 PM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Sure, post away. Also, you might be interested in this change.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 04-22-2004, 08:07 AM   #3
bakker
Green Mole
 
Join Date: Apr 2004
Posts: 1
I'm having a similar problem and would be interested in your patch.
Old 04-29-2004, 07:12 AM   #4
Carl Mikkelsen
Green Mole
 
Join Date: Apr 2004
Location: Wayland, MA, USA
Posts: 8
Patches related to robots.txt not fully honored

Attached is a patch file that can be applied to the robot_functions.php file included with the phpdig distribution as of this morning.

It also includes changes to handle MSPOWERPOINT (which should be matched with declarations in includes/config.php), as well as some fixes inherited from an alternate robot_functions.php I downloaded.

There is one change where the DOMAIN field in cookies was causing php to crash. I removed the DOMAIN processing without understanding its intent, so the removal could cause problems. You can back out that change if it does.

The main change is to move the robots.txt processing earlier, so that the HTTP HEAD request is not performed for excluded URLs.

I also fixed what seemed to this php novice to be problems escaping some characters in the robots.txt parsing. With this change, phpdig accepts "*" as a meta-character in robots.txt, allowing entries such as:
Disallow: dynamic-content/view*parm=
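To show the kind of escaping I mean, here is a rough Python sketch. It is not the code in the patch, and `disallow_to_regex` and `is_disallowed` are hypothetical names. It assumes the path is compared in the same form as the pattern is written in robots.txt, and that a pattern matches from the start of the path.

```python
import re


def disallow_to_regex(pattern):
    """Turn one Disallow value into a compiled regex anchored at the start.

    Every piece between "*" characters is escaped, so characters such as
    "." or "?" match literally; only "*" acts as a wildcard.
    """
    pieces = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile("^" + ".*".join(pieces))


def is_disallowed(path, patterns):
    """Return True if the path matches any non-empty Disallow pattern."""
    return any(disallow_to_regex(p).match(path) for p in patterns if p)


if __name__ == "__main__":
    rules = ["dynamic-content/view*parm="]
    print(is_disallowed("dynamic-content/view/Main?parm=1", rules))
    print(is_disallowed("dynamic-content/view/Main", rules))
```

Splitting on "*" before escaping avoids having to undo the escaping of the wildcard afterwards.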

As I am unfamiliar with php, I'm asking both for php-related feedback, and for comments related to the intent of the changes.

Thanks,

-- Carl
Attached Files
File Type: txt robot_functions.diff.txt (9.8 KB, 52 views)
__________________
Carl Mikkelsen
www.foxkid.net
All times are GMT -8.