09-21-2004, 08:36 PM   #11
thowden
HTTP vs HTTPS indexing

Hi

This seems to be the appropriate thread for a follow-up on https indexing. Here is where I am and what I have tried so far to get phpdig working on my server.

Config: Red Hat based e-smith server with Apache

Symptoms:
The test command is
php -f spider.php https://my.server.com

which results in

4094: old priority 0, new priority 18
Spidering in progress...
-----------------------------
SITE : https://my.server.com/
Exclude paths :
- @NONE@
No link in temporary table
links found : 0
...Was recently indexed
Optimizing tables...
Indexing complete !

/var/log/httpd/error_log shows a group of four errors for each spider attempt:

[Wed Sep 22 09:32:12 2004] [error] [client 192.168.85.33] File does not exist: /web/server/root/html/robots.txt
[Wed Sep 22 09:32:12 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD / HTTPS/1.1
[Wed Sep 22 09:32:12 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD / HTTPS/1.1
[Wed Sep 22 09:32:12 2004] [error] [client 192.168.85.33] File does not exist: /web/server/root/html/robots.txt


Going by the rest of this thread, I started looking at how phpdig works.

On my server the site is forced to https and is set up as a virtual host within Apache.

The first issue I can see is that robots.txt is being looked for in the root of the server's primary web site rather than in the location of the virtual site. Also, why does the log show it as a file path rather than as a URL?
The second issue is the "erroneous characters" error.

Note that if I change the URL to an http:// access point for the site, as a sub-URL of the server's primary site, then phpdig does a full index. So phpdig and PHP are configured correctly and work for sites other than https.

Issue #1

Line 869 in robot_functions.php forces the search for robots.txt back to http which, in our case, defaults to a different site on the web server.

$site = eregi_replace("^https","http",$site);

Commenting out line 869 with // produces a different set of error log messages:

[Wed Sep 22 09:47:55 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD /robots.txt HTTPS/1.1
[Wed Sep 22 09:47:55 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD / HTTPS/1.1
[Wed Sep 22 09:47:55 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD / HTTPS/1.1
[Wed Sep 22 09:47:55 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD /robots.txt HTTPS/1.1

This would indicate that line 869 should be left out, as all four errors are now consistent with Issue #2. A side question is why robots.txt is requested twice.
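As a rough sketch of the idea behind dropping that line, the robots.txt URL can be derived from the site URL while keeping whatever scheme it came with. The helper name robotsUrlFor is made up for illustration and is not a PhpDig function; it also assumes the site URL is absolute and ignores any user:password part.

```php
<?php
// Hypothetical helper (not part of PhpDig): derive the robots.txt URL
// for a site while preserving its scheme, instead of rewriting https to http.
function robotsUrlFor($siteUrl)
{
    $parts = parse_url($siteUrl);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false; // not an absolute URL we can work with
    }

    $url = strtolower($parts['scheme']) . '://' . $parts['host'];

    // Keep an explicit port if one was given (e.g. https on 8443).
    if (isset($parts['port'])) {
        $url .= ':' . $parts['port'];
    }

    // robots.txt is always requested from the root of the host.
    return $url . '/robots.txt';
}

echo robotsUrlFor('https://my.server.com/'), "\n";
```

With this approach an https site is asked for its own robots.txt rather than the one belonging to whatever site answers on plain http.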

Issue #2

Since the request for robots.txt is also reported as having erroneous characters, I looked first at line 272 of robot_functions.php, function phpdigTestUrl().

In this function END_OF_LINE_MARKER is appended to every HEAD request.

I changed the value of END_OF_LINE_MARKER from \r\n to \n

define("END_OF_LINE_MARKER","\n");

just in case this was the problem. It wasn't, and the change had no effect on the error messages. I left it as \n for Linux, though strictly speaking HTTP specifies \r\n as the line ending regardless of platform.

Next I checked how the HEAD and GET request lines are built and confirmed that the server does not accept HTTPS/1.1 as the protocol string.

Line 347 "HEAD $path $http_scheme/1.1".END_OF_LINE_MARKER

Line 621 "GET $path $http_scheme/1.1".END_OF_LINE_MARKER

changed to

Line 347 "HEAD $path HTTP/1.1".END_OF_LINE_MARKER

Line 621 "GET $path HTTP/1.1".END_OF_LINE_MARKER

It seems that the server always wants HTTP/1.1 in the request line, regardless of whether the URL is HTTPS or HTTP.

I checked the W3C material for any reference to HTTP vs HTTPS, and HTTPS does not exist as a protocol version. From my quick read, RFC 2818 (HTTP over TLS) is the relevant document: HTTP is HTTP, whereas HTTPS is really HTTP carried over TLS instead of directly over TCP. But I'll stop there because I don't really need to go that deep.
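A minimal sketch of what that distinction means in PHP terms: the URL scheme selects the transport and default port, while the request line stays HTTP/1.1 either way. The helper name buildHeadRequest is invented for illustration and is not PhpDig code. The ssl:// prefix is the form PHP's fsockopen() accepts for TLS connections, which assumes PHP was built with OpenSSL support.

```php
<?php
// Hypothetical helper (not PhpDig code): the URL scheme picks the transport
// and default port; the request line itself is always HTTP/1.1.
function buildHeadRequest($url)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }

    $isHttps = (strtolower($parts['scheme']) === 'https');
    $host    = $parts['host'];
    $port    = isset($parts['port']) ? $parts['port'] : ($isHttps ? 443 : 80);
    $path    = isset($parts['path']) ? $parts['path'] : '/';

    // The Host header carries the virtual host name; include the port
    // only when one was given explicitly in the URL.
    $hostHeader = isset($parts['port']) ? "$host:$port" : $host;

    // HTTP uses CRLF line endings, and a blank line ends the headers.
    $request = "HEAD $path HTTP/1.1\r\n"
             . "Host: $hostHeader\r\n"
             . "Connection: close\r\n\r\n";

    return array(
        // fsockopen() takes "ssl://host" for TLS (needs OpenSSL support in PHP).
        'target'  => ($isHttps ? 'ssl://' : '') . $host,
        'port'    => $port,
        'request' => $request,
    );
}

$req = buildHeadRequest('https://my.server.com/robots.txt');
echo $req['target'], ':', $req['port'], "\n", $req['request'];

// Sending it would look roughly like this (left commented out, untested here):
// $fp = fsockopen($req['target'], $req['port'], $errno, $errstr, 10);
// fwrite($fp, $req['request']);
```

If a spider sends a correct HTTP/1.1 request line but still connects over plain TCP on port 80, the request reaches the non-SSL site, which may be worth checking when robots.txt lookups land in the wrong document root.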

That fixed the erroneous characters issue, but the error log now shows the wrong robots.txt file being looked for again. Is this because of this change, or because I didn't really fix it earlier?

[Wed Sep 22 11:09:27 2004] [error] [client 192.168.85.33] File does not exist: /web/server/root/html/robots.txt
[Wed Sep 22 11:09:27 2004] [error] [client 192.168.85.33] File does not exist: /web/server/root/html/robots.txt

The access log shows:

www.my.server 192.168.85.33 - - [22/Sep/2004:11:09:27 +1000] "HEAD /robots.txt HTTP/1.1" 404 0 "-" "PhpDig/1.8.3 (+http://www.phpdig.net/robot.php)"
www.my.server 192.168.85.33 - - [22/Sep/2004:11:09:27 +1000] "HEAD / HTTP/1.1" 200 0 "-" "PhpDig/1.8.3 (+http://www.phpdig.net/robot.php)"
www.my.server 192.168.85.33 - - [22/Sep/2004:11:09:27 +1000] "HEAD / HTTP/1.1" 200 0 "-" "PhpDig/1.8.3 (+http://www.phpdig.net/robot.php)"
www.my.server 192.168.85.33 - - [22/Sep/2004:11:09:27 +1000] "HEAD /robots.txt HTTP/1.1" 404 0 "-" "PhpDig/1.8.3 (+http://www.phpdig.net/robot.php)"

This would indicate that the script is still searching at the server root rather than at the virtual host URL. As a quick test I requested robots.txt with a browser and checked the access log, and that works fine. So the problem has to be related to how phpdig requests robots.txt under https. Another point is that the host name logged is not the virtual host name, so something is getting lost in translation.

I am now stuck. Is this a phpdig issue, an Apache issue, or something really obscure?

I have two sites that I need to index that are locked to https-only access. The http sites on the same server are fine.

The https site now gets only a single link indexed, and that is the topmost self link to the domain. Yet indexing the same site via http at a depth of 3/3 gets 10 entries.

cheers
Tony

http://www.marblebay.com.au