PDA

View Full Version : https phpdig strange failure


tomas
03-10-2004, 02:56 PM
digging a website with some https-forms produces strange failure:

some forms on this website force ssl this way:

if ($_SERVER["SERVER_PORT"] == "80") {
    $ssl_redirect = "https://foo.com" . $_SERVER["SCRIPT_NAME"];
    header("Location: $ssl_redirect");
    exit;
}


this leads to:

Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource in /search/admin/spider.php on line 485

Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource in /search/admin/spider.php on line 503

About five Apache processes, consuming about 80% of memory, remain for hours, and spidering fails.

our workaround was:

if (($_SERVER["SERVER_PORT"] == "80") and (strstr(strtolower($_SERVER["HTTP_USER_AGENT"]), "mozilla") != "")) {
    $ssl_redirect = "https://foo.com" . $_SERVER["SCRIPT_NAME"];
    header("Location: $ssl_redirect");
    exit;
}


But on websites where you have no access to the scripts, this bug makes it impossible to spider the site.

Any ideas, anyone?
tomas :confused::confused::confused:

Charter
03-10-2004, 04:35 PM
Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource in /search/admin/spider.php on line 485

Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource in /search/admin/spider.php on line 503

Hi. When you echo the queries right before lines 485 and 503 what do you get?

tomas
03-10-2004, 04:47 PM
hallo charter,

There are no echoes before those lines.
Watching the spider, I can see how the script digs the links from the homepage, printing the +++ while finding links. Then, at the first link with an auto-redirect to SSL, the spider crashes with these messages.

Tested on Debian/Fedora/Red Hat Enterprise with PHP 4.2.3/4.3.3/4.3.4 and Apache 1.3/2, both CGI and module,

so I'm sure it's not a platform-specific error.

tomas

Charter
03-10-2004, 04:53 PM
Hi. If you stick the echo statements into the code, what does it echo for the two queries?

echo $query . " : query1";
echo $query . " : query2";

tomas
03-10-2004, 04:56 PM
Those messages are taken from the standard log output;
no extra echoes were inserted.

t.

Charter
03-10-2004, 04:59 PM
Hi. Right, but echo the queries and see what prints onscreen. The error "Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource" means that the query is broken. If you echo the two $query variables, you might see how to fix it.
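Not from PhpDig itself, but a minimal debugging sketch of that idea, with illustrative names: check mysql_query()'s return value before ever calling mysql_num_rows(), so a broken query prints its SQL and MySQL's error text instead of the "invalid result resource" warning.

```php
<?php
// Hypothetical helper: format a failed query so the broken SQL is visible.
function report_failed_query($query, $error) {
    return "Query failed: " . $query . " : " . $error;
}

// Usage with the PHP 4-era mysql extension (sketch, not runnable on modern PHP):
// $result = mysql_query($query);
// if ($result === false) {
//     echo report_failed_query($query, mysql_error());
// } else {
//     echo $query . " : " . mysql_num_rows($result) . " rows";
// }
```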

tomas
03-10-2004, 05:06 PM
ok i will try it tomorrow,

because, as mentioned in my previous post, we applied the patch to our code on the live server of this website, so I have to set all this up again
on one of the test machines and start running the spider.

but for now it's 2 hours past midnight and i'll leave the office
for today - puuuh.

by the way - i sent you an email :-)

tomas

Charter
03-10-2004, 05:19 PM
Hi. Okay, but one thing to check is to see if the following code is still in the spider.php file, right before the "//is this link already in temp table" comment:

if (!get_magic_quotes_runtime()) {
    $lien['path'] = addslashes($lien['path']);
    $lien['file'] = addslashes($lien['file']);
}

As both queries are nearly the same, my initial guess would be that $lien['path'] and/or $lien['file'] are no longer being escaped, so the queries break and mysql_num_rows has nothing to 'num' on.
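For illustration, the escaping that block performs could be wrapped like this. escape_link_part is a hypothetical name, and the function_exists() guard is only there so the sketch also runs on PHP versions where magic quotes are gone:

```php
<?php
// Hypothetical wrapper around the escaping the spider's queries depend on:
// when magic_quotes_runtime is off (or removed, as in PHP >= 8), quotes in
// a path or file name must be escaped by hand before embedding them in SQL.
function escape_link_part($value) {
    if (function_exists('get_magic_quotes_runtime') && @get_magic_quotes_runtime()) {
        return $value; // the runtime already escaped it
    }
    return addslashes($value); // escape quotes and backslashes for the SQL string
}
```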

tomas
03-12-2004, 05:11 PM
hi charter,

sorry for my delay -
I tried inserting the echoes, but there was nothing to catch the bug:
all the SQL looks good. At the last link, right before the first of the 'ssl ones', echoing stops and the server starts running hot (about 80% memory and CPU, constant).

Funny thing, and I have no idea.
Do you? :-)

tomas

Charter
03-12-2004, 06:53 PM
Hi. I think I know what's going on. I don't believe PhpDig was originally written to crawl https, so the header redirect sends http to https, but PhpDig does its thing and https goes back to http, then the redirect sends it back to https, et cetera, so I think it's a loop.
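That loop can be sketched with a hypothetical guard (illustrative names, not PhpDig code): track visited URLs and cap the number of hops, so an http -> https -> http cycle is detected instead of spinning forever.

```php
<?php
// Hypothetical sketch: follow a chain of redirects with a visited set and
// a hop limit. $redirect_map stands in for "fetch $url and read its
// Location header"; a real crawler would do an HTTP request here.
function follow_redirects($url, $redirect_map, $max_hops = 5) {
    $seen = array();
    while (isset($redirect_map[$url])) {
        if (isset($seen[$url]) || count($seen) >= $max_hops) {
            return false; // redirect loop, or too many hops
        }
        $seen[$url] = true;
        $url = $redirect_map[$url];
    }
    return $url; // final, non-redirecting URL
}
```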

If you search for http in the spider.php and robot_functions.php files, and maybe other files, you'll see that some code needs to change in order to account for https links. I'm not going to post a bunch of trial and error code now, but will instead work on it for inclusion in another release.

In the meantime, try sending the PhpDig robot a 403 when https is encountered based on user agent or something. Not tested, but if you want to crawl https one at a time via command line, the following might be okay:

In spider.php change:

if (ereg('^http://',$argv[1])) {

to the following:

if (ereg('^http[s]?://',$argv[1])) {

and then use the following:

prompt> php -f spider.php https://www.domain.com
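The 403 idea above could look like this on the target site's side — a sketch with hypothetical names, modelled on tomas's earlier foo.com redirect snippet: deny the PhpDig robot instead of redirecting it into the https loop.

```php
<?php
// Hypothetical guard for the *target site's* code, not for PhpDig itself.
// PhpDig identifies itself with "PhpDig/<version>" in its User-Agent.
function is_phpdig_robot($user_agent) {
    // case-insensitive substring match on the robot's signature
    return stristr($user_agent, 'phpdig') !== false;
}

// Usage in the page (sketch):
// if (is_phpdig_robot($_SERVER['HTTP_USER_AGENT'])) {
//     header('HTTP/1.0 403 Forbidden');
//     exit;
// }
// if ($_SERVER['SERVER_PORT'] == '80') { ... https redirect as before ... }
```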

thowden
09-21-2004, 09:36 PM
Hi

This seems to be the appropriate thread for a follow up on https indexing. Here's where I am at and what I have done trying to get phpdig to function on my server.

Config: Red Hat-based e-smith server with Apache.

Symptoms:
test mode is
php -f spider.php https://my.server.com

results in

4094: old priority 0, new priority 18
Spidering in progress...
-----------------------------
SITE : https://my.server.com/
Exclude paths :
- @NONE@
No link in temporary table
links found : 0
...Was recently indexed
Optimizing tables...
Indexing complete !

/var/log/httpd/error_log shows group of 4 errors for each spider attempt

[Wed Sep 22 09:32:12 2004] [error] [client 192.168.85.33] File does not exist: /web/server/root/html/robots.txt
[Wed Sep 22 09:32:12 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD / HTTPS/1.1
[Wed Sep 22 09:32:12 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD / HTTPS/1.1
[Wed Sep 22 09:32:12 2004] [error] [client 192.168.85.33] File does not exist: /web/server/root/html/robots.txt


From the rest of this thread I started looking at how it works.

In my server the site is forced to https and is a virtual host structure within apache.

The first issue I can see is that the robots.txt file is being looked for within the root of the server's web sites rather than the location of the virtual site. Also, why is it being looked up as a file path rather than as a URL?
The second issue is the erroneous characters.

Note that if I vary the url to use an http:// access point for the site as a sub-url of the server primary site, then phpdig does a full index. So phpdig and php are configured and will work for sites other than https.

Issue #1

Line 869 in robot_functions.php forces the search for robots.txt back to http, which, in our case, defaults to a different site on the web server:

$site = eregi_replace("^https","http",$site);

Commenting out line 869 with // produces a different set of error log messages:

[Wed Sep 22 09:47:55 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD /robots.txt HTTPS/1.1
[Wed Sep 22 09:47:55 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD / HTTPS/1.1
[Wed Sep 22 09:47:55 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD / HTTPS/1.1
[Wed Sep 22 09:47:55 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD /robots.txt HTTPS/1.1

This would indicate that the line should be left out, as all four errors are now consistent with Issue #2. A side question is why robots.txt is requested twice.
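One way to avoid the scheme rewrite entirely would be a helper along these lines (robots_txt_url is a hypothetical name, not PhpDig code): derive the robots.txt URL from the site URL exactly as given, so an https-only virtual host is asked for its own robots.txt rather than the server's default http site.

```php
<?php
// Hypothetical helper: build the robots.txt URL for a site without
// touching the scheme, so https stays https.
function robots_txt_url($site) {
    return rtrim($site, '/') . '/robots.txt';
}
```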

Issue #2

Since the request for robots.txt is also reported as having erroneous characters, I looked first at line 272 of robot_functions.php, in the function phpdigTestUrl().

In this function the END_OF_LINE_MARKER is added to every HEAD request

I changed the value of END_OF_LINE_MARKER from \r\n to \n:

define("END_OF_LINE_MARKER","\n");

This was just in case it was the problem; it wasn't, and it had no effect on the error messages. I left it as \n for Linux.

Next I checked the HEAD and GET request constructs and confirmed that the server does not accept HTTPS/1.1 as a protocol string:

Line 347 "HEAD $path $http_scheme/1.1".END_OF_LINE_MARKER

line 621 "GET $path $http_scheme/1.1".END_OF_LINE_MARKER

changed to

Line 347 "HEAD $path HTTP/1.1".END_OF_LINE_MARKER

line 621 "GET $path HTTP/1.1".END_OF_LINE_MARKER

It seems the request line always wants HTTP/1.1, regardless of whether the URL is https or http.

I checked the W3C for any reference to HTTP vs. HTTPS, and an "HTTPS" protocol does not exist. From my quick read it seems that RFC 2818, HTTP over TLS, is the relevant document: HTTP is HTTP, whereas HTTPS is really HTTP over TLS instead of plain TCP. But I'll stop there, 'cos I don't really need to go that deep.
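That reading can be sketched in code (hypothetical helper names; PHP's ssl:// stream wrapper is what fsockopen would use for a TLS connection): the scheme selects only the transport and default port, while the protocol token on the wire is always "HTTP/1.1".

```php
<?php
// Hypothetical sketch of the RFC 2818 point: https changes the transport,
// not the protocol token in the request line.
function transport_for_scheme($scheme) {
    // https means HTTP over TLS: ssl:// wrapper, default port 443
    return ($scheme === 'https') ? array('ssl://', 443) : array('tcp://', 80);
}

function head_request_line($path) {
    // CRLF line endings per the HTTP spec, regardless of operating system
    return "HEAD " . $path . " HTTP/1.1\r\n";
}

// Usage (sketch):
// list($wrapper, $port) = transport_for_scheme('https');
// $fp = fsockopen($wrapper . $host, $port, $errno, $errstr, 30);
// fwrite($fp, head_request_line('/robots.txt'));
```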

That fixed the erroneous-characters issue, but the error logs now show the wrong robots.txt file being looked for again. Is this because of this change, or because I didn't really fix it earlier?

[Wed Sep 22 11:09:27 2004] [error] [client 192.168.85.33] File does not exist: /web/server/root/html/robots.txt
[Wed Sep 22 11:09:27 2004] [error] [client 192.168.85.33] File does not exist: /web/server/root/html/robots.txt

from the access log it shows

www.my.server 192.168.85.33 - - [22/Sep/2004:11:09:27 +1000] "HEAD /robots.txt HTTP/1.1" 404 0 "-" "PhpDig/1.8.3 (+http://www.phpdig.net/robot.php)"
www.my.server 192.168.85.33 - - [22/Sep/2004:11:09:27 +1000] "HEAD / HTTP/1.1" 200 0 "-" "PhpDig/1.8.3 (+http://www.phpdig.net/robot.php)"
www.my.server 192.168.85.33 - - [22/Sep/2004:11:09:27 +1000] "HEAD / HTTP/1.1" 200 0 "-" "PhpDig/1.8.3 (+http://www.phpdig.net/robot.php)"
www.my.server 192.168.85.33 - - [22/Sep/2004:11:09:27 +1000] "HEAD /robots.txt HTTP/1.1" 404 0 "-" "PhpDig/1.8.3 (+http://www.phpdig.net/robot.php)"

This would indicate that the script is still searching at the root rather than the virtual host URL. As a quick test I used a browser to look at the robots.txt file and checked the access log, and that works fine. So it has to be related to how phpdig requests robots.txt under https. Another point is that the host name is not shown as the virtual host name, so something is getting lost in translation.

I am now stuck. Is this a phpdig issue, an Apache issue, or something really obscure? :(

I have two sites that I need to index that are locked into being only https access. The http sites on the same server are fine.

The https site now gets only a single link indexed, and that is the topmost self-link to the domain. Yet doing the same site via http at a depth of 3/3 gets 10 entries.

cheers
Tony

http://www.marblebay.com.au

Charter
12-04-2004, 07:17 PM
>> I am now stuck, is this a phpdig issue, or an apache issue, or something really obscure.

tomas, thowden, you still around?