
No indexing IIS 6 Win 2003 Server


Rolandks
09-19-2003, 03:44 AM
I have spent a lot of time trying to find out what the problem is with the NEW IIS 6 on Windows 2003 Server.

PhpDig does not index IIS 6 websites at the moment.

I also tried to index an IIS 6 site from a Linux system, with the same result. (Email me and I will send you the web page to test it.)

Results of indexing:

### IIS 6 - Log file ####
#Fields: date time c-ip c-session cs(Referer) sc-Protocol sc-uri sc-status
2003-09-18 19:41:27 62.141.48.115 1033 217.160.xx.xx 80 HTTP/1.1 HEAD /robots.txt 400 - BadRequest
2003-09-18 19:41:27 62.141.48.115 1034 217.160.xx.xx 80 HTTP/1.1 HEAD // 400 - BadRequest
2003-09-18 19:41:27 62.141.48.115 1035 217.160.xx.xx 80 HTTP/1.1 HEAD / 400 - BadRequest
2003-09-18 19:41:27 62.141.48.115 1036 217.160.xx.xx 80 HTTP/1.1 HEAD /robots.txt 400 - BadRequest
op=HEAD arg=http://www.my-domain.de/ result="400 Bad Request"

## Windows 2003 Monitoring ###
<-> Filter: http
----------------------------------
HTTP: HEAD Request from Client
HTTP: Request Method =HEAD
HTTP: Uniform Resource Identifier =//
HTTP: Protocol Version =HTTP/1.1
HTTP: Host =www.my-domain.de
HTTP: Accept = */*
HTTP: Accept-Charset = iso-8859-1
HTTP: Accept-Encoding =identity
HTTP: User-Agent =PhpDig/1.6.2 (PHP; MySql)
------
HTTP: Response to Client; HTTP/1.1; Status Code = 400 - Bad Request
HTTP: Protocol Version =HTTP/1.1
HTTP: Status Code = Bad Request
HTTP: Reason =Bad Request
HTTP: Content-Length =20
HTTP: Content-Type =text/html
HTTP: Connection =close

I will also ask in a Windows newsgroup to find out the reason for this.

I have read about other problems with error 400. Does PhpDig send only requests that conform to the HTTP RFC? See RFC 2616 (http://www.w3.org/Protocols/rfc2616/rfc2616.html).

-Roland-

Charter
09-19-2003, 09:47 AM
Hi. With HEAD [your_site]/robots.txt HTTP/1.1 it produces the following:

Content-Length: 24

The robots.txt file contains the following:

User-agent: *
Disallow:

What happens if you just delete the robots.txt file?

What do you get?

Rolandks
09-19-2003, 10:13 AM
OK, it is deleted. You can try again. It's just the same in my tests.

-Roland-

Charter
09-19-2003, 10:22 AM
Hi. Please can you post the results like you did above? Maybe there will be something in there, or are the results just like those above?

Rolandks
09-19-2003, 12:24 PM
Hmm, a monitor log is only possible if I start the monitor 2 seconds before I dig.

This is wrong, IMHO!!
robot_functions.php, line 286:


$request =
"HEAD $path HTTP/1.1\n"
."Host: $host$sport\n"
.$cookiesSendString
.$auth_string
."Accept: */*\n"
."Accept-Charset: ".PHPDIG_ENCODING."\n"
."Accept-Encoding: identity\n"
."User-Agent: PhpDig/".PHPDIG_VERSION." (PHP; MySql)\n\n";


The header lines of the HEAD request are NOT terminated by CRLF, only
by LF ('\n'). LF alone is wrong according to the RFC: each header line must end with CRLF!!

See:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2


HTTP/1.1 defines the sequence CR LF as the end-of-line marker for all
protocol elements except the entity-body (see appendix 19.3 for
tolerant applications). The end-of-line marker within an entity-body
is defined by its associated media type, as described in section 3.7.

CRLF = CR LF
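
As a rough illustration of the rule quoted above, here is a minimal sketch of a HEAD request in which every line, including the blank line that closes the header block, ends in CRLF. The function name buildHeadRequest and its parameters are hypothetical and are not part of PhpDig; the header set is only loosely modelled on the request shown in this thread.

```php
<?php
// Hypothetical helper (not PhpDig code): builds a HEAD request whose
// lines all end in CRLF, as RFC 2616 section 2.2 requires.
function buildHeadRequest($host, $path, $userAgent)
{
    $crlf = "\r\n";
    $lines = array(
        "HEAD $path HTTP/1.1",
        "Host: $host",
        "Accept: */*",
        "Accept-Encoding: identity",
        "User-Agent: $userAgent",
        "Connection: close",
    );
    // Join the header lines with CRLF; the trailing CRLF CRLF
    // terminates the last header and then the header block itself.
    return implode($crlf, $lines) . $crlf . $crlf;
}
```

Building the lines as an array and joining them once keeps the line terminator in a single place, so a stray "\n" cannot slip into one header.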


-Roland-

Charter
09-19-2003, 12:43 PM
Hi. I believe the problem is that the script uses \n and your machine needs \r\n.

Please try this to fix the problem: First make a backup of the robot_functions.php file. Then in robot_functions.php, do the following:


find:


$auth_string = 'Authorization: Basic '.base64_encode($components['user'].':'.$components['pass'])."\n";


and replace with:


$auth_string = 'Authorization: Basic '.base64_encode($components['user'].':'.$components['pass'])."\r\n";


find:


$cookiesSendString .= "Cookie: ".$cookieString['string']."\n";


and replace with:


$cookiesSendString .= "Cookie: ".$cookieString['string']."\r\n";


find:


@ini_set('user_agent','PhpDig/'.PHPDIG_VERSION.' (PHP; MySql)'."\n".phpDigMakeCookies($cookiesToSend,$path));


and replace with:


@ini_set('user_agent','PhpDig/'.PHPDIG_VERSION.' (PHP; MySql)'."\r\n".phpDigMakeCookies($cookiesToSend,$path));


find:


$request =
"HEAD $path HTTP/1.1\n"
."Host: $host$sport\n"
.$cookiesSendString
.$auth_string
."Accept: */*\n"
."Accept-Charset: ".PHPDIG_ENCODING."\n"
."Accept-Encoding: identity\n"
."User-Agent: PhpDig/".PHPDIG_VERSION." (PHP; MySql)\n\n";


and replace with:


$request =
"HEAD $path HTTP/1.1\r\n"
."Host: $host$sport\r\n"
.$cookiesSendString
.$auth_string
."Accept: */*\r\n"
."Accept-Charset: ".PHPDIG_ENCODING."\r\n"
."Accept-Encoding: identity\r\n"
."User-Agent: PhpDig/".PHPDIG_VERSION." (PHP; MySql)\r\n\r\n";


find:


$req1 = "HEAD $path HTTP/1.1\n"
."Host: $host$sport\n"
.$cookiesSendString
.$auth_string
."Accept: */*\n"
."Accept-Charset: ".PHPDIG_ENCODING."\n"
."Accept-Encoding: identity\n"
."User-Agent: PhpDig/".PHPDIG_VERSION." (PHP; MySql)\n\n";


and replace with:


$req1 = "HEAD $path HTTP/1.1\r\n"
."Host: $host$sport\r\n"
.$cookiesSendString
.$auth_string
."Accept: */*\r\n"
."Accept-Charset: ".PHPDIG_ENCODING."\r\n"
."Accept-Encoding: identity\r\n"
."User-Agent: PhpDig/".PHPDIG_VERSION." (PHP; MySql)\r\n\r\n";



I think that's all of them that absolutely need to be changed. I also think you could just do a search and replace, changing all \n to \r\n in the files.

As a general rule of thumb, I believe it's like this for different OS:

Windows uses \r\n
Macintosh uses \r
*nix uses \n
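
Note that these conventions apply to text files on each OS; HTTP itself expects CRLF regardless of the platform the client runs on. A blind search and replace of \n with \r\n would also turn any existing \r\n into \r\r\n. The sketch below shows one way to normalize a string at runtime and to check for leftover bare LFs. Both function names (toCrlf, hasBareLf) are hypothetical helpers, not part of PhpDig.

```php
<?php
// Hypothetical helper: convert any mix of \r\n, \r, and \n to CRLF.
function toCrlf($text)
{
    // \r\n is listed first so an existing CRLF is matched as a unit
    // and not doubled into \r\r\n.
    return preg_replace('/\r\n|\r|\n/', "\r\n", $text);
}

// Hypothetical check: true if the string contains an LF that is
// not immediately preceded by a CR.
function hasBareLf($text)
{
    return preg_match('/(?<!\r)\n/', $text) === 1;
}
```

Running hasBareLf() on a request string before sending it is a cheap way to confirm that no header line was missed when patching.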

Charter
09-19-2003, 12:47 PM
Originally posted by Rolandks
Hmm, a monitor log is only possible if I start the monitor 2 seconds before I dig.

Question: where is the relevant line in Spider.php?

Are the header lines of the HEAD requests terminated by CRLF or only
by LF ('\n')? LF alone is wrong according to the RFC: each header line must end with CRLF!!

-Roland-

Ah, I see you were already thinking that. To test, I wrote a script to do a HEAD request on your machine. With only \n I received 400 Bad Request, but with \r\n it worked fine.
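
The test script itself was not posted, so the following is only a sketch of how such a comparison could be made. It assumes plain HTTP on port 80; the names headStatusLine and statusCode are hypothetical.

```php
<?php
// Hypothetical test helper: sends a HEAD request using the given
// end-of-line sequence and returns the first line of the response,
// or false if the connection or read fails.
function headStatusLine($host, $path, $eol, $port = 80, $timeout = 10)
{
    $fp = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if (!$fp) {
        return false;
    }
    $request = "HEAD $path HTTP/1.1" . $eol
        . "Host: $host" . $eol
        . "Connection: close" . $eol
        . $eol;
    fwrite($fp, $request);
    $statusLine = fgets($fp, 1024);
    fclose($fp);
    return $statusLine === false ? false : rtrim($statusLine);
}

// Extracts the numeric status code from a line such as
// "HTTP/1.1 400 Bad Request"; returns null if the line does not match.
function statusCode($statusLine)
{
    if (preg_match('#^HTTP/\d\.\d (\d{3})#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null;
}
```

Calling headStatusLine() once with "\n" and once with "\r\n" as the $eol argument, and passing each result through statusCode(), would repeat the comparison described above against any server you are allowed to test.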

Rolandks
09-19-2003, 01:40 PM
Thanks :)

I think this should be changed in the next version so that it conforms to the RFC. Otherwise, users who update will have to apply this fix again.

I wrote above:

See:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2


HTTP/1.1 defines the sequence CR LF as the end-of-line marker for all
protocol elements except the entity-body (see appendix 19.3 for
tolerant applications). The end-of-line marker within an entity-body
is defined by its associated media type, as described in section 3.7.
CRLF= CR LF


Microsoft IIS 6 is designed for NEW security ;) so it applies the RFC STRICTLY and does not behave like a tolerant application.

-Roland-