Fix for slow spidering in PhpDig 1.8.x


vital
11-01-2004, 09:26 AM
Some people using PhpDig reported a severe performance hit when moving from version 1.6.x to 1.8.x.
I was one of them. It could take more than 20 seconds to index a single page with version 1.8.x, while PhpDig 1.6.x indexed the same page in less than a second.
For me this happened only under win32.

Finally I found the cause of it. Look at the following code:

function phpdigGetUrl($url,$cookies=array()) {
    /* cut */
    $fp = @fsockopen($host,$port);
    /* cut */
    //complete get
    $request =
        "GET $path $http_scheme/1.1".END_OF_LINE_MARKER
        ."Host: $host$sport".END_OF_LINE_MARKER
        .$cookiesSendString
        .$auth_string
        ."Accept: */*".END_OF_LINE_MARKER
        ."Accept-Charset: ".PHPDIG_ENCODING.END_OF_LINE_MARKER
        ."Accept-Encoding: identity".END_OF_LINE_MARKER
        ."User-Agent: PhpDig/".PHPDIG_VERSION." (+http://www.phpdig.net/robot.php)".END_OF_LINE_MARKER.END_OF_LINE_MARKER;

    fputs($fp,$request);

    //get return page
    /* cut */
    while (!$stop && !feof($fp)) {
        // safety counter: gives up after 10000 reads
        $flag_to_stop_loop++;
        $answer = fgets($fp,8192);
        /* cut */
        if ($flag_to_stop_loop == 10000) { break; }
        /* cut */
    }
    /* cut */
}


The spider opens a connection to the site using fsockopen(), sends a GET request and then reads the response from the socket with fgets(), up to 8 KB per call. The problem is in the HTTP request header itself.

As stated in the HTTP/1.1 specification (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html):
"HTTP/1.1 applications that do not support persistent connections MUST include the "close" connection option in every message."

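In our case the request does not include it, so the server treats the connection as persistent (the HTTP/1.1 default) and keeps it open after sending the page. feof() therefore does not return TRUE once the response has been read. In the worst case the "while" loop iterates 10000 times, hence the delay.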

To fix this, search for the following line in the robot_functions.php file:
."User-Agent: PhpDig/".PHPDIG_VERSION." (+http://www.phpdig.net/robot.php)".END_OF_LINE_MARKER.END_OF_LINE_MARKER;
There should be 3 matches in total (lines 354, 460 and 628).

Insert this line before each of them:
."Connection: close".END_OF_LINE_MARKER
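
For reference, here is a minimal standalone sketch of what the request string looks like after the change. It is not the actual PhpDig code: buildRequest() is a hypothetical helper, the cookie and auth strings are left out, and the constant values below are placeholder assumptions (in PhpDig they come from its own config).

```php
<?php
// Placeholder values, assumed for this sketch only.
define('END_OF_LINE_MARKER', "\r\n");
define('PHPDIG_VERSION', '1.8.x');
define('PHPDIG_ENCODING', 'iso-8859-1');

// Hypothetical helper: builds the GET request the same way the
// excerpt above does, with the "Connection: close" line added
// right before the User-Agent line.
function buildRequest($host, $path) {
    return "GET $path HTTP/1.1".END_OF_LINE_MARKER
        ."Host: $host".END_OF_LINE_MARKER
        ."Accept: */*".END_OF_LINE_MARKER
        ."Accept-Charset: ".PHPDIG_ENCODING.END_OF_LINE_MARKER
        ."Accept-Encoding: identity".END_OF_LINE_MARKER
        ."Connection: close".END_OF_LINE_MARKER
        ."User-Agent: PhpDig/".PHPDIG_VERSION." (+http://www.phpdig.net/robot.php)".END_OF_LINE_MARKER.END_OF_LINE_MARKER;
}

$request = buildRequest('www.example.com', '/');
echo $request;
```

The new header has to go before the User-Agent line because that line ends with two END_OF_LINE_MARKERs, which terminate the header block. Anything appended after it would no longer be read as a header.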

Hope this will help.

manfred
11-01-2004, 11:20 PM
This correction will make a big difference to spidering speed. I suggest that everybody (and Charter) implement this immediately! Thank you very much.

-m-

funsutton
11-03-2004, 10:48 AM
Wow, I think that fix really helped me. I applied it and it looks good!

Great work!

-Brian Sutton
http://www.piedmontswingdance.org/search

AllKnightAccess
11-06-2004, 10:33 AM
I tried it, but I am not seeing any faster results. According to my log file, pages at my site are still being indexed at around 60 - 90 seconds per page.