PhpDig.net

vital 11-01-2004 09:26 AM

Fix for slow spidering in PhpDig 1.8.x
 
Some people using PhpDig reported a severe performance hit when moving from version 1.6.x to 1.8.x.
I was one of them. With version 1.8.x it could take more than 20 seconds to index a single page; PhpDig 1.6.x indexed the same page in less than a second.
This happened only under win32.

Finally I found the cause of it. Look at the following code:
PHP Code:

function phpdigGetUrl($url,$cookies=array()) {
    /* cut */
    $fp = @fsockopen($host,$port);
    /* cut */
    //complete get
    $request =
        "GET $path $http_scheme/1.1".END_OF_LINE_MARKER
        ."Host: $host$sport".END_OF_LINE_MARKER
        .$cookiesSendString
        .$auth_string
        ."Accept: */*".END_OF_LINE_MARKER
        ."Accept-Charset: ".PHPDIG_ENCODING.END_OF_LINE_MARKER
        ."Accept-Encoding: identity".END_OF_LINE_MARKER
        ."User-Agent: PhpDig/".PHPDIG_VERSION." (+http://www.phpdig.net/robot.php)".END_OF_LINE_MARKER.END_OF_LINE_MARKER;

    fputs($fp,$request);

    //get return page
    /* cut */
    while (!$stop && !feof($fp)) {
        // safety counter: the loop gives up after 10000 reads
        $flag_to_stop_loop++;
        $answer = fgets($fp,8192);
        /* cut */
        if ($flag_to_stop_loop == 10000) { break; }
        /* cut */
    }
    /* cut */


The spider opens a connection to the site using fsockopen(), sends a GET request, and then reads the response from the socket with fgets(), up to 8 KB per call. The problem is in the HTTP request headers themselves.

As stated in the HTTP/1.1 specification (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html):
Quote:

HTTP/1.1 applications that do not support persistent connections MUST include the "close" connection option in every message.

In our case the request does not include it, so the server treats the connection as persistent and leaves it open after sending the response, and feof() does not return TRUE when the response ends. In the worst case the "while" loop iterates 10000 times, hence the delay.
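
To see the effect outside PhpDig, here is a minimal sketch that sends the same kind of request with and without the "close" option and times the read loop. buildGetRequest() and timedFetch() are hypothetical helpers of mine, not PhpDig functions, and I am assuming END_OF_LINE_MARKER is "\r\n"; the host you test against and the timeouts are up to you.

```php
<?php
// Hypothetical helper, not part of PhpDig: builds a bare HTTP/1.1 GET request.
function buildGetRequest($host, $path, $close = true) {
    $eol = "\r\n"; // assumed value of END_OF_LINE_MARKER
    $request = "GET $path HTTP/1.1".$eol
        ."Host: $host".$eol
        ."Accept: */*".$eol;
    if ($close) {
        // Ask the server to close the socket after the response,
        // so that feof() becomes TRUE once the body has been read.
        $request .= "Connection: close".$eol;
    }
    return $request.$eol; // an empty line ends the header block
}

// Hypothetical helper: fetches a page and reports how long the read loop took.
function timedFetch($host, $path, $close = true, $port = 80) {
    $fp = @fsockopen($host, $port, $errno, $errstr, 10);
    if (!$fp) { return false; }
    stream_set_timeout($fp, 30); // do not block forever on an idle socket
    fputs($fp, buildGetRequest($host, $path, $close));
    $start = microtime(true);
    $bytes = 0;
    while (!feof($fp)) {
        $line = fgets($fp, 8192);
        if ($line === false) { break; } // timeout or read error
        $bytes += strlen($line);
    }
    fclose($fp);
    return array('bytes' => $bytes, 'seconds' => microtime(true) - $start);
}
```

Calling timedFetch() twice against the same page, once with $close set to true and once with false, should show whether the server keeps the socket open after the response. How long it waits depends on the server's keep-alive settings.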

To fix this, search for the following line in the robot_functions.php file:
."User-Agent: PhpDig/".PHPDIG_VERSION." (+http://www.phpdig.net/robot.php)".END_OF_LINE_MARKER.END_OF_LINE_MARKER;
There will be a total of 3 matches (lines 354, 460 and 628).

Insert this line before each of them:
."Connection: close".END_OF_LINE_MARKER
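
If you want to double-check that all matches were patched, something like the following could do it. This is only a sketch: checkConnectionClose() is a hypothetical helper, not part of PhpDig, and it assumes the new line sits directly above each User-Agent line (blank lines in between are skipped).

```php
<?php
// Hypothetical checker, not part of PhpDig. For every line that builds the
// User-Agent header, it reports whether the previous non-blank line adds
// the "Connection: close" header.
function checkConnectionClose($source) {
    $lines = preg_split('/\r\n|\r|\n/', $source);
    $report = array();
    foreach ($lines as $i => $line) {
        if (strpos($line, '"User-Agent: PhpDig/"') === false) { continue; }
        // walk back to the previous non-blank line
        $j = $i - 1;
        while ($j >= 0 && trim($lines[$j]) === '') { $j--; }
        $patched = $j >= 0 && strpos($lines[$j], '"Connection: close"') !== false;
        $report[$i + 1] = $patched; // key is the 1-based line number
    }
    return $report;
}
```

Run it as print_r(checkConnectionClose(file_get_contents('robot_functions.php'))); adjusting the path to wherever the file lives in your install. Each key is the line number of a match and the value is TRUE once that match has been patched.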

Hope this will help.

manfred 11-01-2004 11:20 PM

This correction will make a big difference to spidering speed. I suggest that everybody (and Charter) implement this immediately! Thank you very much.

-m-

funsutton 11-03-2004 10:48 AM

Wow, I think that fix really helped me. I applied it and it looks good!

Great work!

-Brian Sutton
http://www.piedmontswingdance.org/search

AllKnightAccess 11-06-2004 10:33 AM

I tried it, but I am not seeing any faster results. According to my log file, pages at my site are still being indexed at around 60 - 90 seconds per page.


All times are GMT -8.