Old 11-01-2004, 09:26 AM   #1
vital
Fix for slow spidering in PhpDig 1.8.x

Some people using PhpDig reported a severe performance hit when moving from version 1.6.x to 1.8.x.
I was one of them. With version 1.8.x it could take more than 20 seconds to index a single page. PhpDig 1.6.x indexed the same page in less than a second.
This happened only under win32.

I finally found the cause. Look at the following code:
PHP Code:
function phpdigGetUrl($url,$cookies=array()) {
/* cut */
$fp = @fsockopen($host,$port);
/* cut */
//complete get
$request =
    "GET $path $http_scheme/1.1".END_OF_LINE_MARKER
    ."Host: $host$sport".END_OF_LINE_MARKER
    .$cookiesSendString
    .$auth_string
    ."Accept: */*".END_OF_LINE_MARKER
    ."Accept-Charset: ".PHPDIG_ENCODING.END_OF_LINE_MARKER
    ."Accept-Encoding: identity".END_OF_LINE_MARKER
    ."User-Agent: PhpDig/".PHPDIG_VERSION." (+http://www.phpdig.net/robot.php)".END_OF_LINE_MARKER.END_OF_LINE_MARKER;

fputs($fp,$request);

//get return page
/* cut */
while (!$stop && !feof($fp)) {
    $flag_to_stop_loop++;
    $answer = fgets($fp,8192);
    /* cut */
    if ($flag_to_stop_loop == 10000) { break; }
    /* cut */
}
/* cut */

The spider opens a connection to the site with fsockopen(), sends a GET request, and then reads the response from the socket with fgets(), up to 8 KB per call. The problem is in the HTTP request headers themselves.

As stated in the HTTP/1.1 specification (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html):
Quote:
HTTP/1.1 applications that do not support persistent connections MUST include the "close" connection option in every message.
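In our case the request never asks for the connection to be closed, so the server treats it as persistent and keeps it open after sending the response. feof() therefore does not return TRUE when the page ends. In the worst case the "while" loop iterates 10000 times, and that is where the delay comes from.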
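
To illustrate why the loop depends on the connection being closed, here is a minimal sketch of a read loop of the same shape. readUntilEof() is a hypothetical helper written for this post, not PhpDig code, and a php://memory stream stands in for a socket whose peer has already closed the connection.

```php
<?php
// Hypothetical helper for illustration only; PhpDig's own loop differs.
// Reads lines from an open stream until EOF or until $maxLines is reached.
function readUntilEof($fp, $maxLines = 10000)
{
    $lines = array();
    $count = 0;
    while (!feof($fp) && $count < $maxLines) {
        $line = fgets($fp, 8192);
        if ($line === false) {
            break; // nothing more to read (EOF, read error or timeout)
        }
        $lines[] = $line;
        $count++;
    }
    return $lines;
}

// Stand-in for a socket after the server has closed the connection:
// an in-memory stream reports EOF once its contents are consumed.
$fp = fopen('php://memory', 'r+');
fwrite($fp, "HTTP/1.1 200 OK\r\nConnection: close\r\n\r\n<html></html>\r\n");
rewind($fp);
$lines = readUntilEof($fp);
fclose($fp);

echo count($lines), " lines read\n";
```

On a real socket that the server keeps open, the same loop has no EOF to stop on, so it ends only when the counter runs out or a read gives up. Calling stream_set_timeout() on the socket is one way to bound how long each fgets() may block, but it works around the problem rather than fixing the request.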

To fix this, search for the following line in the robot_functions.php file:
."User-Agent: PhpDig/".PHPDIG_VERSION." (+http://www.phpdig.net/robot.php)".END_OF_LINE_MARKER.END_OF_LINE_MARKER;
There are 3 matches in total (lines 354, 460 and 628).

Insert this line before each of them:
."Connection: close".END_OF_LINE_MARKER
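
For reference, a rough sketch of what a request looks like once the header is in place. buildGetRequest() is a hypothetical stand-alone function, not part of PhpDig; the literal "HTTP/1.1" and "\r\n" are assumptions standing in for $http_scheme and END_OF_LINE_MARKER, and the host and user agent are made up.

```php
<?php
// Hypothetical stand-alone helper, not PhpDig code. "\r\n" and the literal
// "HTTP/1.1" stand in for END_OF_LINE_MARKER and $http_scheme (assumptions).
function buildGetRequest($host, $path, $userAgent)
{
    $eol = "\r\n";
    return "GET $path HTTP/1.1" . $eol
        . "Host: $host" . $eol
        . "Accept: */*" . $eol
        . "Accept-Encoding: identity" . $eol
        . "Connection: close" . $eol   // the added header
        . "User-Agent: $userAgent" . $eol . $eol; // blank line ends the headers
}

$request = buildGetRequest('www.example.com', '/index.html', 'ExampleBot/1.0');
echo $request;
```

The new header has to come before the User-Agent line because that line carries the double END_OF_LINE_MARKER that terminates the header block; anything appended after it would no longer be sent as a header.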

Hope this will help.

Last edited by vital; 11-01-2004 at 09:54 AM.