specifically slow spidering at fgets()


slintz
08-10-2004, 12:28 PM
I've read the other posts re: slow spidering behavior and found nothing matching my situation. Please help!

After inserting traces and such into the code, I've found a consistent delay of 10 - 15 seconds for each page being indexed which occurs across a specific function call:


FILE: robot_functions.php
FUNCTION: phpdigGetUrl()
STATEMENT: $answer = fgets($fp,8192);


(I've reformatted the code substantially, so I can't provide a specific line number. The fgets() occurs close to the top of the while (!$stop && !feof($fp)) { ... } loop.)


OS: Win 2000
HTTPD: Apache 2.0.49 (Win32)
PHP: 5.0.0
MYSQL: 4.1.3b-beta
PHPDIG: 1.8.3


As a speed check, I ran wget (Cygwin) to mirror a piece of my own local site to my own drive. PhpDig took about 4 minutes to index what wget fetched in less than 10 seconds. Although they do different things, spidering and wget'ing are similar enough that a timing differential of roughly 25:1 should not be expected...

Thanks much!

slintz
08-10-2004, 12:32 PM
PS - One more helpful(?) bit of info: while PhpDig is spidering, I've watched my CPU activity, which is mostly idle, with occasional spikes (every 10 - 15 seconds, BTW). To me, this points to a timeout issue - but I don't know where / at what layer to look. (Also, I've reduced all PhpDig sleeps to 1 or 2 seconds, and they are NOT the problem at all.) Thanks again!

vinyl-junkie
08-10-2004, 05:08 PM
Are you able to spider from shell (http://www.phpdig.net/navigation.php?action=doc#toc8)? That might be a way around the problem.

slintz
08-10-2004, 06:37 PM
Vinyl J -

Good idea (and it made me solve some incidental installation problems), yet no go (i.e. same problem, and with harder-to-read output <lol>). Anyway, as I mentioned above, the wget mirroring program doesn't have any trouble like this - it's quite zippy! That points away from the httpd software / configuration. It has all the smell of a communication timeout issue, but how do I investigate beyond the fgets() call that sticks?
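
One possible way to look into a blocking fgets() is to time each call and check the stream's timed_out flag. The following is only a rough sketch, not PhpDig code: timed_fgets_trace() is a hypothetical helper name, the 5-second timeout is an arbitrary assumption, and an in-memory stream stands in for the real socket so the snippet runs on its own.

```php
<?php
// Hypothetical helper (not part of PhpDig): read lines from an open stream,
// recording how long each fgets() call blocks and whether it timed out.
function timed_fgets_trace($fp, $timeoutSeconds = 5, $maxLen = 8192)
{
    // Cap how long a single read may block. This simply returns false on
    // streams that do not support timeouts (such as php://memory).
    stream_set_timeout($fp, $timeoutSeconds);

    $trace = array();
    while (!feof($fp)) {
        $start   = microtime(true);
        $line    = fgets($fp, $maxLen);
        $elapsed = microtime(true) - $start;
        $meta    = stream_get_meta_data($fp);

        $trace[] = array(
            'bytes'     => ($line === false) ? 0 : strlen($line),
            'seconds'   => $elapsed,
            'timed_out' => !empty($meta['timed_out']),
        );

        if ($line === false || !empty($meta['timed_out'])) {
            break; // EOF, read error, or timeout: stop reading
        }
    }
    return $trace;
}

// Demo: an in-memory stream stands in for the socket opened by the spider.
$fp = fopen('php://memory', 'w+');
fwrite($fp, "HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello");
rewind($fp);
$trace = timed_fgets_trace($fp);
fclose($fp);

foreach ($trace as $i => $t) {
    printf("read %d: %d bytes, %.4f s%s\n",
        $i, $t['bytes'], $t['seconds'], $t['timed_out'] ? ' (timed out)' : '');
}
```

On a real socket, a read that takes about as long as the timeout and reports timed_out would suggest the code is waiting for data the server never sends.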

Charter
08-15-2004, 03:05 PM
Hi. Can't say I've experienced fgets problems. Perhaps something here (http://bugs.php.net/search.php?cmd=display&search_for=fgets) might help?

slintz
08-16-2004, 01:17 PM
Well, I've found the exact problem: the code doesn't respect the Content-Length header (or, when the response is chunked, the chunk sizes). Thus, it will always attempt to read past the end of the response body. I suppose on some configurations that doesn't make a difference, but on mine it surely does! I've fully solved the problem in a test script and partially moved that solution into my own PhpDig code. If anyone cares to know more, get in touch...
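
For illustration only, here is a minimal sketch of reading a response without over-reading. It is not the actual fix described above and not PhpDig code: read_exact() and read_http_response() are hypothetical helper names, error handling is left out, and an in-memory stream holding two back-to-back responses stands in for the socket.

```php
<?php
// Hypothetical helper: read exactly $length bytes, since fread() may
// return fewer bytes than requested on a network stream.
function read_exact($fp, $length)
{
    $data = '';
    while (strlen($data) < $length && !feof($fp)) {
        $piece = fread($fp, $length - strlen($data));
        if ($piece === false || $piece === '') {
            break;
        }
        $data .= $piece;
    }
    return $data;
}

// Hypothetical helper: read one HTTP response and stop at the end of its body.
function read_http_response($fp)
{
    $status  = rtrim((string) fgets($fp, 8192), "\r\n");
    $headers = array();
    // Headers end at the first empty line.
    while (($line = fgets($fp, 8192)) !== false) {
        $line = rtrim($line, "\r\n");
        if ($line === '') {
            break;
        }
        $pos = strpos($line, ':');
        if ($pos !== false) {
            $name = strtolower(trim(substr($line, 0, $pos)));
            $headers[$name] = trim(substr($line, $pos + 1));
        }
    }

    $body = '';
    if (isset($headers['transfer-encoding'])
        && stripos($headers['transfer-encoding'], 'chunked') !== false) {
        // Chunked: each chunk is "<hex size>[;extension]\r\n<data>\r\n".
        while (($sizeLine = fgets($fp, 8192)) !== false) {
            $parts = explode(';', $sizeLine, 2);
            $size  = (int) hexdec(trim($parts[0]));
            if ($size === 0) {
                // Last chunk: skip optional trailers up to the blank line.
                while (($t = fgets($fp, 8192)) !== false && rtrim($t, "\r\n") !== '') {
                }
                break;
            }
            $body .= read_exact($fp, $size);
            fgets($fp, 8192); // consume the CRLF that follows the chunk data
        }
    } elseif (isset($headers['content-length'])) {
        $body = read_exact($fp, (int) $headers['content-length']);
    } else {
        // No length information: the body runs until the server closes.
        while (!feof($fp)) {
            $piece = fread($fp, 8192);
            if ($piece === false || $piece === '') {
                break;
            }
            $body .= $piece;
        }
    }
    return array('status' => $status, 'headers' => $headers, 'body' => $body);
}

// Demo: two responses back to back on one stream, as on a keep-alive connection.
$fp = fopen('php://memory', 'w+');
fwrite($fp, "HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello");
fwrite($fp, "HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n\r\n"
          . "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n");
rewind($fp);
$first  = read_http_response($fp);
$second = read_http_response($fp);
fclose($fp);

echo $first['body'], "\n", $second['body'], "\n";
```

Because each read stops at the declared length, the second response is still intact on the stream after the first has been read, and no call is left waiting for bytes beyond the body.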

Cheers!

Charter
08-17-2004, 03:02 PM
Will you post your mod in the Mod Submissions (http://www.phpdig.net/forumdisplay.php?forumid=24) forum?

jinkas
08-18-2004, 02:24 AM
just to throw in my two cents worth...

i'm already communicating with slintz, but this isn't a problem specific only to him...the exact same thing happens to me when i try and spider my site...i always get between 10-15 seconds (sometimes up to 20) of delay / page

here is my server info:

OS: Solaris 5.8
PHP: 4.3.8
Apache: 2.0.50
MySQL: 4.0.13
PhpDig: 1.8.3

yes, i realize that some of those are older versions, but i have no control over that...i just write the webpages :)