wrapper/multiple spiders... [Archive]


CentaurAtlas

11-16-2006, 07:52 AM

Since the forum won't let me reply to this thread, I'm posting here... ;-)
http://www.phpdig.net/forum/showthread.php?t=662

(I know that is an old thread, but I think people are still playing with the wrapper.php mod).

To answer some of the questions:

1. You should also comment out these two lines at the beginning of wrapper.php:
//show_source("wrapper.php");
//exit;

2. You put wrapper.php in the admin directory.

3. You should log in to the command line (e.g. ssh or telnet) and then you cd to that directory:
cd /path-to-phpdig/admin/

4. Then you should run:
php -f wrapper.php

5. From the command line, this command will show you what is running:

screen -list

6. The current wrapper code calls for at most 6 spiders/threads going:
$threads = 6;

Obviously you can increase it, but I don't know how much improvement you'd get.

I've only been playing with this for a couple of days, but all in all, the whole phpdig package is pretty cool. The two things that I am looking at improving are:
1. Number of spiders. With multiple spiders, the speed increases, but it has to be reliable. So far wrapper.php seems to be doing OK in the reliability department. Where people reported that it re-queues sites, that could be fixed by having it check the last-spidered date before re-spidering. I'm still looking into that to see whether I hit the same issue...
2. Modularization. The main code isn't very encapsulated so it makes it more difficult to understand and modify without lots of side effects. ;-)
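
The date check mentioned in item 1 might look something like the sketch below. This is only an illustration: `shouldRespider`, its parameters, and the 7-day default are hypothetical and are not part of phpdig or wrapper.php. It also assumes a modern PHP (7.1 or later) for the type hints.

```php
<?php
// Hypothetical helper, not part of phpdig: decide whether a site is due
// for re-spidering based on when it was last indexed (Unix timestamps).
function shouldRespider(?int $lastIndexed, int $now, int $minAgeSeconds = 604800): bool
{
    // Never indexed before: always spider.
    if ($lastIndexed === null) {
        return true;
    }
    // Re-spider only once the last run is at least $minAgeSeconds old.
    return ($now - $lastIndexed) >= $minAgeSeconds;
}
```

A wrapper could call something like this before queueing a site, so that a site finished a few minutes ago by another spider is skipped rather than queued again.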

It is great to have this as an option, and hopefully it will keep getting improved. ;-)

CentaurAtlas

11-18-2006, 04:07 PM

I've been looking at the multiple spider issue and for me the impetus for doing multiple spiders is to spider more pages faster (obviously, I think).

After looking at the performance of the software, the slowness isn't in waiting for the pages, but in processing them.

The time goes into the code that processes the URLs found on each page. Perhaps everyone knew this, but it is illustrative to show where the bottleneck is.

In particular if you check the performance you will see that the
foreach($urls as $lien) { ...
}

loop takes the majority of the time: around 50-60 seconds per page.
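
One way to check this kind of thing is to wrap the suspect code in a timer. Here is a minimal sketch using `microtime(true)`. `timeCall` is a hypothetical helper, the URLs are made up, and the closure body only stands in for the real loop in phpdig. It assumes PHP 7.1 or later for the short array destructuring.

```php
<?php
// Hypothetical helper: run a callable and return [result, elapsed seconds].
function timeCall(callable $fn): array
{
    $start = microtime(true);            // wall-clock time as a float
    $result = $fn();
    $elapsed = microtime(true) - $start;
    return [$result, $elapsed];
}

// Stand-in for the per-page URL loop; the real loop body is not shown here.
$urls = ['http://example.com/a', 'http://example.com/b'];
[$count, $seconds] = timeCall(function () use ($urls) {
    $n = 0;
    foreach ($urls as $lien) {
        $n++;                            // placeholder for the real per-link work
    }
    return $n;
});
printf("processed %d links in %.3f s\n", $count, $seconds);
```

Timing each call inside the loop the same way narrows the slow part down to a single function.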

In addition to (or instead of) throwing 50 or 60 spiders at the problem, improving the performance of this section would greatly improve indexing performance.

More in a bit!

CentaurAtlas

11-18-2006, 04:25 PM

p.s. the loop I was referring to is in the spider.php file, in case that wasn't clear.

CentaurAtlas

11-19-2006, 08:10 AM

Here is some more information. The culprit is in the phpdigDetectDir routine inside robot_functions.php. For some pages it runs for about 20-60 seconds when it calls phpdigTestUrl, which in turn calls fgets.

The slow part is the fgets call. It is really strange for it to be that slow on a fast connection (I tried 3 different fast connections: two 100 Mbit connections and one cable modem).

So, it is something with the network call after all. For comparison, I wrote some code in a different language that grabs a couple of pages per second.

Now I'm wondering if it is something throttling back the connection for a robot...because it should not be that slow.

CentaurAtlas

11-19-2006, 09:24 AM

This is in 1.8.8 (which I'm using since mb_eregi isn't working for me), but the code there looks the same in 1.8.9rc1.

The odd thing is that the delay does not occur in phpdigGetUrl. In my experiments, pages are fetched very quickly there.

I am wondering if it is a time-out issue in here for some reason.

I don't see it with all sites, only with some (e.g. www.dmoz.com).

Calling stream_set_timeout with a 5-second value before that fgets call seems to be helping at the moment.
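
For anyone who wants to try the same thing, here is a rough sketch of the technique. It is not the actual change made inside phpdigTestUrl: `readLineWithTimeout` is a hypothetical helper, and the host in the usage comment is a placeholder. It assumes PHP 7.1 or later for the nullable return type.

```php
<?php
// Hypothetical helper: read one line from an open stream, giving up after
// $seconds. Returns the line, or null if the read timed out or failed.
function readLineWithTimeout($stream, int $seconds): ?string
{
    stream_set_timeout($stream, $seconds);   // applies to subsequent reads
    $line = fgets($stream, 4096);
    $meta = stream_get_meta_data($stream);   // 'timed_out' is set after a read timeout
    if ($line === false || $meta['timed_out']) {
        return null;
    }
    return $line;
}

// Possible usage with a socket (placeholder host, not run here):
// $fp = fsockopen('www.example.com', 80, $errno, $errstr, 5);
// fwrite($fp, "HEAD / HTTP/1.0\r\nHost: www.example.com\r\n\r\n");
// $status = readLineWithTimeout($fp, 5);
// fclose($fp);
```

Note that the fifth argument to fsockopen only limits the connection attempt. The read timeout has to be set separately with stream_set_timeout, otherwise fgets waits for the default socket timeout (default_socket_timeout in php.ini, normally 60 seconds).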

CentaurAtlas

11-19-2006, 12:07 PM

OK, it does the same thing in 1.8.9rc1. I changed 1.8.9rc1 back to use the eregi calls from 1.8.8, so what I'm running is a kind of hybrid between 1.8.8 and 1.8.9rc1.

The odd thing is that this is the only place where I can see fgets taking that huge amount of time.