Forking when spidering [Archive]

obottek

02-14-2004, 04:15 AM

I spider a lot of sites with phpdig, which works pretty well. But sometimes it takes a really long time, especially when 404s occur (see my remark on that in the bugs/problems forum).

EDIT: Threads merged. See next post in this thread.

Also, of course, the spider is limited by the speed of the websites it fetches. So if a site's web server is configured to cap the throughput per connection, you might end up waiting ages for a response, and that over and over again for each page.

An idea would be to fork the spider process. Instead of one spider process working through all items one after the other, it would be great to run multiple spider processes, each of which picks the next available site/page/document and spiders it. This would dramatically increase the speed when spidering multiple sites.
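To illustrate why parallel workers help when most of the time is spent waiting on slow servers, here is a minimal sketch. It is written in Python, not PhpDig's PHP, and it uses a thread pool instead of forked processes; `spider_site`, `spider_all` and the site names are hypothetical stand-ins, and the delay only simulates a slow web server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def spider_site(site, delay):
    """Hypothetical stand-in for spidering one site.

    `delay` simulates the time spent waiting on a slow web server.
    """
    time.sleep(delay)
    return site


def spider_all(sites, workers):
    """Spider every (site, delay) pair with `workers` parallel workers.

    Returns the finished sites (in input order) and the elapsed seconds.
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        done = list(pool.map(lambda s: spider_site(*s), sites))
    return done, time.monotonic() - start


# Assumed example data: four sites that each take 0.2 s to answer.
SITES = [("site-a", 0.2), ("site-b", 0.2), ("site-c", 0.2), ("site-d", 0.2)]

if __name__ == "__main__":
    _, serial = spider_all(SITES, workers=1)
    _, parallel = spider_all(SITES, workers=4)
    print(f"1 worker: {serial:.2f}s, 4 workers: {parallel:.2f}s")
```

With one worker the waits add up; with four workers they overlap, so the total is roughly the time of the slowest site. The sketch leaves out the question of how workers avoid picking the same site.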

I don't know what happens when I start a second spider process on the command line, so maybe that is already possible. Any ideas or details on that?

Greetings,
Olaf

obottek

02-14-2004, 04:19 AM

I have the problem that some documents which return 404 on re-spidering take minutes (about 7 minutes) before they are recognized as 404. Is there a way to reduce this time, since it is really blocking the whole spidering process? Imagine 7 minutes for each page that no longer exists...

Greetings,
Olaf

obottek

02-14-2004, 04:59 AM

Just ran a test with two sites to spider and two parallel spiders. The good news is that they can work next to each other pretty well. The first process picks the first site; the second sees that the first site is locked and picks the second site.

So far, so good. But the locked sites are remembered rather than skipped. So in the end both spider processes spider everything, which saves no time at all and produces double the traffic.

So I would suggest allowing the spider processes to simply skip locked sites (maybe as an option to configure in the config.inc). That would make it possible to run multiple spider processes, which would take phpdig a huge step forward, especially when spidering a lot of sites.

By the way, I receive the following error messages at the end of both spider processes. I guess that's nothing to worry about (Suse 7.2):
file.c(384) : Freeing 0x083ED134 (10 bytes), script=spider.php
Last leak repeated 70 times
file.c(351) : Freeing 0x083FF584 (7 bytes), script=spider.php
Last leak repeated 84 times

Olaf

Charter

02-14-2004, 03:44 PM

Hi. Thinking off the top of my head, and untested, but maybe you could set up different tempspider tables, using shell and a text file with different URLs for each robot, modifying the script as needed to use the appropriate tempspider table. Also untested, but maybe set some kind of timeout in the phpdigTestUrl function so that each spider can skip URLs where the website takes too long to respond.

obottek

02-16-2004, 01:56 AM

Multiple spider processes
Maybe that is all thought out a little too complicated. Suppose phpdig allowed multiple spider processes in general. That would mean that one defines the number of spider processes in the config file. Now we would need a sort of starter script, which calls the n spider processes. This starter script should generate a unique id (let's say an md5 hash of the timestamp and a random number, to tell parallel jobs apart). The temp-spider table gets a new column, which the spider processes fill with that unique id. The spider processes should then look for all records in temp which do not have that unique id and which are not blocked. Blocked sites should simply be skipped. Each spider process should then of course only work through the records in the temp table that belong to its current site (the site id is given there, so that shouldn't be a problem).

That way, multiple spider processes would spider next to each other and would work through the site table step by step. Each site would be spidered by one process only, but multiple sites would each be picked up by the first available spider process, so the work would run more or less in parallel, which increases speed.

Timeout for connections
It's a pity that you merged the threads, because they deal with completely different topics.

However, a timeout could be defined on the fsockopen connections (socket_set_timeout). So maybe that can be integrated into the config in the next release. I mean, you know better where you have used those commands.
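As a language-neutral illustration of the timeout idea, here is a minimal Python sketch (not PhpDig's PHP code) that gives up on a URL check when the server does not connect or answer in time. The function name `fetch_status_line` and the 5 second default are assumptions for the example, not values taken from phpdig.

```python
import socket


def fetch_status_line(host, port, path="/", timeout=5.0):
    """Return the first line of the HTTP response, or None on failure.

    None is returned when the server cannot be reached or does not answer
    within `timeout` seconds. The timeout applies to connecting and to each
    individual read, not to the request as a whole.
    """
    request = f"HEAD {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)  # explicit read/write timeout
            sock.sendall(request)
            data = b""
            while b"\r\n" not in data:
                chunk = sock.recv(1024)
                if not chunk:  # server closed the connection
                    break
                data += chunk
    except OSError:  # includes timeouts and refused connections
        return None
    return data.split(b"\r\n", 1)[0].decode("latin-1") or None
```

A spider could treat a `None` result as "skip this URL for now" instead of blocking for minutes on a page that no longer exists.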

Greetings,
Olaf

cybercox

03-13-2004, 11:38 AM

Hi obottek
I am also interested in forking.
:bang:
I've just read the php magazine article on forking and
I'm determined to apply that technology
to the spidering process.
I think that forking could give the spidering process a real injection of speed...
If anyone is interested, well, I'm waiting
for your comments and suggestions...

Simone Capra
capra_nospam@erweb.it
(remove _nospam)
http://www.erweb.it