PhpDig.net

Old 02-14-2004, 04:15 AM   #1
obottek
Green Mole
 
Join Date: Sep 2003
Posts: 15
Forking when spidering

I spider a lot of sites with phpdig, which works pretty well. But sometimes it takes a really long time, especially when 404s occur (see my remark on that in the bugs/problems forum).

EDIT: Threads merged. See next post in this thread.

Also, of course, the spider has to live with the speed of the delivered websites. So if you have a site where the webserver is configured to cap the throughput per connection, you can end up waiting ages for a response, and that over and over again for each page.

An idea would be to fork the spider process. Instead of one spider process working through all items one after the other, it would be great to run multiple spider processes, each of which picks the next available site/page/document and spiders it. This would dramatically increase the speed when spidering multiple sites.
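
For illustration, a minimal sketch of what forking worker processes could look like with PHP's pcntl extension (CLI only); the spider_one_site() helper is hypothetical and just stands for "run the existing spider logic for one site":

<?php
// Fork one child per site; each child spiders one site and exits.
$sites = array('http://www.example.com/', 'http://www.example.org/');
$children = array();

foreach ($sites as $site) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die("fork failed\n");
    } elseif ($pid == 0) {
        spider_one_site($site);  // hypothetical wrapper around spider.php's per-site loop
        exit(0);
    } else {
        $children[] = $pid;      // parent keeps the child PID
    }
}

// Parent waits for all children to finish before exiting.
foreach ($children as $pid) {
    pcntl_waitpid($pid, $status);
}
?>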

I don't know what happens when I start a second spider process on the command line, so maybe this is already possible. Any ideas or details on that?

Greetings,
Olaf
obottek is offline   Reply With Quote
Old 02-14-2004, 04:19 AM   #2
obottek
Green Mole
 
Join Date: Sep 2003
Posts: 15
Timeout setting on 404

I have the problem that some documents which get a 404 on re-spidering take really long (about 7 minutes) before they are recognized as 404s. Is there a way to reduce this time, since it really blocks the whole spidering process? Imagine 7 minutes for each no-longer-existing page...

Greetings,
Olaf
obottek is offline   Reply With Quote
Old 02-14-2004, 04:59 AM   #3
obottek
Green Mole
 
Join Date: Sep 2003
Posts: 15
I just ran a test with two sites to spider and two parallel spiders. The good news is that they can work next to each other pretty well: the first process picks the first site, the second sees that the first site is locked and picks the second site.

So far, so good. But the locked sites are remembered rather than skipped, so in the end both spider processes spider everything, which doesn't save any time and produces double the traffic.

So I would suggest letting the spider processes simply skip locked sites (maybe as an option to configure in config.inc). That would make it possible to run multiple spider processes, which would take phpdig a huge step further, especially when spidering a lot of sites.
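
A rough sketch of what such an option could look like; the constant name and the $site_locked check are made up for illustration, not existing phpdig code:

<?php
// In config.inc (hypothetical new setting):
define('SPIDER_SKIP_LOCKED_SITES', true);

// In spider.php, at the point where a site turns out to be locked by
// another running spider ($site_locked stands for whatever test phpdig
// already performs):
if ($site_locked && SPIDER_SKIP_LOCKED_SITES) {
    // Another process is working on this site: move on to the next site
    // instead of remembering it and spidering it again later.
    continue;
}
?>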

By the way, I receive the following error messages at the end of both spider processes - guess that's nothing to worry about (SuSE 7.2):
file.c(384) : Freeing 0x083ED134 (10 bytes), script=spider.php
Last leak repeated 70 times
file.c(351) : Freeing 0x083FF584 (7 bytes), script=spider.php
Last leak repeated 84 times

Olaf
obottek is offline   Reply With Quote
Old 02-14-2004, 03:44 PM   #4
Charter
Head Mole
 
Charter's Avatar
 
Join Date: May 2003
Posts: 2,539
Hi. Thinking off the top of my head, and untested, but maybe you could set up different tempspider tables, using a shell script and a text file with different URLs for each robot, and modify the script as needed to use the appropriate tempspider table. Also untested, but maybe set some kind of timeout in the phpdigTestUrl function so that each spider can skip URLs where the response from the delivered website takes too long.
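
As a sketch of that first idea (untested; the argument handling and table name are only illustrative, not existing spider.php behaviour), each robot could be started with its own table suffix:

<?php
// Pick a tempspider table per robot from an extra command-line argument,
// e.g.: php -f spider.php ... bot1
$suffix = isset($argv[2]) ? preg_replace('/[^a-z0-9_]/i', '', $argv[2]) : '';
$tempspider_table = 'tempspider' . ($suffix != '' ? '_' . $suffix : '');

// Queries that currently hard-code the tempspider table would then use
// $tempspider_table instead, e.g.:
// $query = "SELECT * FROM " . $tempspider_table . " WHERE ...";
?>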
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Charter is offline   Reply With Quote
Old 02-16-2004, 01:56 AM   #5
obottek
Green Mole
 
Join Date: Sep 2003
Posts: 15
Multiple spider process
Maybe this is all thought a little too complicated. Suppose phpdig allowed multiple spider processes in general: you would define the number of spider processes in the config file. Then you would need a sort of starter script which calls the n spider processes. This starter script should generate a unique id (let's say an md5 hash of the timestamp and a random number, to split the parallel jobs). The tempspider table gets a new column, which the spider processes fill with that unique id. The spider processes should then look for all records in the temp table which do not have that unique id and which are not locked; locked sites should simply be skipped. Each spider process should of course only work down records of its current site from the temp table (the site id is given there, so that shouldn't be a problem).

That way, multiple spider processes would spider next to each other and would work down the site table step by step. Each site would be spidered by one process only, but multiple sites would be spidered by whichever spider process is available first, so they would run more or less in parallel, which increases speed.
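
A rough sketch of that starter script; the claimed_by column, the way the id is passed to spider.php, and the query are all made up for illustration:

<?php
// Start n spider processes, each with its own unique run id.
$num_spiders = 3;  // would come from config.inc

for ($i = 0; $i < $num_spiders; $i++) {
    // md5 of timestamp plus a random number, as described above.
    $run_id = md5(time() . rand());
    // Launch the spider in the background and hand it its id.
    exec('php -f spider.php ' . escapeshellarg($run_id) . ' > /dev/null 2>&1 &');
}

// Inside spider.php, each process would then claim rows before working on them,
// e.g. with a hypothetical claimed_by column on the tempspider table:
//   UPDATE tempspider SET claimed_by = '<run_id>'
//   WHERE claimed_by = '' AND locked = 0
// and afterwards only process rows WHERE claimed_by = '<run_id>'.
?>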

Timeout for connections
It's a pity that you merged the threads, because they cover completely different topics.

However, a timeout could be set on the fsockopen calls (socket_set_timeout), so maybe that can be integrated into the config in the next release. You know better than I do where those calls are used.
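
A sketch of what that could look like (values and the HEAD request are only an example; the connect timeout is the fifth fsockopen argument, and socket_set_timeout limits reads on the open stream):

<?php
$timeout = 15;  // seconds, could become a config.inc setting

$fp = fsockopen('www.example.com', 80, $errno, $errstr, $timeout);  // connect timeout
if (!$fp) {
    // Treat the URL as unreachable instead of waiting for minutes.
    echo "Connection failed: $errstr ($errno)\n";
} else {
    socket_set_timeout($fp, $timeout);  // read timeout on the stream
    fputs($fp, "HEAD / HTTP/1.0\r\nHost: www.example.com\r\n\r\n");
    $status = fgets($fp, 1024);         // e.g. "HTTP/1.1 404 Not Found"
    $meta = socket_get_status($fp);     // $meta['timed_out'] shows if the limit was hit
    fclose($fp);
}
?>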

Greetings,
Olaf
obottek is offline   Reply With Quote
Old 03-13-2004, 11:38 AM   #6
cybercox
Green Mole
 
Join Date: Jan 2004
Location: Italy
Posts: 11
Hi obottek,
I am also interested in forking.

I've just read the PHP magazine article on forking, and I'm quite determined to apply that technique to the spidering process. I think forking could give the spidering process a real injection of speed...

If anyone is interested, well, I'm waiting for your comments and suggestions...

Simone Capra
capra_nospam@erweb.it
(remove _nospam)
http://www.erweb.it
cybercox is offline   Reply With Quote