Multiple spider processes
Maybe that is all a little too complicated, though. Suppose phpdig allowed multiple spider processes in general: one would define the number of spider processes in the config file. We would then need some kind of starter script that launches the n spider processes. This starter script should generate a unique ID (say, an md5 hash of the timestamp and a random number, to keep parallel jobs apart). The temp spider table gets a new column, which the spider processes fill with that unique ID. Each spider process then looks for records in temp that do not carry a unique ID yet and that are not blocked; blocked sites are simply skipped. Of course a spider process should only work down records from the temp table that belong to its current site (the site ID is stored there, so that shouldn't be a problem).
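Just to illustrate, the starter script could look roughly like this. This is only a sketch of the idea, not anything that exists in phpdig; SPIDER_PROCESSES and spider.php are made-up names:

<?php
// Sketch of the starter script: spawn n spiders, each tagged with a
// unique run ID so parallel jobs can tell their records apart.
define('SPIDER_PROCESSES', 4); // would come from the config file

// Unique run ID: md5 of timestamp plus a random number, so two
// starter runs launched at the same moment can never collide.
$run_id = md5(time() . mt_rand());

for ($i = 0; $i < SPIDER_PROCESSES; $i++) {
    // Launch each spider in the background, handing over the run ID
    // plus its own process number as its unique ID.
    exec('php spider.php ' . escapeshellarg($run_id . '-' . $i) . ' > /dev/null &');
}
?>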
That way, multiple spider processes would spider alongside each other and would step by step work down the site table. Each site would be spidered by only one process, but several sites would be handled at once by whichever spider process becomes available first, so the work runs more or less in parallel and speed increases. A rough sketch of that claiming step follows below.
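Inside each spider process, the claiming step could look something like this. Again just a sketch: the table and column names (temp, spider_id, blocked, site_id) are assumed for illustration and are not phpdig's real schema:

<?php
// Unique ID handed over by the starter script.
$my_id = $argv[1];

$db = mysql_connect('localhost', 'user', 'pass');
mysql_select_db('phpdig', $db);

// Pick one unclaimed, unblocked record to find the next site to work on.
$result = mysql_query("SELECT site_id FROM temp
                       WHERE spider_id = '' AND blocked = 0 LIMIT 1", $db);
if ($row = mysql_fetch_array($result)) {
    $site_id = $row['site_id'];
    // Stamp all records of that site with our ID so no other spider
    // picks them up. (A real implementation would need a lock here,
    // since two spiders could grab the same site between the SELECT
    // and the UPDATE.)
    mysql_query("UPDATE temp SET spider_id = '$my_id'
                 WHERE site_id = $site_id AND spider_id = ''", $db);
    // ...then work down exactly the records we claimed.
    $result = mysql_query("SELECT * FROM temp WHERE spider_id = '$my_id'", $db);
    while ($rec = mysql_fetch_array($result)) {
        // spider this record; it belongs to our current site only
    }
}
?>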
Timeout for connections
It's a pity that you merged the threads, because they deal with completely different topics.
However, a timeout can be set on fsockopen() calls (fsockopen() takes a connect timeout as its fifth parameter, and socket_set_timeout() limits reads on the open stream). So maybe that could be made configurable in the next release; you know better than I do where those calls are used.
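For example, it could be wired up like this. The config constants SPIDER_CONNECT_TIMEOUT and SPIDER_READ_TIMEOUT are only assumptions to show the idea, not existing phpdig settings:

<?php
define('SPIDER_CONNECT_TIMEOUT', 10); // seconds for the TCP connect
define('SPIDER_READ_TIMEOUT', 15);    // seconds per read on the open stream

$errno = 0; $errstr = '';
// fsockopen()'s fifth parameter is the connect timeout.
$fp = fsockopen('www.example.com', 80, $errno, $errstr, SPIDER_CONNECT_TIMEOUT);
if (!$fp) {
    die("connect failed: $errstr ($errno)\n");
}
// socket_set_timeout() limits how long each read may block.
socket_set_timeout($fp, SPIDER_READ_TIMEOUT);

fputs($fp, "GET / HTTP/1.0\r\nHost: www.example.com\r\n\r\n");
$page = '';
while (!feof($fp)) {
    $page .= fgets($fp, 4096);
    $status = socket_get_status($fp);
    if ($status['timed_out']) {
        break; // give up on a slow server instead of hanging the spider
    }
}
fclose($fp);
?>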
Greetings,
Olaf