PhpDig.net

PhpDig.net (http://www.phpdig.net/forum/index.php)
-   How-to Forum (http://www.phpdig.net/forum/forumdisplay.php?f=33)
-   -   Automatic spider (http://www.phpdig.net/forum/showthread.php?t=1034)

jdc32 06-30-2004 09:37 AM

Automatic spider
 
hi there,

I want to automate adding links to the search engine.
Has anyone else played with this idea?

env:
I created a table in which all new links get stored (a lot of links). I call this table my linkspool.

So I have a cron job every 3 minutes which checks whether a new job (link) is in the spool table. If there is, the script locks the link and spiders it. After spidering, the script deletes the link from the spool. Finished!
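The claim-and-spider step described above could be sketched roughly like this; a plain text file stands in for the linkspool table, GNU sed is assumed, and `php -f spider.php` is an assumed invocation:

```shell
# Pop the first pending link from the spool; empty output (exit 1)
# means nothing is queued. A text file stands in for the linkspool table.
pop_link() {
    spool=$1
    link=$(head -n 1 "$spool" 2>/dev/null)
    [ -z "$link" ] && return 1
    sed -i '1d' "$spool"        # removing the row is the "lock"
    printf '%s\n' "$link"
}

# The 3-minute cron job would then run something like:
#   if link=$(pop_link /path/to/linkspool); then
#       php -f spider.php "$link"   # spider it; the link is already gone
#   fi
```

Note this sketch is not race-safe when cron runs overlap; a real version would need flock(1) or a DB-level claim (e.g. an UPDATE that marks one row as locked) to make taking a link atomic.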

But I have two problems!

First:
If a spider run lasts longer than 3 minutes, the next cron run takes the next link from the spool and starts another spider... that's okay. The script checks how many spiders are running, and if there are more than 5 it exits and waits until a thread is free.
But this doesn't really work well. How can I check with PHP how many PHP spider processes are open?

Second:
With the cron, the spider machine runs and runs and runs... but if a spider job gets stuck for any reason, it blocks a thread.
How can I kill, via PHP, a spider PHP PID that is older than 20 minutes, and how do I remove the link from the search engine DB?

Sorry for my bad English :)

jdc

bloodjelly 06-30-2004 10:48 AM

Hi jdc -

If you have a main script (the one that looks at the linkspool and runs spider processes), keeping track of the number of spiders is easy. Just increment a counter every time a spider is called, and when your counter variable reaches 5, you can sleep the script for a period of time and then check again.

To kill the process, check out this thread: http://www.phpdig.net/showthread.php...&highlight=PID

But instead of using a cron job, you could use exec() or system() commands through PHP.
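The counting check bloodjelly describes can also be done from the wrapper that cron invokes. A minimal sketch, assuming each spider is started as `php -f spider.php` (adjust the `pgrep` pattern to the real command line):

```shell
# Gate new spider launches on how many are already running.
can_start_spider() {
    running=$1      # current number of spider processes
    limit=${2:-5}   # maximum spiders allowed in parallel
    [ "$running" -lt "$limit" ]
}

# The cron wrapper would then do something like:
#   if can_start_spider "$(pgrep -fc 'php -f spider.php')"; then
#       php -f spider.php &
#   fi
```

From inside PHP, the same count could be obtained with `exec("pgrep -fc 'php -f spider.php'", $out)` and compared against the limit.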

jdc32 07-01-2004 12:08 AM

Okay, that's cool.
With the cron I can kill the spider, but the link the spider was working on is still locked in the DB. I need a search-and-destroy session :)

After killing the spider, how can I pass a parameter (e.g. the site_id) to another script that deletes all DB entries for this link?

thx :)
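Passing the identifier along is mostly a matter of script arguments. A sketch of the "search and destroy" step, where `cleanup` stands in for a hypothetical `php -f cleanup.php <site_id>` call (inside cleanup.php the id would arrive as `$argv[1]`); none of these names are PhpDig built-ins:

```shell
# Stand-in for: php -f cleanup.php "$site_id"
# The real script would delete/unlock the DB rows for that link.
cleanup() {
    site_id=$1
    echo "would delete all rows for site_id=$site_id"
}

# Kill a stuck spider, then hand its site_id to the cleanup step.
kill_and_clean() {
    pid=$1; site_id=$2
    kill -9 "$pid" 2>/dev/null || true   # the process may already be gone
    cleanup "$site_id"
}
```

This assumes you record which PID is working on which site_id, e.g. in the linkspool row itself at the moment the link is locked.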

jdc32 07-01-2004 12:17 AM

Hmmm... after thinking about it, this cron is not really good for my problem:

10 * * * * ps -ef | grep 'php -f spider.php' | awk '{print $2}' | xargs kill -9


I start a new spider every 3 minutes and want to kill each one after 10 minutes... so more than one spider is running at a time, and this cron kills all of my spiders. That's no good.

Can I kill, via shell, only the PHP spiders that have been running for more than 10 minutes?
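Yes: filter on elapsed time instead of killing everything that matches. A sketch using the `etimes` (elapsed seconds) output column of `ps`; that column is a GNU procps feature and an assumption here, as is GNU `xargs -r` (on systems without `etimes`, the `etime` format `[[dd-]hh:]mm:ss` has to be parsed instead):

```shell
# Kill only spiders older than a cutoff, instead of every match.
kill_old_spiders() {
    max_age=${1:-600}   # seconds; 600 = 10 minutes
    ps -eo pid=,etimes=,args= \
      | grep 'php -f spider.php' \
      | grep -v grep \
      | awk -v max="$max_age" '$2 > max { print $1 }' \
      | xargs -r kill -9
}

# A cron entry running this check every 3 minutes could then be:
#   */3 * * * * /path/to/kill_old_spiders.sh
```

The `grep -v grep` keeps the pipeline from matching itself, and `xargs -r` skips the `kill` entirely when no spider is over the age limit.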



Powered by vBulletin® Version 3.7.3
Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright © 2001 - 2005, ThinkDing LLC. All Rights Reserved.