Locking using Cron


Slider
01-04-2005, 10:50 PM
I have searched for and read all the posts that have to do with "Locked" sites.

I use cron to start the spider. After a while of spidering, it locks a site and the spider stops. After I unlock the site, the spider does not restart unless I catch the lock just as it happens.
This probably sounds familiar, as most of the posts I read describe the same thing.
As a note: The host I am on is not limiting anything and is very dependable.


1. What are the reasons a site gets locked?
2. How much time is allotted between the moment a site is locked and the moment the spider quits trying? I have to ask because the spider doesn't seem to start back up on its own after any amount of time. Knowing this will also help if I need to write a custom addition to unlock a site automatically.
3. Will I have the same locking problem when a scheduled update runs a week or a month from now?

Just trying to automate a solution so it's not a manual problem.
Most posts talked about spidering with the admin page. I don't use the admin to spider. I only use cron.

Charter
01-05-2005, 01:06 AM
Here is a tutorial I wrote to help you auto unlock and restart the process.

This tutorial assumes a *nix working environment with one spider process.

You will need to Google and/or mod the code as appropriate for your OS/setup.

First, create a file containing '.' at /full/path/to/file.txt and set its permissions to 777.

Next in spider.php find:

if (USE_RENICE_COMMAND == 1) {
print @exec('renice 18 '.getmypid()).$br;
}

And afterwards add:

$my_loc = "/full/path/to/file.txt";
$my_file = fopen($my_loc,"w+");
fputs($my_file, getmypid());
fclose($my_file);
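
A slightly more defensive way to record the PID, as a sketch only: it writes the PID in one call while holding an exclusive lock, so the cron script is less likely to read a half-written file. The function name write_pid_file is hypothetical and not part of spider.php.

```php
<?php
// Hypothetical helper (not part of spider.php): store this process's PID.
// LOCK_EX takes an exclusive lock on the file for the duration of the write.
function write_pid_file($path) {
    $written = file_put_contents($path, (string) getmypid(), LOCK_EX);
    return $written !== false;
}
```

In spider.php this would be called as write_pid_file("/full/path/to/file.txt") at the same spot as the four lines above.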

Then set the following script in a cron job and run it every so often.

<?php
$my_loc = "/full/path/to/file.txt";
$my_pid1 = file_get_contents($my_loc);
$my_pid2 = exec("ps -p $my_pid1 | grep \$? | awk '{print \$1}'");
if ($my_pid1 != $my_pid2) {
    /*
    - Spider is either dead or index is completed
    - Query the tempspider table or query the sites table
    - Find num rows in tempspider or locked val in sites
    - If num rows or locked equal zero index is completed
    - Once completed there is nothing more to be done
    - Otherwise the spider is dead so unlock the site
    - Then restart the spidering process via cron
    - You can do the code for this part ;)
    */
}
?>
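
The decision described in the comment block can be sketched as a small function. This is only an illustration under assumptions: the name watchdog_action and its return strings are made up here, and the two database values (row count in tempspider, locked value in sites) are passed in as arguments because the actual queries depend on your installation.

```php
<?php
// Hypothetical helper: decide what the cron watchdog should do next.
//   $pidAlive  - true if the PID stored in file.txt is still running
//   $tempRows  - number of rows found in the tempspider table (query not shown)
//   $locked    - locked value read from the sites table (query not shown)
function watchdog_action($pidAlive, $tempRows, $locked) {
    if ($pidAlive) {
        return 'wait';               // spider is still working, leave it alone
    }
    if ((int) $tempRows === 0 || (int) $locked === 0) {
        return 'done';               // index completed, nothing more to do
    }
    return 'unlock_and_restart';     // spider died mid-run
}
```

Only the 'unlock_and_restart' result should lead to unlocking the site and starting a new spider process.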

Asking me why the spider dies is like asking me why there are dropped packets.

Maybe the MySQL connection hung, a server timed out somewhere, and so forth.

Something somewhere burped...

jmitchell
01-05-2005, 04:47 PM
Charter, what do we put in the txt file?

Slider
01-05-2005, 07:47 PM
Thank you, Charter. That is exactly what I was looking for. I'll see what I can do with the coding part I have to write.
Charter: "Then restart the spidering process via cron" <-- this is the only part I still have to figure out. I have seen quite a few posts talking about exec(). It will take me a bit to work out how to start the cron session automatically. I will get it eventually, though.

Thanks again for the in-depth information.

djavet
01-12-2005, 12:16 PM
Hello,

I have exactly the same problem. My site runs on Linux and I tried this code:
<?php
$my_loc = "/home/www/web330/html/search/admin/temp/status.txt";
$my_pid1 = file_get_contents($my_loc);
$my_pid2 = exec("ps -p $my_pid1 | grep \$? | awk '{print \$1}'");
if ($my_pid1 != $my_pid2) {
    /*
    - Spider is either dead or index is completed
    - Query the tempspider table or query the sites table
    - Find num rows in tempspider or locked val in sites
    - If num rows or locked equal zero index is completed
    - Once completed there is nothing more to be done
    - Otherwise the spider is dead so unlock the site
    - Then restart the spidering process via cron
    - You can do the code for this part ;)
    */

    exec("/usr/bin/php -f /home/www/web330/html/search/admin/spider.php http://www.john-howe.com > /home/www/web330/html/search/admin/temp/spider.log");

}
?>

I don't know whether my code is OK or not. One thing is sure: it doesn't work on my site.
Is it supposed to write something into file.txt (status.txt in my case) when I run spider.php from the admin area? It doesn't do anything...
I'm not a pro in PHP and coding; I'm trying to do my best.

Many thanks for your help and time.

Regards, Dominique

Charter
01-12-2005, 01:56 PM
@jmitchell: Stick a '.' in the file. It doesn't matter, as it will be overwritten anyway.

@djavet: Like I said, "you will need to Google and/or mod the code as appropriate for your OS/setup," as I have no idea what $my_pid* will contain when the code is run on your machine. Also, exec(...) is not enough to do in the if statement, unless you want to keep initiating a spider process for no reason.
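
On the restart itself, one detail worth sketching: PHP's exec() waits for the command to finish unless the command's output is redirected and the command is put in the background. The helper below only builds such a command string; the name build_spider_command is hypothetical, and the paths in the usage note are the ones from the post above, so adjust them for your own setup.

```php
<?php
// Hypothetical helper: build a shell command that starts the spider in the
// background. Each argument is quoted with escapeshellarg(); stdout and
// stderr both go to the log file, and the trailing "&" detaches the process
// so that exec() can return immediately.
function build_spider_command($phpBin, $spiderScript, $url, $logFile) {
    return escapeshellarg($phpBin)
        . ' -f ' . escapeshellarg($spiderScript)
        . ' ' . escapeshellarg($url)
        . ' > ' . escapeshellarg($logFile) . ' 2>&1 &';
}
```

Usage would look like exec(build_spider_command('/usr/bin/php', '/home/www/web330/html/search/admin/spider.php', 'http://www.john-howe.com', '/home/www/web330/html/search/admin/temp/spider.log')); and it should only be called after the check has shown that the spider really died.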

djavet
01-12-2005, 10:28 PM
Hello,

Sorry to bother you, Charter, with my newbie reply. I'm still learning more every day ;)
What I don't understand (it's a little complicated for me at this stage of my programming) is what to do in the code. My exec() command works, and my status.txt is written/updated with some info while I'm spidering: one and only one number such as 7358, and then, when updated, 21753, etc.
Nothing happens when *locked* is written into spider.log.

Could you post a full working sample for unlocking and restarting the cron spidering?
I would really appreciate learning more about this, but I can't figure out how to do it.
I don't understand what you mean when you write:
"Also, exec(...) is not enough to do in the if statement, unless you want to keep initiating a spider process for no reason."


@Slider and @Slider:
Do you have working code to show us?

Many thanks for your patience with us, Charter!
Regards, Dominique

Charter
01-13-2005, 12:26 AM
Processes in a *nix environment get assigned a PID (process ID number), so $my_pid1 contains the PID that was recorded when the spider was run, and $my_pid2 checks whether a process with that PID still exists. PIDs are not unique over time, since the system eventually reuses them, but once the spider process terminates, $my_pid2 probably won't return $my_pid1, at least in the short term, which indicates that the spider process has ended. If $my_pid1 does not equal $my_pid2, you then need to determine whether the spider completed its index correctly or ended prematurely. The comments in the code suggest a way to do this check, as I don't always have the time, patience, or energy to write complete code. If you simply do an exec() whenever $my_pid1 is not equal to $my_pid2, you restart the spider regardless of how the process ended.
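
As an alternative to parsing ps output, the "does this PID still exist" check can be sketched in PHP directly. This is an illustration under assumptions, not the code from the tutorial: the name is_pid_running is made up, posix_kill() is only available when PHP's POSIX extension is loaded, and the /proc fallback only works on Linux-like systems.

```php
<?php
// Hypothetical helper: report whether a process with the given PID exists.
function is_pid_running($pid) {
    $pid = (int) trim((string) $pid);
    if ($pid <= 0) {
        return false;                 // empty file, '.', or other junk
    }
    if (function_exists('posix_kill')) {
        // Signal 0 sends nothing; it only checks whether we could signal
        // the process. This also returns false for a live process owned by
        // another user, unless the script runs as root.
        return posix_kill($pid, 0);
    }
    return is_dir("/proc/$pid");      // Linux fallback
}
```

The cron script could then call is_pid_running(file_get_contents($my_loc)) in place of comparing $my_pid1 with $my_pid2. The caveat about PID reuse applies here just the same.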

djavet
01-13-2005, 12:32 AM
Hello,

Thanks for the explanation.
But I'm lost in the code :o

Hmm, I don't know how to do that. Any help from someone?

Regards, Dominique