View Full Version : Cron Job on Linux/Apache


Charter
12-23-2003, 07:16 AM
Hi. Say you want to run a cron job that spiders on the 1st and 15th of every month.

First make a list of full URLs (e.g., http://www.domain.com) to be crawled, one per line, in a file called cronlist.txt (add or remove URLs in the cronlist.txt file when not indexing).
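For example, the list could be created from the shell like so (the URLs below are placeholders; substitute your own):

```shell
# Create cronlist.txt with one full URL per line (example URLs only).
printf '%s\n' \
  'http://www.domain.com' \
  'http://www.example.com' > cronlist.txt

cat cronlist.txt
```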

Then create a file called cronfile.txt that contains the following on one line, editing the full paths as needed:

0 0 1,15 * * /full/path/to/php -f /full/path/to/admin/spider.php /full/path/to/cronlist.txt >> /full/path/to/spider.log
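The five leading fields are standard cron schedule syntax (minute, hour, day of month, month, day of week), so the entry above fires at midnight on the 1st and 15th. A quick way to see the split:

```shell
# Break the schedule portion of the crontab entry into its five fields.
set -f                       # disable globbing so the * fields survive
entry='0 0 1,15 * *'
set -- $entry                # word-split the schedule into $1..$5
echo "minute=$1 hour=$2 day_of_month=$3 month=$4 day_of_week=$5"
# prints: minute=0 hour=0 day_of_month=1,15 month=* day_of_week=*
```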

Finally, make sure that ABSOLUTE_SCRIPT_PATH is correctly set in the config file, and then type the following shell command, editing the full paths as needed:

/full/path/to/crontab /full/path/to/cronfile.txt

(Use /full/path/to/crontab -u username /full/path/to/cronfile.txt only if installing the job for another user; the -u flag requires a username argument.)

When the cron job first runs, a file named spider.log gets automatically created at /full/path/to/spider.log and spider info will be appended to this file. You may wish to delete the spider.log file when not indexing, should it get large, or use ">" (without quotes) in place of ">>" to overwrite spider.log on each run.
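If you prefer to keep appending but cap the log's size, something like the following sketch could run before each indexing pass (the one-megabyte limit is arbitrary, and the sample-log line is only there to make the snippet self-contained; in practice spider.log already exists at your full path):

```shell
LOG=spider.log        # assumed location; use your full path in practice
MAX=1048576           # truncate once the log passes ~1 MB (arbitrary limit)

# Create a small sample log so the size check below has a file to measure.
printf 'spider run output\n' > "$LOG"

# Empty the log in place once it grows past the limit.
if [ -f "$LOG" ] && [ "$(wc -c < "$LOG")" -gt "$MAX" ]; then
    : > "$LOG"
fi
```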

You may also replace "/full/path/to/cronlist.txt" (without quotes) in the cronfile.txt file with "http://www.domain.com" or "all" or "forceall" (without quotes) for different indexing options. If you have CRON_ENABLE set to true in the config file, you may use the cronfile.txt created by PhpDig in place of a manually created cronfile.txt file.

To see that your cron job is set, type /full/path/to/crontab -l from the shell. If you want to delete the cron job, type /full/path/to/crontab -r from the shell (some systems use -d in place of the standard -r flag).

A general cron tutorial can be found at http://www.linuxhelp.net/guides/cron/

siliconkibou
01-11-2004, 03:49 PM
Thanks for the tutorial...

Questions:

Will this update or simply fresh-spider every site in the text file list?

In connection, will this method use stored usernames and passwords for password protected sites?

What sort of server load (on average) will running spiders on a whole list of sites create?

Lastly, wouldn't it be simpler to set up a cron job to run a "spiderupdate.php" or equivalent?

spiderupdate.php could pull all the URLs out of the database and spider them according to the config settings.

Beats manually entering several hundred URLs (although one could probably export a text file with the URLs from the database table, as well).

Thanks,

-Paul

Charter
01-13-2004, 07:16 PM
Hi. Options available via command-line indexing are as follows:

#php -f [PHPDIG_DIR]/admin/spider.php [option]

List of options:
- all (default): update all hosts;
- forceall: force update of all hosts;
- http://mondomaine.tld: add or update the given URL;
- path/file: add or update all URLs listed in the given file.

Some examples are given here (http://www.phpdig.net/navigation.php?action=doc#toc8) and cronlist.txt could be replaced with any of the options.

Option all updates sites according to the time limit as set via the config file or meta tag. Forceall forces the update of sites regardless of time limit. Using a single URL will index or update a site according to the time limit. Using a file will index or update the sites in the file as well as other sites already in the database according to the time limit. If site information is already stored in the database tables, that information should be used in an update. Because of the options available, a "spiderupdate.php" isn't necessary.
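To make the four option forms concrete, here is a purely illustrative wrapper that classifies an argument the way the list above describes; the function name and validation logic are hypothetical (spider.php itself does the real dispatch):

```shell
# Illustrative only: classify a spider.php option per the list above.
# The actual dispatch happens inside spider.php; this just mirrors it.
classify_option() {
    case "$1" in
        all|forceall)        echo "mode: $1" ;;
        http://*|https://*)  echo "mode: single URL" ;;
        *)                   if [ -f "$1" ]; then
                                 echo "mode: file of URLs"
                             else
                                 echo "unknown option: $1" >&2
                                 return 1
                             fi ;;
    esac
    # A real invocation would then be:
    # php -f /full/path/to/admin/spider.php "$1"
}

classify_option all                      # -> mode: all
classify_option forceall                 # -> mode: forceall
classify_option http://www.domain.com    # -> mode: single URL
```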

As for server load, that depends on the particular server. The best thing to do would be to setup some test sites, try the different options, and then run uptime or top via shell to check server load for your particular machine.

EDIT: As of PhpDig 1.8.0, using a file will index or update only the sites in the file, assuming the tempspider table is empty between runs.

jerrywin5
04-01-2004, 11:46 PM
Here is a page that will generate crontab entries for you.

http://www.clockwatchers.com/cgi-clockwatchers/crontool

jerrywin5
04-02-2004, 12:48 AM
Hi Charter,

This still isn't clear to me. If you set a cron job to index the URLs in a file list, then once the spider has indexed the list, what method will the spider use when the cron job runs again? Will it index all pages found, even if the update date has not been reached, or will it skip files that have been recently indexed and are not yet due to be indexed?

Charter
04-10-2004, 04:49 PM
Hi. Using a file will index or update the sites in the file according to the time limit. Only the forceall option ignores this update date.

blowfish
06-03-2004, 02:00 AM
I have a RaQ 550 server running Linux, dedicated primarily to the installed Mailman list server app. I had a difficult time trying to autoindex using the various approaches in threads here, but I figured it out and it's now working! :D

Here's how I did it, simply by adding an entry to /etc/crontab.

From Charter:
"Hi. If you wish to call spider.php from a directory other than the admin directory, you need to edit the first if statement in the config file so that it allows for the different path, that path being a relative and/or full path UP TO but NOT including the admin directory - no ending slash."

I added in config.php:

if ((isset($relative_script_path)) && ($relative_script_path != ".") && ($relative_script_path != "..") && ($relative_script_path != "/home/.sites/28/site1/.users/91/lists/web/search")) {
    exit();
}

-----
EDIT: As of PhpDig 1.8.1 use the following in the config file.

define('ABSOLUTE_SCRIPT_PATH','/home/.sites/28/site1/.users/91/lists/web/search');
// full path up to but not including admin dir, no end slash

if ((!isset($relative_script_path)) || (($relative_script_path != ".") &&
    ($relative_script_path != "..") && ($relative_script_path != ABSOLUTE_SCRIPT_PATH))) {
    // echo "\n\nPath $relative_script_path not recognized!\n\n";
    exit();
}
-----

Then I added to: /etc/crontab

# phpdig autoindex
02 1,13 * * * root php -f /home/.sites/28/site1/.users/91/lists/web/search/admin/spider.php /home/.sites/28/site1/.users/91/lists/web/search/admin/cronlist.txt >> /home/.sites/28/site1/.users/91/lists/web/search/admin/spider.log

*note: the entry is all one line under the # phpdig autoindex comment; I used full paths for spider.php, cronlist.txt, and spider.log (didn't need a full path for php -f), and cronlist.txt contains the URL to index.

Now phpdig automatically reindexes the site (Mailman listserver) at 1:02am and 1:02pm.

Since phpdig is an external search app, I simply found the right html template in Mailman and added html coding to display the phpdig graphic and linked it to the search page, on the archive's table of contents page.

Hope this info helps out.