PhpDig.net

Old 12-23-2003, 06:16 AM   #1
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Cron Job on Linux/Apache

Hi. Say you want to run a cron job that spiders on the 1st and 15th of every month.

First make a list of full URLs (e.g., http://www.domain.com) to be crawled, one per line, in a file called cronlist.txt (add or remove URLs in the cronlist.txt file when not indexing).

Then create a file called cronfile.txt that contains the following on one line, editing the full paths as needed:
Code:
0 0 1,15 * * /full/path/to/php -f /full/path/to/admin/spider.php /full/path/to/cronlist.txt >> /full/path/to/spider.log
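Finally, make sure that ABSOLUTE_SCRIPT_PATH is correctly set in the config file, and then install the cron file with the following shell command, editing the full paths as needed. This installs the job for the current user and replaces that user's existing crontab; to install it for another user, add -u followed by the username:
Code:
/full/path/to/crontab /full/path/to/cronfile.txt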
When the cron job first runs, spider.log is created automatically at /full/path/to/spider.log, and spider output is appended to it on every run. If the file grows large, delete it between indexing runs, or use ">" (without quotes) in place of ">>" to overwrite spider.log on each run.
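
As a rough sketch of the steps above, the two files could be created from the shell like this. The working directory, PHP path, and PhpDig path are placeholders (assumptions, not values from this thread), so substitute your own. The final crontab call is left commented out because installing a file this way replaces the current user's existing crontab.

```shell
#!/bin/sh
# Sketch only: every path below is a placeholder, not a real install location.
WORK_DIR="${WORK_DIR:-/tmp/phpdig-cron-example}"
PHP_BIN="/full/path/to/php"
PHPDIG_DIR="/full/path/to"

mkdir -p "$WORK_DIR"

# cronlist.txt: one full URL per line.
printf '%s\n' "http://www.domain.com" > "$WORK_DIR/cronlist.txt"

# cronfile.txt: a single crontab line (midnight on the 1st and 15th of each month).
printf '%s\n' "0 0 1,15 * * $PHP_BIN -f $PHPDIG_DIR/admin/spider.php $WORK_DIR/cronlist.txt >> $WORK_DIR/spider.log" > "$WORK_DIR/cronfile.txt"

# Show what would be installed.
cat "$WORK_DIR/cronfile.txt"

# Installing replaces the current user's crontab, so it is left commented out:
# crontab "$WORK_DIR/cronfile.txt"
```

Printing the file before installing it makes it easy to confirm that the whole entry really is on one line.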

You may also replace "/full/path/to/cronlist.txt" (without quotes) in the cronfile.txt file with "http://www.domain.com" or "all" or "forceall" (without quotes) for different indexing options. If you have CRON_ENABLE set to true in the config file, you may use the cronfile.txt created by PhpDig in place of a manually created cronfile.txt file.
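
To make the substitution concrete, here is a small sketch that prints the crontab line for each kind of argument. The helper name make_cron_line and the two path variables are hypothetical, introduced only for illustration; only the argument passed to spider.php changes between the lines.

```shell
#!/bin/sh
# Sketch: build the crontab line for a given spider.php argument.
# PHP_BIN and PHPDIG_DIR are placeholders; make_cron_line is a hypothetical helper.
PHP_BIN="/full/path/to/php"
PHPDIG_DIR="/full/path/to"

make_cron_line() {
    # $1 is passed straight through as the spider.php argument.
    printf '0 0 1,15 * * %s -f %s/admin/spider.php %s >> %s/spider.log\n' \
        "$PHP_BIN" "$PHPDIG_DIR" "$1" "$PHPDIG_DIR"
}

make_cron_line "all"                         # every host already known to PhpDig
make_cron_line "forceall"                    # same, with the update forced
make_cron_line "http://www.domain.com"       # a single URL
make_cron_line "/full/path/to/cronlist.txt"  # the URLs listed in a file
```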

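To see that your cron job is set, type /full/path/to/crontab -l from shell. If you want to delete the cron job, type /full/path/to/crontab -r from shell (some cron implementations, such as dcron, use -d instead). Note that this removes the user's entire crontab, not just the spider entry.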
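
If the crontab holds more than one entry, a quick filter can confirm that the spider job is among them. This is only a sketch: has_spider_job is a hypothetical helper, and it assumes crontab is on the PATH (use the full path otherwise).

```shell
#!/bin/sh
# has_spider_job is a hypothetical helper: it reads a crontab listing on stdin
# and succeeds if any non-comment line mentions spider.php.
has_spider_job() {
    grep -v '^[[:space:]]*#' | grep -q 'spider\.php'
}

# crontab -l prints the current user's crontab; errors (no crontab yet) are hidden.
if crontab -l 2>/dev/null | has_spider_job; then
    echo "spider job is installed"
else
    echo "no spider job found"
fi
```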

A general cron tutorial can be found at http://www.linuxhelp.net/guides/cron/
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 01-11-2004, 02:49 PM   #2
siliconkibou
Green Mole
 
Join Date: Dec 2003
Posts: 11
Thanks for the tutorial...

Questions:

Will this update or simply fresh-spider every site in the text file list?

In connection, will this method use stored usernames and passwords for password protected sites?

What sort of server load(on average) will running spiders on a whole list of sites create?

Lastly, wouldn't it be simpler to set up a cron job to run a "spiderupdate.php" or equivalent?

spiderupdate.php could pull all the URLs out of the database and spider them according to the config settings.

Beats manually entering several hundred URLs (although one could probably export a text file with the URLs from the database table, as well).

Thanks,

-Paul
Old 01-13-2004, 06:16 PM   #3
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Options available via command line indexing are as follows:

#php -f [PHPDIG_DIR]/admin/spider.php [option]

List of options :
- all (default) : Update all hosts ;
- forceall : Force update all hosts ;
- http://mondomaine.tld : Add or update the url ;
- path/file : Add or update all urls listed in the given file.

Some examples are given here and cronlist.txt could be replaced with any of the options.

Option all updates sites according to the time limit as set via the config file or meta tag. Forceall forces the update of sites regardless of time limit. Using a single URL will index or update a site according to the time limit. Using a file will index or update the sites in the file as well as other sites already in the database according to the time limit. If site information is already stored in the database tables, that information should be used in an update. Because of the options available, a "spiderupdate.php" isn't necessary.

As for server load, that depends on the particular server. The best thing to do would be to set up some test sites, try the different options, and then run uptime or top via shell to check server load for your particular machine.
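
One way to compare the options is to note the load average just before and just after a run. The sketch below wraps any command that way; with_load_log and load_now are hypothetical helpers, and sleep stands in for the real spider.php call.

```shell
#!/bin/sh
# Sketch: record the load average before and after a command.
# load_now and with_load_log are hypothetical helpers.
load_now() {
    # /proc/loadavg is Linux-specific; fall back to uptime elsewhere.
    cat /proc/loadavg 2>/dev/null || uptime
}

with_load_log() {
    echo "load before: $(load_now)"
    "$@"
    status=$?
    echo "load after:  $(load_now)"
    return "$status"
}

# Stand-in command; in practice this would be the php -f .../spider.php call.
with_load_log sleep 1
```

The load average is a smoothed figure, so short test runs may barely move it; watching top while a longer run is in progress gives a better picture.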

EDIT: As of PhpDig 1.8.0, using a file will index or update only the sites in the file, assuming the tempspider table is empty between runs.
Old 04-01-2004, 10:46 PM   #4
jerrywin5
Orange Mole
 
Join Date: Mar 2004
Posts: 48
Here is a page that will generate cron tabs for you.

http://www.clockwatchers.com/cgi-clockwatchers/crontool
Old 04-01-2004, 11:48 PM   #5
jerrywin5
Orange Mole
 
Join Date: Mar 2004
Posts: 48
Hi Charter,

This still isn't clear to me. If you set a cron job to index URLs in a file list, once the spider has indexed the list, when the cron job runs again what method will the spider use to index these URLs? Will it index all pages found regardless of whether the update date has been reached, or will it skip files that have been recently indexed and are not due to be indexed yet?
Old 04-10-2004, 03:49 PM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Using a file will index or update the sites in the file according to the time limit. Only the forceall option ignores this update date.
Old 06-03-2004, 01:00 AM   #7
blowfish
Green Mole
 
Join Date: May 2004
Posts: 2
autoindexing, RaQ 550, Mailman

I have a RaQ 550 server running Linux, dedicated primarily for the installed Mailman listserver app. Had a difficult time trying to autoindex, using all the various approaches in various threads here. But figured it out and it's now working!

Here's how I did it by simply adding an entry to /etc/crontab

From Charter:
"Hi. If you wish to call spider.php from a directory other than the admin directory, you need to edit the first if statement in the config file so that it allows for the different path, that path being a relative and/or full path UP TO but NOT including the admin directory - no ending slash."

I added in config.php:

if ((isset($relative_script_path)) && ($relative_script_path != ".") && ($relative_script_path != "..") && ($relative_script_path != "/home/.sites/28/site1/.users/91/lists/web/search")) {
exit();
}

-----
EDIT: As of PhpDig 1.8.1 use the following in the config file.

define('ABSOLUTE_SCRIPT_PATH','/home/.sites/28/site1/.users/91/lists/web/search');
// full path up to but not including admin dir, no end slash

if ((!isset($relative_script_path)) || (($relative_script_path != ".") &&
($relative_script_path != "..") && ($relative_script_path != ABSOLUTE_SCRIPT_PATH))) {
// echo "\n\nPath $relative_script_path not recognized!\n\n";
exit();
}
-----

Then I added to: /etc/crontab

# phpdig autoindex
02 1,13 * * * root php -f /home/.sites/28/site1/.users/91/lists/web/search/admin/spider.php /home/.sites/28/site1/.users/91/lists/web/search/admin/cronlist.txt >> /home/.sites/28/site1/.users/91/lists/web/search/admin/spider.log

*note: all one line under the # phpdig autoindex, used full paths for spider.php and spider.log (didn't need full path for php -f), and used cronlist.txt containing the URL to index.
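
Unlike a per-user crontab file, an /etc/crontab entry carries a user name between the five schedule fields and the command, which is why "root" appears in the line above. As a small sketch, the fields can be split out like this (the command path is shortened to a placeholder for readability):

```shell
#!/bin/sh
# Sketch: split an /etc/crontab entry into its fields.
# The quoted here-document keeps the * fields literal; the last variable
# receives the rest of the line, i.e. the command.
read -r minute hour day_of_month month day_of_week run_as command_line <<'EOF'
02 1,13 * * * root php -f /path/to/search/admin/spider.php
EOF

echo "minute=$minute hour=$hour day_of_month=$day_of_month month=$month day_of_week=$day_of_week"
echo "run as: $run_as"
echo "command: $command_line"
```

Read this way, the entry means minute 02 of hours 1 and 13, every day, run as root.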

Now phpdig automatically reindexes the site (Mailman listserver) at 1:02am and 1:02pm.

Since phpdig is an external search app, I simply found the right html template in Mailman and added html coding to display the phpdig graphic and linked it to the search page, on the archive's table of contents page.

Hope this info helps out.