
Cronjob and index problem


djavet
01-06-2005, 03:32 AM
Hello,

I have a web site I wish to index with this great tool: www.john-howe.com
No problem with the web interface at a depth of 5000, except it stops after 30 minutes (once after 1 hour 23 minutes).
I read some posts in this forum, so I tried to run it from a cron job via the crontab on the host where john-howe.com lives.
No way, it won't work... ;o(
I then tried from my personal web site, www.metadelic.com, in my cPanel, with this syntax:

php -f http://www.john-howe.com/admin/spider.php http://www.john-howe.com

and I receive this error:
No input file specified
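
A note on that error: the PHP CLI's -f switch expects a local filesystem path, not a URL, so PHP never finds an "input file". The command needs roughly the form below, with the real server path to spider.php substituted in; the path shown here is only a placeholder:

php -f /server/path/to/phpdig/admin/spider.php http://www.john-howe.com/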

OK, after reading a few more posts, I tried:
wget http://www.john-howe.com/admin/spider.php http://www.john-howe.com

and received this by email:
Subject: Cron <metadeco@server857> wget http://www.john-howe.com/admin/spider.php http://www.john-howe.com All headers
--13:22:00-- http://www.john-howe.com/admin/spider.php
=> `spider.php.3'
Resolving www.john-howe.com... done.
Connecting to www.john-howe.com[213.131.229.3]:80... connected.
HTTP request sent, awaiting response... 404 Not Found
13:22:01 ERROR 404: Not Found.

--13:22:01-- http://www.john-howe.com/
=> `index.html.6'
Connecting to www.john-howe.com[213.131.229.3]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

0K .......... ......... 54.50 KB/s

13:22:01 (54.50 KB/s) - `index.html.6' saved [19587]


FINISHED --13:22:01--
Downloaded: 19,587 bytes in 1 files
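
For what it's worth, wget treats every URL on its command line as a separate download, so the single line above behaves roughly like:

wget http://www.john-howe.com/admin/spider.php   # one download (here it 404s)
wget http://www.john-howe.com                    # another download: just the homepage

Even with the right path to spider.php, the second URL would never be handed to the script as an argument; wget would simply save two files.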


I'm disappointed; why a 404 error?...
So how can I index the whole site?
Any suggestions?

Thanks a lot for your help and time,
Dominique Javet

djavet
01-06-2005, 05:49 AM
And I forgot... it saves nothing into the DB!

Regards, DOM

djavet
01-06-2005, 01:03 PM
Hello,

I've installed the latest version of PhpDig and all is OK: I can index the part of my web site where PhpDig is installed, but when I try to index the whole site, after a certain time (randomly?) it stops indexing! And it keeps the site locked.
I have safe_mode off and I don't think it's the timeout.

I also have a few problems with cron jobs (they don't work), but I wish to make the first index from the web and from the root (my site has about 1500 pages, static + dynamic).
-> http://www.phpdig.net/forum/showthread.php?t=1706

Have you experienced problems like this? What can I do?
Should I index my site part by part, or can I index the whole site and let spider.php run all night via the web interface?
How do you proceed?

Regards, Dom

vinyl-junkie
01-06-2005, 06:34 PM
Is the following URL where your spider.php file is?

http://www.john-howe.com/admin/spider.php

It seems like it should be at:

/[PHPDIG-directory]/admin/spider.php

That's probably where your 404 is coming from. Note that this is a relative script path, and it may not work that way. I ended up having to use the full path, like this:

/home/username/public_html/[PHPDIG-directory]/admin/spider.php

So your whole command should probably look like this:

php -f /home/username/public_html/[PHPDIG-directory]/admin/spider.php http://www.john-howe.com/
Check out the phpdig documentation (http://www.phpdig.net/navigation.php?action=doc#toc8) for more details on the command line interface.
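
Put into an actual crontab entry, that command might look like the sketch below; the schedule, the php binary location, and the log path are assumptions to adjust for your host:

# run the spider once a night at 03:00 and append its output to a log
0 3 * * * /usr/bin/php -f /home/username/public_html/[PHPDIG-directory]/admin/spider.php http://www.john-howe.com/ >> /home/username/spider.log 2>&1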

djavet
01-06-2005, 11:03 PM
Thanks for your reply.
I tried this one from an external server:
wget http://www.john-howe.com/search/admin/spider.php http://www.john-howe.com
I got the following result:
Subject: Cron <metadeco@server857> wget http://www.john-howe.com/search/admin/spider.php http://www.john-howe.com All headers
--08:55:00-- http://www.john-howe.com/search/admin/spider.php
=> `spider.php.5'
Resolving www.john-howe.com... done.
Connecting to www.john-howe.com[213.131.229.3]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

0K 15.52 KB/s

08:55:01 (15.52 KB/s) - `spider.php.5' saved [747]

--08:55:01-- http://www.john-howe.com/
=> `index.html.10'
Connecting to www.john-howe.com[213.131.229.3]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

0K .......... ......... 81.64 KB/s

08:55:02 (81.64 KB/s) - `index.html.10' saved [19563]


FINISHED --08:55:02--
Downloaded: 20,310 bytes in 2 files


It seems it only indexed the main page, but not the following ones... I don't understand...
My unprotected admin:
http://www.john-howe.com/search/admin/

Why, when I index from the root, does it stop after a while? :bang:


I also tried these internally via my admin web panel, with no result :angry: :
/usr/bin/php -f /home/www/web330/html/search/admin/temp/spider.php forceall http://www.john-howe.com/search/admin/cronfile.txt >> spider.log

/usr/bin/php -f /home/www/web330/html/search/admin/spider.php http://www.john-howe.com >> spider.log

/usr/bin/php5 -f /home/www/web330/html/search/admin/spider.php http://www.john-howe.com


I don't know what I must do...
I appreciate your help and time.

Regards, Dominique

Charter
01-06-2005, 11:27 PM
As for the spider stopping prematurely: packets get lost, connections drop, browsers or servers time out, hosts may kill the process, take your pick. As for setting a cron or running PhpDig from shell, see section 7 (http://www.phpdig.net/navigation.php?action=doc#toc7) of the updated documentation.
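
If it's the browser connection or the host killing the process, one common workaround (a sketch, assuming you have shell access) is to start the spider detached with nohup so it survives a dropped connection:

nohup /usr/bin/php -f /home/www/web330/html/search/admin/spider.php http://www.john-howe.com/ > spider.log 2>&1 &

The paths follow the ones posted above; whether the host permits long-running background processes is its own question.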

djavet
01-07-2005, 04:05 AM
Thanks a lot! It seems to work :banana:
It's been indexing via cron for 2 hours and it's still going.

But I must note that without replacing the relative paths written in config.php it doesn't work (for me); I had to replace them in all the script pages, and then and only then it works...

I run the cron job on the same host as my site. From another, physically distant web cron server, it doesn't work. I must find out why... BTW, it's now working and indexing my site.

Thanks a lot for your explanation. It's clear now with your updated documentation.

All my best regards,
Dominique


PS: Great software and a great job, thanks! Greetings from Switzerland.

Charter
01-07-2005, 04:41 AM
Glad it's working for you, but you don't have to change all the files, just set ABSOLUTE_SCRIPT_PATH in the config file.

// full path up to but not including admin dir, no end slash
define('ABSOLUTE_SCRIPT_PATH','/home/www/web330/html/search');

djavet
01-07-2005, 06:50 AM
I did that, but afterwards, when I go to the web interface, I get a blank white screen... and then, after replacing the relative paths with absolute ones, everything works well again.

Hummm...

Dom

Charter
01-07-2005, 07:09 AM
What version of PhpDig are you using?

djavet
01-09-2005, 10:38 AM
Hello,

The latest one, 1.8.6, on Linux.
What I also notice is that my cron job works only with forceall! When I use all or my domain to update, it doesn't work...

Regards, Dom

Charter
01-09-2005, 11:02 AM
That doesn't make sense. Read the documentation (http://www.phpdig.net/navigation.php?action=doc) in toto and see if it doesn't help.

djavet
01-09-2005, 11:16 AM
I understand, and I read the documentation very carefully, but that's the truth :o
Maybe it depends on the ISP, I don't know... I tried with the wget command from an external site, and that doesn't work. Why does it work for some people and not for me? I don't know.
I'm still trying and testing.

Dom

Charter
01-09-2005, 11:19 AM
What is your LIMIT_DAYS set to in the config file?

djavet
01-09-2005, 12:03 PM
define('SEARCH_DEFAULT_LIMIT',10); //results per page

define('SPIDER_MAX_LIMIT',2000); //max recurse levels in spider
define('RESPIDER_LIMIT',5); //recurse respider limit for update
define('LINKS_MAX_LIMIT',20); //max links per each level
define('RELINKS_LIMIT',5); //recurse links limit for an update

//for limit to directory, URL format must either have file at end or ending slash at end
//e.g., http://www.domain.com/dirs/ (WITH ending slash) or http://www.domain.com/dirs/dirs/index.php
define('LIMIT_TO_DIRECTORY',false); //limit index to given (sub)directory, no sub dirs of dirs are indexed

define('LIMIT_DAYS',0); //default days before reindex a page
define('SMALL_WORDS_SIZE',2); //words to not index - must be 2 or more
define('MAX_WORDS_SIZE',30); //max word size


Dom

Charter
01-09-2005, 12:12 PM
define('SPIDER_MAX_LIMIT',2000); //max recurse levels in spider
define('RESPIDER_LIMIT',5); //recurse respider limit for update
define('LINKS_MAX_LIMIT',20); //max links per each level
define('RELINKS_LIMIT',5); //recurse links limit for an update

Try setting RESPIDER_LIMIT and RELINKS_LIMIT to 2000 and 20 respectively.
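
In the config file that change would look like this (only these two lines change; the values are the ones suggested above):

define('RESPIDER_LIMIT',2000); //recurse respider limit for update
define('RELINKS_LIMIT',20); //recurse links limit for an update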

djavet
01-09-2005, 10:33 PM
I will try this tonight.
After a few days of testing and reading the documentation, the concepts behind these settings (recurse links, etc.) are still not so clear to me. Why use 2000 instead of 20? Why use sleep(5) when sleep(2) works really fine? Must I keep the .txt files under text_content once indexing is done? Etc.

One thing I noticed is that my ISP picks up cron job changes only every 30 minutes, while on my other ISP it's every minute... :bang: I would have saved a lot of time and frustration had I known this! And of course, it isn't documented in the ISP's online help...

So, I'm continuing with my tests. I will let you know.

Is there a tutorial about fine-tuning the settings we discussed? I found a cron tutorial, but nothing else.
BTW, it's really impressive: I have a site with 3,400 indexed pages and 696,500 keywords. Excellent.

Regards, Dom