PhpDig.net

Cronjob and index problem (http://www.phpdig.net/forum/showthread.php?t=1706)

djavet 01-06-2005 03:32 AM

Cronjob problem
 
Hello,

I have a web site I wish to index with this great tool: www.john-howe.com
No problem with the web interface with a depth of 5000, except it stops after 30 minutes (once after 1 hour 23 minutes).
I read some posts in this forum and tried to run it from a cron job. I tried my crontab on the host where john-howe.com lives.
No way, it won't work... ;o(
I tried from my personal web site, www.metadelic.com, in cPanel with this syntax:
Code:

php -f http://www.john-howe.com/admin/spider.php http://www.john-howe.com
I receive this error:
Code:

No input file specified
OK, after reading a few posts, I tried:
Code:

wget http://www.john-howe.com/admin/spider.php http://www.john-howe.com
and received this by email:
Code:

Subject: Cron <metadeco@server857> wget http://www.john-howe.com/admin/spider.php http://www.john-howe.com
--13:22:00--  http://www.john-howe.com/admin/spider.php
          => `spider.php.3'
Resolving www.john-howe.com... done.
Connecting to www.john-howe.com[213.131.229.3]:80... connected.
HTTP request sent, awaiting response... 404 Not Found
13:22:01 ERROR 404: Not Found.

--13:22:01--  http://www.john-howe.com/
          => `index.html.6'
Connecting to www.john-howe.com[213.131.229.3]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

  0K .......... .........                                    54.50 KB/s

13:22:01 (54.50 KB/s) - `index.html.6' saved [19587]


FINISHED --13:22:01--
Downloaded: 19,587 bytes in 1 files

I'm disappointed, why an error 404?....
So how can I index the whole site?
Any suggestions?

Many thanks for your help and time,
Dominique Javet

djavet 01-06-2005 05:49 AM

And I forgot... it saves nothing into the DB!

Regards, DOM

djavet 01-06-2005 01:03 PM

Stop indexing in web interface
 
Hello,

I've installed the latest version of PhpDig and all is OK: I can index a part of my web site (where PhpDig is installed), but when I try to index the whole site, after a certain time (randomly?) it stops indexing and keeps the site locked.
I have safe_mode off and I don't think it's the timeout.

I also have a few problems with the cron job (it doesn't work), but I wish to make the first index from the web interface and from the root (my site has about 1500 pages (static + dynamic)).
-> http://www.phpdig.net/forum/showthread.php?t=1706

Have you experienced a problem like this? What can I do?
Should I index my site part by part, or can I tell it to index the whole site and let spider.php run all night via the web interface?
How do you proceed?

Regards, Dom

vinyl-junkie 01-06-2005 06:34 PM

Is the following URL where your spider.php file is?

Code:

http://www.john-howe.com/admin/spider.php
Seems like it should be:

Code:

/[PHPDIG-directory]/admin/spider.php
That's probably where your 404 is coming from. Note the relative script path, and that it may not work that way. I ended up having to do mine like this:

Code:

/home/username/public_html/[PHPDIG-directory]/admin/spider.php
So your whole command should probably look like this:

Code:

php -f /home/username/public_html/[PHPDIG-directory]/admin/spider.php http://www.john-howe.com/
Check out the phpdig documentation for more details on the command line interface.
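
A minimal sketch of the point above, assuming the PHP command-line binary is used: php -f reads a script from the local filesystem, so a URL in that position cannot be opened as a file, which is consistent with the "No input file specified" error reported earlier in the thread. The helper name spider_command is hypothetical and not part of PhpDig; the path is the placeholder from this post.

```shell
#!/bin/sh
# Hypothetical helper (not part of PhpDig): print the spider command line,
# refusing a URL where php -f expects a filesystem path.
spider_command() {
    script_path=$1   # filesystem path to admin/spider.php (placeholder)
    site_url=$2      # URL of the site to index
    case $script_path in
        http://*|https://*)
            echo "error: php -f needs a filesystem path, not a URL: $script_path" >&2
            return 1
            ;;
    esac
    printf 'php -f %s %s\n' "$script_path" "$site_url"
}

# Prints the command suggested in this post.
spider_command "/home/username/public_html/[PHPDIG-directory]/admin/spider.php" "http://www.john-howe.com/"
```

The function only prints the command instead of running it, so the result can be pasted into a crontab once the real path is known.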

djavet 01-06-2005 11:03 PM

Thx for your reply.
I tried this one from an external server:
Code:

wget http://www.john-howe.com/search/admin/spider.php http://www.john-howe.com
I got the following result:
Code:

Subject: Cron <metadeco@server857> wget http://www.john-howe.com/search/admin/spider.php http://www.john-howe.com
--08:55:00--  http://www.john-howe.com/search/admin/spider.php
          => `spider.php.5'
Resolving www.john-howe.com... done.
Connecting to www.john-howe.com[213.131.229.3]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

  0K                                                          15.52 KB/s

08:55:01 (15.52 KB/s) - `spider.php.5' saved [747]

--08:55:01--  http://www.john-howe.com/
          => `index.html.10'
Connecting to www.john-howe.com[213.131.229.3]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

  0K .......... .........                                    81.64 KB/s

08:55:02 (81.64 KB/s) - `index.html.10' saved [19563]


FINISHED --08:55:02--
Downloaded: 20,310 bytes in 2 files

It seems to index only the main page, but not the following ones... I don't understand....
My unprotected admin:
http://www.john-howe.com/search/admin/

Why does the root index stop after a short time? :bang:


I also tried these internally via my admin web panel, with no result :angry: :
Code:

/usr/bin/php -f /home/www/web330/html/search/admin/temp/spider.php forceall http://www.john-howe.com/search/admin/cronfile.txt >> spider.log

/usr/bin/php -f /home/www/web330/html/search/admin/spider.php http://www.john-howe.com >> spider.log

/usr/bin/php5 -f /home/www/web330/html/search/admin/spider.php http://www.john-howe.com

I don't know what I must do...
I appreciate your help and time.

Regards, Dominique
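
A note on the wget log above, as a hedged sketch: wget treats every non-option argument as a separate URL to download, so the second URL is fetched as its own file rather than handed to spider.php as a parameter. That matches the log, which saves both spider.php.5 and index.html.10. The function name wget_targets is made up for illustration; it only prints what would be requested and does not call wget.

```shell
#!/bin/sh
# Hypothetical illustration: list how wget would interpret its arguments.
# Simplified: it does not handle options that take a separate value.
wget_targets() {
    for arg in "$@"; do
        case $arg in
            -*) ;;                                          # options are not download targets
            *)  printf 'separate download: %s\n' "$arg" ;;  # each URL is fetched on its own
        esac
    done
}

# The cron command from this post: two URLs, therefore two downloads.
wget_targets http://www.john-howe.com/search/admin/spider.php http://www.john-howe.com
```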

Charter 01-06-2005 11:27 PM

As for the spider stopping prematurely, packets get lost, connections drop, browsers or servers timeout, hosts may kill the process, take your pick. As for setting a cron or running PhpDig from shell, see section 7 of the updated documentation.
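
One way to narrow down which of these causes applies, sketched under assumptions: wrap the shell command so the log records when it started and which exit status it ended with. The function run_logged is hypothetical and is not something PhpDig or its documentation provides. In a POSIX shell, a status above 128 conventionally means the process was ended by a signal, which would point toward the host killing it.

```shell
#!/bin/sh
# Hypothetical wrapper: append start time, command output, and exit status to a log.
run_logged() {
    log_file=$1
    shift
    printf '%s start: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$log_file"
    "$@" >> "$log_file" 2>&1          # run the command, capturing stdout and stderr
    status=$?
    printf '%s exit status: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$status" >> "$log_file"
    return "$status"
}

# Example call (paths taken from this thread; adjust for your host):
# run_logged spider.log /usr/bin/php -f /home/www/web330/html/search/admin/spider.php http://www.john-howe.com
```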

djavet 01-07-2005 04:05 AM

Thx a lot! Seems to work :banana:
It has been indexing for 2 hours via cron and it's still going.

But I must note that without replacing the relative path (the one written in config.php) in the files, it's not working (for me). I have to replace it in all the script pages, and then and only then is it working...

I use the cron job on the same host as my site. From another, physically distant web cron server, it's not working. I must find out why.... BTW, now it's working and indexing my site.

Thx a lot for your explanation. Now it's clear with your updated documentation.

All my best regards,
Dominique


PS: Great software and great job, thanks! Greetings from Switzerland.

Charter 01-07-2005 04:41 AM

Glad it's working for you, but you don't have to change all the files, just set ABSOLUTE_SCRIPT_PATH in the config file.
PHP Code:

// full path up to but not including admin dir, no end slash
define('ABSOLUTE_SCRIPT_PATH','/home/www/web330/html/search'); 
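
A quick way to sanity-check the value before putting it in the config file, as a sketch only: the comment above says the path stops before the admin directory and has no trailing slash, so admin/spider.php should exist directly beneath it. check_script_path is a hypothetical helper, not part of PhpDig.

```shell
#!/bin/sh
# Hypothetical helper: check a candidate ABSOLUTE_SCRIPT_PATH value.
check_script_path() {
    base=$1
    case $base in
        */)
            echo "error: remove the trailing slash: $base" >&2
            return 1
            ;;
    esac
    if [ ! -f "$base/admin/spider.php" ]; then
        echo "error: $base/admin/spider.php not found" >&2
        return 1
    fi
    echo "ok: $base"
}

# Example call with the path from this post (run it on the host itself):
# check_script_path /home/www/web330/html/search
```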


djavet 01-07-2005 06:50 AM

I've done that, but when I then go to the web interface, I get a blank white screen... and then, after replacing the relative paths with the absolute one, everything works well again.

Hummm...

Dom

Charter 01-07-2005 07:09 AM

What version of PhpDig are you using?

djavet 01-09-2005 10:38 AM

Hello,

The latest one, 1.8.6, on Linux.
What I notice too is that my cron job works only with forceall! When I use all or my domain to update, it's not working...

Regards, Dom

Charter 01-09-2005 11:02 AM

That doesn't make sense. Read the documentation in toto and see if it doesn't help.

djavet 01-09-2005 11:16 AM

I understand, and I read the documentation very carefully, but that's the truth :o
Maybe it depends on the ISP, I don't know... I also tried the wget command from an external site, and that doesn't work. Why does it work for somebody else and not for me? Don't know.
I'm still trying and testing.

Dom

Charter 01-09-2005 11:19 AM

What is your LIMIT_DAYS set to in the config file?

djavet 01-09-2005 12:03 PM

define('SEARCH_DEFAULT_LIMIT',10); //results per page

define('SPIDER_MAX_LIMIT',2000); //max recurse levels in spider
define('RESPIDER_LIMIT',5); //recurse respider limit for update
define('LINKS_MAX_LIMIT',20); //max links per each level
define('RELINKS_LIMIT',5); //recurse links limit for an update

//for limit to directory, URL format must either have file at end or ending slash at end
//e.g., http://www.domain.com/dirs/ (WITH ending slash) or http://www.domain.com/dirs/dirs/index.php
define('LIMIT_TO_DIRECTORY',false); //limit index to given (sub)directory, no sub dirs of dirs are indexed

define('LIMIT_DAYS',0); //default days before reindex a page
define('SMALL_WORDS_SIZE',2); //words to not index - must be 2 or more
define('MAX_WORDS_SIZE',30); //max word size


Dom


All times are GMT -8.