PhpDig.net

Old 01-06-2005, 03:32 AM   #1
djavet
Orange Mole
 
Join Date: Jan 2005
Posts: 31
Cronjob problem

Hello,

I have a web site I wish to index with this great tool: www.john-howe.com
No problem with the web interface at a depth of 5000, except that it stops after 30 minutes (once after 1 hour 23 minutes).
I read some posts in this forum and tried to run it from a cron job. I tried a crontab on the host where john-howe.com lives.
No way, it won't work... ;o(
I then tried from my personal web site, www.metadelic.com, in its cPanel, with this syntax:
Code:
php -f http://www.john-howe.com/admin/spider.php http://www.john-howe.com
I receive this error:
Code:
No input file specified
OK, after reading a few posts, I tried:
Code:
wget http://www.john-howe.com/admin/spider.php http://www.john-howe.com
and received this by email:
Code:
Subject: Cron <metadeco@server857> wget http://www.john-howe.com/admin/spider.php http://www.john-howe.com  All headers  
--13:22:00--  http://www.john-howe.com/admin/spider.php
          => `spider.php.3'
Resolving www.john-howe.com... done.
Connecting to www.john-howe.com[213.131.229.3]:80... connected.
HTTP request sent, awaiting response... 404 Not Found
13:22:01 ERROR 404: Not Found.

--13:22:01--  http://www.john-howe.com/
          => `index.html.6'
Connecting to www.john-howe.com[213.131.229.3]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

   0K .......... .........                                     54.50 KB/s

13:22:01 (54.50 KB/s) - `index.html.6' saved [19587]


FINISHED --13:22:01--
Downloaded: 19,587 bytes in 1 files
I'm disappointed. Why a 404 error?
So how can I index the whole site?
Any suggestions?

Many thanks for your help and time,
Dominique Javet
Old 01-06-2005, 05:49 AM   #2
djavet
Orange Mole
 
Join Date: Jan 2005
Posts: 31
And I forgot... it saves nothing into the DB!

Regards, DOM
Old 01-06-2005, 01:03 PM   #3
djavet
Orange Mole
 
Join Date: Jan 2005
Posts: 31
Stop indexing in web interface

Hello,

I've installed the latest version of PhpDig and all is OK. I can index a part of my web site (where PhpDig is installed), but when I try to index the whole site, after a certain time (randomly?) it stops indexing and keeps the site locked.
I have safe_mode off and I don't think it's the timeout.

I also have a few problems with the cron job (it doesn't work), but I wish to make the first index from the web interface and from the root (my site has 1500 pages (static + dynamic)).
-> http://www.phpdig.net/forum/showthread.php?t=1706

Have you experienced a problem like this? What can I do?
Should I index my site part by part, or can I tell it to index the whole site and let spider.php run all night via the web interface?
How do you proceed?

Regards, Dom
Old 01-06-2005, 06:34 PM   #4
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
Is the following URL where your spider.php file is?

Code:
http://www.john-howe.com/admin/spider.php
Seems to me like it should be:

Code:
/[PHPDIG-directory]/admin/spider.php
That's probably where your 404 is coming from. Note the relative script path, and that it may not work that way. I ended up having to do mine like this:

Code:
/home/username/public_html/[PHPDIG-directory]/admin/spider.php
So your whole command should probably look like this:

Code:
php -f /home/username/public_html/[PHPDIG-directory]/admin/spider.php http://www.john-howe.com/
Check out the phpdig documentation for more details on the command line interface.
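
The point above can be sketched in shell: `php -f` reads a script from the local filesystem, so a URL such as the one in the first post produces "No input file specified". The helper name `is_local_script` below is hypothetical and not part of PhpDig; it only illustrates the check, assuming a POSIX shell.

```shell
#!/bin/sh
# Hypothetical helper (not part of PhpDig): decide whether an argument can be
# handed to `php -f`, which expects a file on the local filesystem.
is_local_script() {
    case "$1" in
        http://*|https://*) return 1 ;;   # a URL is not a filesystem path
    esac
    [ -f "$1" ]                           # must be an existing regular file
}

# Example: the URL form from the first post is rejected.
if is_local_script "http://www.john-howe.com/admin/spider.php"; then
    echo "ok: local file"
else
    echo "not a local file: use the full filesystem path instead"
fi
```

The same reasoning applies to the wget attempts: wget treats every argument as a separate URL to download, which is why the logs show the site's home page being saved as `index.html.N` rather than being passed to the spider.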
Old 01-06-2005, 11:03 PM   #5
djavet
Orange Mole
 
Join Date: Jan 2005
Posts: 31
Thanks for your reply.
I tried this one from an external server:
Code:
wget http://www.john-howe.com/search/admin/spider.php http://www.john-howe.com
I got the following result:
Code:
Subject: Cron <metadeco@server857> wget http://www.john-howe.com/search/admin/spider.php http://www.john-howe.com 	All headers
--08:55:00--  http://www.john-howe.com/search/admin/spider.php
          => `spider.php.5'
Resolving www.john-howe.com... done.
Connecting to www.john-howe.com[213.131.229.3]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

   0K                                                          15.52 KB/s

08:55:01 (15.52 KB/s) - `spider.php.5' saved [747]

--08:55:01--  http://www.john-howe.com/
          => `index.html.10'
Connecting to www.john-howe.com[213.131.229.3]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

   0K .......... .........                                     81.64 KB/s

08:55:02 (81.64 KB/s) - `index.html.10' saved [19563]


FINISHED --08:55:02--
Downloaded: 20,310 bytes in 2 files
It seems to index only the main page, but not the pages that follow... I don't understand...
My unprotected admin:
http://www.john-howe.com/search/admin/

Why, when I index from the root, does it stop after a short time?


I also tried these internally via my admin web panel, with no result:
Code:
/usr/bin/php -f /home/www/web330/html/search/admin/temp/spider.php forceall http://www.john-howe.com/search/admin/cronfile.txt >> spider.log 

/usr/bin/php -f /home/www/web330/html/search/admin/spider.php http://www.john-howe.com >> spider.log 

/usr/bin/php5 -f /home/www/web330/html/search/admin/spider.php http://www.john-howe.com
I don't know what I must do...
I appreciate your help and time.

Regards, Dominique
Old 01-06-2005, 11:27 PM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
As for the spider stopping prematurely: packets get lost, connections drop, browsers or servers time out, hosts may kill the process, take your pick. As for setting a cron or running PhpDig from shell, see section 7 of the updated documentation.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
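
For readers following along, here is a minimal sketch of a cron wrapper for the shell invocation discussed above. It is not PhpDig's official script. The paths are the ones quoted in this thread, while the log location, the `RUN_SPIDER` switch, the `spider_cmd` helper, and the wrapper's file name and schedule are assumptions made for illustration. Changing into the admin directory first is a common precaution for scripts that use relative includes; whether PhpDig needs it is not confirmed here.

```shell
#!/bin/sh
# Sketch of a cron wrapper; adjust every path for your own host.
PHP_BIN="/usr/bin/php"
ADMIN_DIR="/home/www/web330/html/search/admin"
LOG_FILE="$ADMIN_DIR/spider.log"   # assumed log location

# Print the command line that would be run for a given spider argument.
spider_cmd() {
    printf '%s -f %s/spider.php %s\n' "$PHP_BIN" "$ADMIN_DIR" "$1"
}

# Only run the spider when explicitly asked (RUN_SPIDER is a hypothetical
# switch), so the script can be dry-run without starting an index.
if [ "${RUN_SPIDER:-0}" = "1" ]; then
    cd "$ADMIN_DIR" || exit 1
    "$PHP_BIN" -f "$ADMIN_DIR/spider.php" "$1" >> "$LOG_FILE" 2>&1
fi

# Dry run: show the command for the site in this thread.
spider_cmd "http://www.john-howe.com"

# Example crontab entry (schedule and wrapper path are assumptions):
# 0 2 * * * RUN_SPIDER=1 /path/to/run_spider.sh http://www.john-howe.com
```

Running the job on the same host as the site, as here, avoids the remote wget approach that failed earlier in the thread.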
Old 01-07-2005, 04:05 AM   #7
djavet
Orange Mole
 
Join Date: Jan 2005
Posts: 31
Thanks a lot! It seems to work.
It has been indexing for 2 hours via cron and it's still going.

But I must note that without replacing the relative paths in the files (the ones set in config.php), it doesn't work (for me). I have to replace them in all the script pages, and then and only then does it work...

I use the cron job on the same host as my site. From another, physically remote web cron server, it doesn't work. I must find out why... Anyway, now it's working and indexing my site.

Thanks a lot for your explanation. It is clear now with your updated documentation.

All my best regards,
Dominique


PS: Great software and great work, thank you! Greetings from Switzerland.
Old 01-07-2005, 04:41 AM   #8
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Glad it's working for you, but you don't have to change all the files, just set ABSOLUTE_SCRIPT_PATH in the config file.
PHP Code:
// full path up to but not including admin dir, no end slash
define('ABSOLUTE_SCRIPT_PATH','/home/www/web330/html/search'); 
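
As a quick way to find that value, the following shell sketch derives it from the location of spider.php by stripping the file name and then the admin directory. The function name `script_path_for` is hypothetical; the path is the one quoted in this thread.

```shell
#!/bin/sh
# Derive a value for ABSOLUTE_SCRIPT_PATH from the full path to spider.php:
# drop the file name, then drop the admin directory.
script_path_for() {
    admin_dir=$(dirname "$1")   # .../search/admin
    dirname "$admin_dir"        # .../search, with no trailing slash
}

script_path_for "/home/www/web330/html/search/admin/spider.php"
# prints /home/www/web330/html/search
```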
Old 01-07-2005, 06:50 AM   #9
djavet
Orange Mole
 
Join Date: Jan 2005
Posts: 31
I've done that, but when I then go to the web interface, I get a blank white screen... and after replacing the relative paths with absolute ones, everything works well again.

Hummm...

Dom
Old 01-07-2005, 07:09 AM   #10
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
What version of PhpDig are you using?
Old 01-09-2005, 10:38 AM   #11
djavet
Orange Mole
 
Join Date: Jan 2005
Posts: 31
Hello,

The latest one, 1.8.6, on Linux.
What I've also noticed is that my cron job works only with forceall! When I use all or my domain to update, it doesn't work...

Regards, Dom
Old 01-09-2005, 11:02 AM   #12
Charter
Head Mole
 
Charter's Avatar
 
Join Date: May 2003
Posts: 2,539
That doesn't make sense. Read the documentation in toto and see if it doesn't help.
Old 01-09-2005, 11:16 AM   #13
djavet
Orange Mole
 
Join Date: Jan 2005
Posts: 31
I understand, and I read the documentation very carefully, but that's the truth.
Maybe it depends on the ISP, I don't know... I also tried the wget command from an external site, and that doesn't work. Why does it work for some people and not for me? I don't know.
I'm still trying and testing.

Dom
Old 01-09-2005, 11:19 AM   #14
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
What is your LIMIT_DAYS set to in the config file?
Old 01-09-2005, 12:03 PM   #15
djavet
Orange Mole
 
Join Date: Jan 2005
Posts: 31
define('SEARCH_DEFAULT_LIMIT',10); //results per page

define('SPIDER_MAX_LIMIT',2000); //max recurse levels in spider
define('RESPIDER_LIMIT',5); //recurse respider limit for update
define('LINKS_MAX_LIMIT',20); //max links per each level
define('RELINKS_LIMIT',5); //recurse links limit for an update

//for limit to directory, URL format must either have file at end or ending slash at end
//e.g., http://www.domain.com/dirs/ (WITH ending slash) or http://www.domain.com/dirs/dirs/index.php
define('LIMIT_TO_DIRECTORY',false); //limit index to given (sub)directory, no sub dirs of dirs are indexed

define('LIMIT_DAYS',0); //default days before reindex a page
define('SMALL_WORDS_SIZE',2); //words to not index - must be 2 or more
define('MAX_WORDS_SIZE',30); //max word size


Dom
All times are GMT -8.
Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright © 2001 - 2005, ThinkDing LLC. All Rights Reserved.