Unable to create the content file and crontab not working and ixwebhosting


paullind
02-23-2004, 04:06 PM
Hi

I just switched hosts in another attempt to get phpdig to work. My last host ran PHP in safe mode and had no crontab support, so I'm now trying IXWebhosting.com. Here are their main specs:

IXWebhosting.com
Linux version 2.4.22, (Red Hat Linux 7.3 2.96-110)
php4

I have 2 problems at the moment:

1] 'Warning : Unable to create the content file ../text_content/4.txt ! '

I can manually enter a site for spidering through admin/index.php, but I only get partial success: the result includes 'Warning : Unable to create the content file ../text_content/4.txt ! '. My 'text_content' folder has the correct permissions, and every site gives the same error message (with a different txt file number).

I say it's partial success because I can still search for the site successfully afterwards. I would like to know why this shows up, though, in case it causes other problems, like with my next question regarding CRONTAB.

2] Spider works ok manually but not through CRONTAB method. Any suggestions for troubleshooting methods here? Here's the nitty gritty:

Searched other threads and found this command to use:
/usr/bin/php -f /path/to/admin/spider.php cronlist2.txt >> spider.log

Here cronlist2.txt contains a list of full URLs, one per line, e.g. http://www.phpdig.net

All it does is produce a blank spider.log file. When I manually enter the sites (through admin/index.php) they work as above.

Thanks in advance,

Paul L

Charter
02-23-2004, 04:30 PM
Hi. For part one, just to be sure... do the following directories have chmod 777 permissions?

[PHPDIG_DIR]/text_content
[PHPDIG_DIR]/include
[PHPDIG_DIR]/admin/temp
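
For anyone with shell access, here is a minimal sketch for checking those three directories. `check_writable` is a hypothetical helper name, not part of phpdig. It only tests whether the current user can write to each directory, which may differ from the user the web server runs as.

```shell
# Report whether each given directory exists and is writable by the current user.
check_writable() {
  for dir in "$@"; do
    if [ -d "$dir" ] && [ -w "$dir" ]; then
      echo "ok: $dir"
    else
      echo "not writable: $dir"
    fi
  done
}
```

Run from the phpdig directory, it could be called as: check_writable text_content include admin/temp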

For part two, cd to the admin directory and use this command:

php -f spider.php cronlist2.txt > spider.log 2>&1

What does the spider.log contain now?
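
The `2>&1` part is what sends error messages to the log as well. Here is a small sketch of the difference, assuming a shell with `mktemp` available. `noisy` is a made-up stand-in for a command that prints only to stderr, not anything from phpdig.

```shell
# Hypothetical stand-in for a command that prints only an error message.
noisy() {
  echo "Warning: something went wrong" >&2
}

log_dir=$(mktemp -d)

# stdout only: the error bypasses the file, so the log stays empty.
noisy > "$log_dir/stdout_only.log"

# stdout and stderr together: the error ends up in the log file.
noisy > "$log_dir/both.log" 2>&1

# Show the size of the first log and the contents of the second.
wc -c < "$log_dir/stdout_only.log"
cat "$log_dir/both.log"
```

If the log is still blank with `2>&1` in place, the script is probably exiting without printing anything at all, rather than printing errors that get lost.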

paullind
02-23-2004, 05:13 PM
hi Charter

Got the first issue straightened out. I had previously installed phpdig and had it running on my own server, then uploaded the files to my new commercial server. I guess the text files already being there caused the error, as the problem disappeared when I deleted all the old ones.

Regarding the Crontab issue:
I don't have any shell level access, I can only enter the commands in a crontab GUI. I'll try them with your suggested modifications.

- Tried it, still no luck

It seems a lot of people here have full access to their server, or own and operate it, so they can run these commands directly. I've learned now that when looking for a commercial host for phpdig you need to ensure the host has 1] PHP safe mode off and 2] optionally, shell level access, or at least a crontab feature.

Thx again

Charter
02-23-2004, 06:08 PM
Hi. Are you using Cpanel? Some interfaces allow cron jobs to be set that way. If your interface allows it, just use the following and then view the spider.log file using FTP:

php -f spider.php cronlist2.txt > spider.log 2>&1

Another thought... maybe your host doesn't allow cron jobs to write to a file? If that is the case then use:

php -f spider.php cronlist2.txt

paullind
02-24-2004, 03:56 PM
Tried with and without the spider log output, still no luck. It does produce a blank spider log, so I think it can write output alright.

Tried simplifying it too, avoiding the cronlist file:
/usr/bin/php -f /path/to//admin/spider.php http://www.xxxxx.com
Still no luck.

Can anyone think of a way to troubleshoot this problem?

paullind
02-25-2004, 05:08 PM
To troubleshoot the cron job a little:

I created a file spider2.php, the contents of which simply print out 'hello world'.

It worked fine, and I could output it to a log file too.

Tried setting 777 permission on spider.php, still no luck.

Charter
02-26-2004, 03:01 PM
Hi. What is the content of the spider2.php file? Is it something like the following:

<?php
echo "hello world";
?>

and then "hello world" shows up in the log file?

When you check your phpinfo, is register_argc_argv set to on? If not, in the spider.php file, try setting $_SERVER variables as in this (http://www.phpdig.net/showthread.php?threadid=547) thread.
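
It may also help to check which kind of PHP binary the cron job is running, since the CLI and CGI builds can treat command-line arguments differently. This is a rough sketch, assuming `php -v` prints the SAPI name in parentheses on its first line (for example `PHP 4.3.4 (cli)`). `php_sapi_from_version` is a hypothetical helper, not a phpdig or PHP function.

```shell
# Classify a PHP binary from the first line of its `php -v` output.
php_sapi_from_version() {
  case "$1" in
    *"(cli)"*) echo "cli" ;;
    *"(cgi"*)  echo "cgi" ;;
    *)         echo "unknown" ;;
  esac
}
```

With shell or cron access it could be used as: php_sapi_from_version "$(/usr/bin/php -v | head -n 1)"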

paullind
03-01-2004, 08:14 AM
Hi

Settings: register_argc_argv=on

I've discovered that an Apache mod install of PHP does not permit passing variables from a shell command to a PHP script's argv variable.

My hosting provider suggested this fix for spider.php. I placed it just inside the first if statement of spider.php:
foreach ($_GET as $name => $value)
{
    $argv = explode("+", $name);
    array_shift($argv);
}
// this prints out what was passed
foreach ($argv as $key => $value)
{
    echo "the key is $key the value is $value ";
}

It prints out the following in a log file:
-----------------
the key is 1 the value is http:www.yahoo.com Usage: php -f spider.php [option]
Opts: all (default)
forceall
http://something
filename [containing list of urls]
--------------------


So it now seems to be passing the website URL to the spider, but unfortunately the spider does nothing with it, as the site does not show up on the list of spidered sites in the admin page.

Should I place this code elsewhere in the script? Or modify it?

Thanks again,

Charter
03-01-2004, 09:39 AM
Hi. Untested, but try the following. In spider.php inside the first if statement, right before the $br = "\n"; line, place the following:

foreach ($_GET as $name => $value) {
    $argv = explode("+", $name);
    $argc = count($argv);
    array_shift($argv);
}

paullind
03-01-2004, 04:32 PM
Hi charter

I tried your solution with and without this little extra bit to view argv/argc:
foreach ($argv as $key => $value)
{
    echo "the key is $key the value is $value ";
}
echo "argc is $argc ";

The LOG file printed out looks like this:
--------------------------------
the key is 0 the value is /path/to/spider/phpdig/admin/spider.php
the key is 1 the value is http:www.cdncc.com
argc is 2

Usage: php -f spider.php [option]
Opts: all (default)
forceall
http://something
filename [containing list of urls]
-----------------------------------

Still no result of site added to spidered list.

Do the values of argv and argc look correct?

Should 'filename' in the log report above be the site URL being spidered, or should 'http://something' show the site I am trying to spider?

Is something in config.php preventing spider.php from doing its thing?


Getting closer.....

paullind
03-01-2004, 05:39 PM
meant to say 'http://www.cdncc.com' in the middle of the last message

paullind
03-02-2004, 04:16 PM
review:

Trying to use shell scripting/crontab to call the spider and make it spider a list of websites. PHP is installed as an Apache module.

I have set up the correct crontab command. It calls spider.php and gives it the file with the list of websites (only one there now).

In spider.php $argv have values as:
the key is 0 the value is /path/to/phpdig/admin/spider.php
the key is 1 the value is /path/to/phpdig/admin/cronlist2.txt
$argc is 2

spider.php includes config.php around line 82, and the script does not make it any further beyond this include statement.

Inside config.php at line 16 I believe this 'if' statement terminates the spidering process:
--------------------
if ((isset($relative_script_path)) && ($relative_script_path != ".") && ($relative_script_path != "..")) {
exit();
}
if (eregi("config.php",$_SERVER['SCRIPT_FILENAME']) || eregi("config.php",$_SERVER['REQUEST_URI'])) {
exit();
}
---------------------
My $relative_script_path is /path/to/phpdig/, so it will exit in the first 'if'.

Why exit here? Should my $relative_script_path be something different?

Has anyone ever combined all the include files into one massive spider.php and run it to avoid potential errors with include files?

Thx again

Charter
03-02-2004, 04:22 PM
Hi. Try adding the path in the config.php file like so:

if ((isset($relative_script_path)) &&
($relative_script_path != ".") &&
($relative_script_path != "..") &&
($relative_script_path != "/path/to/phpdig/")) {
exit();
}

paullind
03-02-2004, 05:20 PM
I applied your code above and it has gotten me further along.

I've made it to the include statement in config.php for the connect.php .

The connect script does not seem to work when accessed this way (shell command)

I can manually enter sites to spider, connection ok that way.

In phpMyAdmin I get this message at the beginning:
MySQL 3.23.49-log running on 69.49.xxx.yy as abcdef@69.49.aaa.bb

I normally use the first address as the host value in the connection script. I tried the second one also, with the same result both times: the connect script does not make it beyond:
$id_connect = @mysql_connect(PHPDIG_DB_HOST,PHPDIG_DB_USER,PHPDIG_DB_PASS);

I guess a shell script cannot access MySQL the same way? I'll ask my hosting service about that one.

Thx again,

paul L:(

paullind
03-04-2004, 09:06 AM
My host did something which now allows the mysql connection script to work when called from shell/crontab as my output log file is now:
----------------------------
26412: old priority 0, new priority 18
Spidering in progress...*http://www.lockmonsters.com/ Locked*
*http://www.lockmonsters.com/ Locked*
*http://www.lockmonsters.com/ Locked*
*http://www.lockmonsters.com/ Locked*
*http://www.lockmonsters.com/ Locked*
*http://www.lockmonsters.com/ Locked*
*http://www.lockmonsters.com/ Locked*
*http://www.lockmonsters.com/ Locked*
*http://www.lockmonsters.com/ Locked*
*http://www.lockmonsters.com/ Locked*
*http://www.lockmonsters.com/ Locked*
*http://www.lockmonsters.com/ Locked*
---------------------------------
When I go to my browser and admin page, the site shows up in the spidered list (yippee!), but as locked.

I'll try to figure out why it showed up as locked now. (When I enter the site manually through the browser the site spiders alright)

paullind
03-04-2004, 11:08 AM
Dropped the tables and created them again; spidering is looking good now.

This site: http://www.newyorkislanders.com shows up on the admin page as spidered, but any search for it fails. Not sure why this site does not show up in searches after spidering.

Thanks Charter for helping me get this far.

Long live phpdig !

Charter
03-04-2004, 12:54 PM
Hi. Do you happen to know what files in the text_content directory are for that site so you can try a search on some words in these files? If not, try looking in the keywords table for a unique word that would appear on a page for that site and get the key_id, and then go to the engine table and get the spider_id for that key_id, and then go to the text_content directory and look for a file titled spider_id.txt and see what words are in that file, and then try a search using words in the spider_id.txt file.
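
The first two lookups described above can be collapsed into one query. This is a sketch only. It assumes the tables are named `keywords` and `engine` with no table prefix, and that the word column in `keywords` is called `keyword` (an assumption, so check your own schema). `spider_ids_query` is a hypothetical helper that just prints the SQL and does no escaping, so pass it a plain word.

```shell
# Print a query that lists the spider_id values for pages containing a given word.
# Assumed schema: keywords(key_id, keyword) and engine(spider_id, key_id).
spider_ids_query() {
  printf "SELECT e.spider_id FROM keywords k, engine e WHERE k.key_id = e.key_id AND k.keyword = '%s';\n" "$1"
}
```

Pasting the printed query into phpMyAdmin should list the spider_id values, and each one corresponds to a spider_id.txt file in the text_content directory.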

paullind
03-06-2004, 05:40 AM
Hi Charter

I'm going to hold off until I work things out with my host, ixwebhosting.

After 2 weeks of troubleshooting to get the script to work (in which they helped me a great deal), they now tell me the script violates 3rd party policies and have asked me not to run the spider script anymore.

I am going to make a new thread now (I'll search for a similar one first) asking people to recommend a good host for phpdig in North America (i.e. good price and little troubleshooting to get the spider script to work with crontab).

Thanks again for your help, I'll pursue the ideas of your last post hopefully in the not too distant future.

Paul L