spider.php problem

09-29-2006, 03:04 PM

I just installed PhpDig and integrated it into my website. The install went OK (i.e., the phpdig tables were created). But then in index.php, when I tried to index a site by entering its URI and clicking 'Dig This!', which is supposed to redirect to spider.php, the spider.php page could not be accessed. This is the browser error message: "The page cannot be displayed ...Cannot find server or DNS Error".

Then, I tried to refresh the page. Now I can see the page, but it didn't seem to work either. Below is the PhpDig message:

Spidering in progress... [Stop spider]
Optimizing tables...
Indexing complete !
to admin interface.

Yes, only those 5 lines, which convinces me that spider.php isn't doing anything.

By the way, there is one issue on my server: allow_url_fopen is set to 0. I tried to work around it by adding ini_set('allow_url_fopen', '1') at the top of every PHP page.

I don't know whether allow_url_fopen or another issue is the cause of the problem.

Could somebody help?

Thanks in advance.

09-30-2006, 09:29 AM
PhpDig needs allow_url_fopen set to on. If you are using a PHP version greater than 4.3.4, then allow_url_fopen can only be set in the php.ini or httpd.conf files. There is a list here (http://www.php.net/manual/en/ini.php#ini.list) that lets you know 'what is allowed where' when using the ini_set function.

09-30-2006, 04:30 PM
I just contacted the server admin, and he switched the allow_url_fopen value to 1 (ON). But the same thing still happened: I entered the website name and clicked 'Dig This!', and spider.php was still blank with no indexing activity.

Btw the server that I use runs Linux with Apache version 1.3.27 and PHP version 4.2.3.

I tried copying the EXACT same phpdig folder to another website on a different server system (this time a Windows server), and voila, it works. Even when I entered the address of the website on the Linux server, it could crawl and index that website.

Then I thought it might be a path issue (because all the pages are in /usr/home/....../public_html/ and this search folder is in /usr/home/....../public_html/phpdig), so I tried changing 'ABSOLUTE_SCRIPT_PATH' to '/usr/home/../../phpdig', but it still wouldn't work.

What else should I do? :no:

Btw I noticed the following paragraph in the documentation:

"Note that if your OS/setup is for example a CGI loadbalanced cluster of servers, it may not possible to index sites on the cluster as there cannot be a connection back to the loadbalanced address. Also note that PhpDig is a web spider and search engine, meaning that you may have to edit you hosts file with something like " www.domain.com" in order to get PhpDig to crawl on localhost."

What does this mean?

10-01-2006, 06:50 AM
Are these directories set to 777 permissions on the Linux server?


Load balancing can be where a domain name resolves to multiple IP addresses, so basically PhpDig doesn't know what to do.
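
To see whether a domain name really does resolve to more than one IP address, something like the following rough sketch could be used. The helper names `unique_addresses` and `resolve_all` are made up for illustration, and this only inspects DNS results; it does not show how PhpDig itself connects.

```python
import socket


def unique_addresses(addrinfo):
    """Return the distinct IP addresses from getaddrinfo-style tuples, in order."""
    seen = []
    for _family, _type, _proto, _canonname, sockaddr in addrinfo:
        ip = sockaddr[0]  # first element of sockaddr is the address for IPv4 and IPv6
        if ip not in seen:
            seen.append(ip)
    return seen


def resolve_all(hostname, port=80):
    """Resolve hostname and return every distinct address it maps to."""
    info = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return unique_addresses(info)
```

Calling `resolve_all` with the site's host name on the machine that runs the spider would list the addresses; more than one entry suggests some form of DNS-level load balancing.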

10-01-2006, 10:18 AM
Please take a look at your database and make sure that it really does not contain any data about the site you sent the spider to.
And by the way, do not spider more than one site; PhpDig will not work well.
That is all.

10-01-2006, 02:07 PM
Yes, all 3 directories are already set to 777.

Btw, the website already has a search engine function that was set up by a previous developer using generic Perl scripts provided by the hosting company. Could that be the cause of the problem? (E.g., alexandercer mentioned not to spider more than one site since PhpDig will not work well; this existing Perl-based search function may or may not affect the indexing. Just a thought.)

Also, alexandercer, I already checked the database and made sure that it contained no data from any spidered site.

Btw I searched around the phpdig database and compared it with another phpdig database in another website that works. The difference I found was: in 'sites' table, the date in the 'upddate' column is printed in a wrong format, i.e. 20061001185646. In another database where PhpDig works, it was printed in this format: 2006-10-01 18:56:46

Does this date formatting difference tell us anything? Or maybe it is irrelevant and I just dug too deep.
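
For what it's worth, the two 'upddate' values above encode the same moment. Older MySQL releases display TIMESTAMP columns in the compact 14-digit form, so the difference may simply reflect different MySQL versions on the two servers; that is an assumption, not something confirmed here. A small sketch (the function name `parse_upddate` is made up) that reads both forms:

```python
from datetime import datetime

# The two layouts seen in the 'upddate' column on the two servers.
FORMATS = ("%Y%m%d%H%M%S", "%Y-%m-%d %H:%M:%S")


def parse_upddate(value):
    """Parse an upddate string in either layout into a datetime."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # not this layout, try the next one
    raise ValueError(f"unrecognized upddate value: {value!r}")


# Both strings from the post parse to the same datetime.
same = parse_upddate("20061001185646") == parse_upddate("2006-10-01 18:56:46")
```

If `same` is true, the stored value is fine and only its display differs.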

10-02-2006, 07:09 PM

Guess what? :santa: It works now. And the culprit is: robots.txt

I deleted it and the spidering worked like a charm.

Thanks Charter and alexandercer for your help!
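
For anyone hitting the same thing: robots.txt is a plain text file in the web root that tells crawlers which paths they may not fetch, and a spider that honors it will index nothing if everything is disallowed. Rather than deleting the file, its rules can be checked first. Below is a minimal sketch using Python's standard library; the robots.txt contents and the "PhpDig" user-agent string are assumptions for illustration, since the actual file was not posted.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rule sets: the first blocks every crawler from the whole
# site, the second (empty Disallow) blocks nothing.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"

blocked = is_allowed(BLOCK_ALL, "PhpDig", "http://www.domain.com/index.php")
allowed = is_allowed(ALLOW_ALL, "PhpDig", "http://www.domain.com/index.php")
```

Here `blocked` comes out False and `allowed` comes out True, so a file like the first one would explain a spider run that finishes immediately with nothing indexed.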

Dave A
10-04-2006, 01:14 AM
Sounds like the file is not being found in the path.
In the admin panel where you add the domain and start the dig, did you also set the depth to which it should crawl?

I will have a look and see what I can find out for you, but it seems to be path related from what I can see.

10-18-2006, 07:25 AM
But, excuse me, I am confused: I cannot find the file called robots.txt. Where did you find it, or is it just a config file that you wrote yourself?
I think knowing this would help us understand the pretty mole better.