Spider Problem


i_am_cam
12-28-2003, 12:50 PM
Hey, I'm having a problem trying to spider my site. I've read through the other forum topics that match my symptoms, but while the blame is mostly aimed at safe_mode being on, my host has it off.

Basically, when I run spider.php it starts indexing as I'd expect, but then hangs after about 10 seconds, with IE displaying 'Done' in the status bar. The admin index page says the site is locked because the spider is still running, but nothing else is added.

I downloaded the latest version of PhpDig (1.6.5, I believe) and am on PHP Version 4.2.3 (to see my PHP settings, if it helps, look here (http://www.liveinglasgow.com/test.php)).

The work the spider manages before it hangs is what I'd expect: I can search the pages it has indexed and am pleased with the results. I just need the whole site done! ;)

The search results page can be found here (http://www.liveinglasgow.com/search/search.php).

Thanks for any help :)

--Edit--

I've just added this screenshot of what happens while running spider.php, in case this is of any help:

Spider.php Screenshot (http://www.liveinglasgow.com/gfx/spider.jpg)

--2nd Edit--

And I thought I'd add my robots.txt file too:

User-agent: PhpDig
Disallow: /forum
Disallow: /phpMyAdmin
Disallow: /sql
Disallow: /templates
Disallow: /templates_c
Allow: /forum/index.php

User-agent: *
Disallow: /

Charter
12-28-2003, 02:27 PM
Hi. PhpDig is restrictive when it parses a robots.txt file. Try applying the code in this (http://www.phpdig.net/showthread.php?threadid=269) thread and then set the robots.txt file like so:

User-agent: PhpDig
Disallow:

User-agent: *
Disallow: /

After a crawl, you can delete/exclude directories from the admin panel. Also, does the hang always happen, and what entries are in the tempspider table?
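
To see why the two robots.txt files can behave differently, here is a minimal sketch of a strict reader that only understands User-agent and Disallow and skips every other directive, including Allow. It is not PhpDig's actual parser (that code is not shown in this thread), and the function names are made up for the example. It also ignores the "*" group and grouped User-agent lines, to keep things short.

```php
<?php
// Hypothetical helper: collect the Disallow prefixes that apply to $agent.
// Only User-agent and Disallow are read; other directives (e.g. Allow) are skipped.
function disallowed_prefixes($robots_txt, $agent)
{
    $prefixes = array();
    $applies = false;
    foreach (explode("\n", $robots_txt) as $line) {
        $line = trim($line);
        if (preg_match('/^User-agent:\s*(.*)$/i', $line, $m)) {
            // A new group starts; it applies only if the agent name matches.
            $applies = (strtolower($m[1]) == strtolower($agent));
        } elseif ($applies && preg_match('/^Disallow:\s*(.*)$/i', $line, $m)) {
            if ($m[1] !== '') { // an empty Disallow blocks nothing
                $prefixes[] = $m[1];
            }
        }
    }
    return $prefixes;
}

// Hypothetical helper: a path is blocked when it starts with any prefix.
function is_blocked($path, $prefixes)
{
    foreach ($prefixes as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return true;
        }
    }
    return false;
}

// Shortened copy of the original file, and the suggested replacement.
$original  = "User-agent: PhpDig\nDisallow: /forum\nDisallow: /phpMyAdmin\n"
           . "Allow: /forum/index.php\n\nUser-agent: *\nDisallow: /\n";
$suggested = "User-agent: PhpDig\nDisallow:\n\nUser-agent: *\nDisallow: /\n";

// With Allow skipped, /forum/index.php falls under "Disallow: /forum".
var_dump(is_blocked('/forum/index.php', disallowed_prefixes($original, 'PhpDig')));
// With the suggested file, nothing is blocked for PhpDig.
var_dump(is_blocked('/forum/index.php', disallowed_prefixes($suggested, 'PhpDig')));
```

Under these assumptions the Allow line in the original file has no effect, which is one reason to start from an empty Disallow and exclude directories from the admin panel afterwards.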

i_am_cam
12-28-2003, 02:39 PM
Hi, firstly thanks for the speedy reply!

I've changed the code as suggested in the thread you linked to and modified the robots.txt file as you said, but I get the same problem each time: spider.php freezes during indexing and locks the site without indexing any further. I should also mention that I have tried removing the robots.txt file completely, with no success.

As for the tempspider table, here are the phpMyAdmin dumps in CSV (http://www.liveinglasgow.com/search/tempspider-csv.txt) and XML (http://www.liveinglasgow.com/search/tempspider-xml.txt).

Charter
12-28-2003, 02:56 PM
Hi. Adding css to the FORBIDDEN_EXTENSIONS in the config file should prevent errors from appearing in the tempspider table. Anyway, this seems like a timeout issue. What is the time limit in httpd.conf?

i_am_cam
12-28-2003, 03:05 PM
OK, I've changed the config line to

define('FORBIDDEN_EXTENSIONS','\.(gz|z|tar|css|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$');

to try and eliminate the .css files being indexed and causing errors.
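
A quick way to check what this pattern catches is to run it against a few URLs outside of PhpDig. The sketch below is only an illustration: the helper name and the sample URLs are made up, and wrapping the pattern in preg_match() with case-insensitive matching is an assumption, since how PhpDig applies the constant internally is not shown in this thread.

```php
<?php
// Pattern copied from the config line above.
define('FORBIDDEN_EXTENSIONS', '\.(gz|z|tar|css|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$');

// Hypothetical helper: true when the URL ends in one of the forbidden extensions.
function is_forbidden_url($url)
{
    // Add delimiters for preg_match(); "i" (case-insensitive) is an assumption.
    return preg_match('/' . FORBIDDEN_EXTENSIONS . '/i', $url) === 1;
}

// Made-up sample URLs, not taken from the real site.
$samples = array(
    'http://liveinglasgow.com/style.css',
    'http://liveinglasgow.com/archive.php',
    'http://liveinglasgow.com/style.css?v=2',
);
foreach ($samples as $url) {
    echo $url, ' => ', (is_forbidden_url($url) ? 'skipped' : 'crawled'), "\n";
}
```

Because the pattern is anchored with $, a stylesheet URL that carries a query string no longer ends in .css and would not be matched by it.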

As for httpd.conf, I don't have access to this file on my host :/

My max_execution_time is set to 50000, if this helps.

Charter
12-28-2003, 04:45 PM
Hi. What happens if you try to crawl using a different browser?

i_am_cam
12-29-2003, 02:04 AM
Okay, I've just tried crawling in Mozilla 1.5 and Firebird 0.7 (originally I was using IE) and the end result while running spider.php is the same: it freezes, locks the site, and doesn't index any further.

Charter
12-29-2003, 05:47 AM
Hi. I'm thinking that this is a timeout issue with your PHP being in CGI mode. The max_execution_time says 50000 but it seems like the timeout is 30 seconds. What errors, if any, are showing in your PHP error log?
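
Without access to the error log, one way to test the timeout theory is a small stand-alone script that prints the relevant settings and then writes a line every second, so the last line shown in the browser tells you roughly when the request was cut off. This is only a sketch: the function names are made up, and nothing here is part of PhpDig. On Unix-like systems the time spent in sleep() does not count against max_execution_time, so if this script stops early, the limit is more likely coming from the web server than from PHP.

```php
<?php
// Hypothetical helper: collect the settings that govern how long a script may run.
function timeout_settings()
{
    return array(
        'sapi'               => php_sapi_name(),               // e.g. "cgi", "apache", "cli"
        'max_execution_time' => ini_get('max_execution_time'), // seconds; 0 means no PHP limit
        'safe_mode'          => ini_get('safe_mode'),          // set_time_limit() is ignored in safe mode
    );
}

// Hypothetical helper: emit one line per second and return how many were written.
function heartbeat($seconds)
{
    $ticks = 0;
    for ($i = 1; $i <= $seconds; $i++) {
        sleep(1);
        echo 'still running after ', $i, "s\n";
        flush(); // push the line out so a cut-off is visible in the browser
        $ticks++;
    }
    return $ticks;
}

foreach (timeout_settings() as $name => $value) {
    echo $name, ' = ', var_export($value, true), "\n";
}
heartbeat(2); // raise to 120 or so when running it on the real server
```

If the output stops near the 30 second mark while max_execution_time reports 50000, that would support the idea that something other than the PHP setting is ending the request.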

i_am_cam
12-29-2003, 06:27 AM
I'm afraid that log_errors in php.ini is set to Off by my host, and I don't have access to php.ini to change it.

Charter
12-29-2003, 07:04 AM
Hi. I crawled your site using PHP with Server API as Apache and as CGI. Both were successful and below is the output using PHP in CGI mode. Maybe your host can shed some light on this issue as I'm not sure of the problem. :(

Spidering in progress...

--------------------------------------------------------------------------------
SITE : http://liveinglasgow.com/
Exclude paths :
- @NONE@
1:http://liveinglasgow.com/archive.php
(time : 00:00:07)
+ + + + + + +
level 1...
2:http://liveinglasgow.com/privacy.php
(time : 00:00:14)
+
3:http://liveinglasgow.com/archive.php?start=21&sort=&mode=&size=
(time : 00:00:19)
+ + + + + +
4:http://liveinglasgow.com/archive.php?start=&sort=date&mode=desc&size=
(time : 00:00:25)

5:http://liveinglasgow.com/archive.php?start=&sort=date&mode=asc&size=
(time : 00:00:29)

6:http://liveinglasgow.com/archive.php?start=&sort=title&mode=desc&size=
(time : 00:00:33)

7:http://liveinglasgow.com/archive.php?start=&sort=title&mode=asc&size=
(time : 00:00:38)

8:http://liveinglasgow.com/index.php
(time : 00:00:42)

level 2...
9:http://liveinglasgow.com/archive.php?start=41&sort=&mode=&size=
(time : 00:00:46)
+ + + +
10:http://liveinglasgow.com/archive.php?start=1&sort=&mode=&size=
(time : 00:00:52)
+ + + +
11:http://liveinglasgow.com/archive.php?start=21&sort=date&mode=desc&size=
(time : 00:00:58)

12:http://liveinglasgow.com/archive.php?start=21&sort=date&mode=asc&size=
(time : 00:01:04)

13:http://liveinglasgow.com/archive.php?start=21&sort=title&mode=desc&size=
(time : 00:01:10)

14:http://liveinglasgow.com/archive.php?start=21&sort=title&mode=asc&size=
(time : 00:01:15)

15:http://liveinglasgow.com/
(time : 00:01:17)

level 3...
16:http://liveinglasgow.com/archive.php?start=41&sort=title&mode=desc&size=
(time : 00:01:20)

17:http://liveinglasgow.com/archive.php?start=41&sort=title&mode=asc&size=
(time : 00:01:24)

18:http://liveinglasgow.com/archive.php?start=41&sort=date&mode=asc&size=
(time : 00:01:28)

19:http://liveinglasgow.com/archive.php?start=41&sort=date&mode=desc&size=
(time : 00:01:32)

20:http://liveinglasgow.com/archive.php?start=1&sort=title&mode=asc&size=
(time : 00:01:35)

21:http://liveinglasgow.com/archive.php?start=1&sort=title&mode=desc&size=
(time : 00:01:39)

22:http://liveinglasgow.com/archive.php?start=1&sort=date&mode=asc&size=
(time : 00:01:44)

23:http://liveinglasgow.com/archive.php?start=1&sort=date&mode=desc&size=
(time : 00:01:48)

No link in temporary table

--------------------------------------------------------------------------------

links found : 23
http://liveinglasgow.com/archive.php
http://liveinglasgow.com/privacy.php
http://liveinglasgow.com/archive.php?start=21&sort=&mode=&size=
http://liveinglasgow.com/archive.php?start=&sort=date&mode=desc&size=
http://liveinglasgow.com/archive.php?start=&sort=date&mode=asc&size=
http://liveinglasgow.com/archive.php?start=&sort=title&mode=desc&size=
http://liveinglasgow.com/archive.php?start=&sort=title&mode=asc&size=
http://liveinglasgow.com/index.php
http://liveinglasgow.com/archive.php?start=41&sort=&mode=&size=
http://liveinglasgow.com/archive.php?start=1&sort=&mode=&size=
http://liveinglasgow.com/archive.php?start=21&sort=date&mode=desc&size=
http://liveinglasgow.com/archive.php?start=21&sort=date&mode=asc&size=
http://liveinglasgow.com/archive.php?start=21&sort=title&mode=desc&size=
http://liveinglasgow.com/archive.php?start=21&sort=title&mode=asc&size=
http://liveinglasgow.com/
http://liveinglasgow.com/archive.php?start=41&sort=title&mode=desc&size=
http://liveinglasgow.com/archive.php?start=41&sort=title&mode=asc&size=
http://liveinglasgow.com/archive.php?start=41&sort=date&mode=asc&size=
http://liveinglasgow.com/archive.php?start=41&sort=date&mode=desc&size=
http://liveinglasgow.com/archive.php?start=1&sort=title&mode=asc&size=
http://liveinglasgow.com/archive.php?start=1&sort=title&mode=desc&size=
http://liveinglasgow.com/archive.php?start=1&sort=date&mode=asc&size=
http://liveinglasgow.com/archive.php?start=1&sort=date&mode=desc&size=
Optimizing tables...
Indexing complete !
--------------------------------------------------------------------------------
[Back] to admin interface.

i_am_cam
12-29-2003, 07:07 AM
Okay, I'll email my host and start chatting to them about this to see if they can help at all. Just out of curiosity: if I had SSH access to a shell on my host and could run the spidering script there, do you think that would work? Or would the same constraints apply as when executing it via my browser?

Thanks a lot for the help Charter, believe me it's very much appreciated :)

Cam

Charter
12-29-2003, 07:45 AM
Hi. My understanding is that program execution via SSH bypasses the web application server, but the program execution would still be subject to the PHP configuration itself.