Yet Another indexing question


ffe
01-22-2005, 12:00 PM
I have one server. RH 9.0 runs Apache, MySQL, and 5 virtual web sites. I am able to index 4 of the sites successfully. The last site will only index 3-4 pages, then quits with no error or completion messages. I suspect the failure is caused by HTML page content. It might be an HTML coding error, an obsolete style, etc.

My question is: Are there any known coding styles, tags, comments, etc. in HTML that will cause the spider to terminate abnormally? The pages that the spider fails on display and behave correctly in MSIE, Netscape 7.1, and Firefox 1.0.

Charter
01-22-2005, 12:21 PM
Given that your 'last site' works across browsers, I doubt it's an HTML issue. Without knowing more about this last site, all I can suggest is to select the 'no' radio button, set 'search depth' to a large value, set 'links per' to zero, and give it a whirl. Depending on this last site, you might try setting LIMIT_TO_DIRECTORY to false and PHPDIG_IN_DOMAIN to true, both in the config file.

ffe
01-22-2005, 12:42 PM
Thanks for the interest in the question. I have read a number of the other posts looking for clues to the problem. I have tried all the options you mention, including making changes to config.php.

Does line length in the HTML files have any effect on the spider? Like a buffer overflow, perhaps?

Charter
01-22-2005, 01:41 PM
How many MB is the max-sized page? What's the link to the site?
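
If it helps to answer both the page-size and the line-length questions, here is a minimal sketch (not part of PhpDig) that walks a local copy of the site and reports each page's size and its longest line. It assumes the pages are `.html` files on disk; the function names `page_stats` and `report` and the document root path in the usage note are made up for illustration.

```python
from pathlib import Path


def page_stats(root):
    """Return (path, size_in_bytes, longest_line_in_bytes) for each .html file under root."""
    stats = []
    for path in sorted(Path(root).rglob("*.html")):
        data = path.read_bytes()
        # splitlines() on bytes handles \n, \r\n and \r line endings
        longest = max((len(line) for line in data.splitlines()), default=0)
        stats.append((str(path), len(data), longest))
    return stats


def report(root):
    """Print the pages under root, largest first."""
    for path, size, longest in sorted(page_stats(root), key=lambda s: s[1], reverse=True):
        print(f"{size:>10} bytes  longest line {longest:>8} bytes  {path}")
```

Calling `report("/var/www/html")` (substitute the failing site's real document root; that path is only an assumption) lists the largest pages first, so an unusually big page or an extremely long single line stands out right away.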

ffe
01-23-2005, 07:16 AM
None of the pages are particularly large; none is over 50 KB. The output below shows the result of the indexing process. This happens every time.

Spidering in progress... [Stop spider]
SITE : http://tulare.homelinux.net/
Exclude paths :
- @NONE@
1:http://tulare.homelinux.net/index.html
(time : 00:00:05)
+ + + + + + + + + + + + + + + + + + + + +
level 1...
2:http://tulare.homelinux.net/Chance_Phelps.html
(time : 00:00:24)

3:http://tulare.homelinux.net/underway.html
(time : 00:00:29)



The status line at the bottom of the browser screen shows "Done".

Thanks for the interest.