Not able to index [some site]...


BernhardG
10-10-2003, 12:10 AM
Hi,

I think I recently found a bug in the indexer. Yesterday I tried to index the site http://www.rover-club-berlin.com/ . It is not possible to index this site completely (about 550 pages); only the first page gets indexed. I suspect the cause is that the website author did not use correct markup. Other indexers (phpCMS indexer, isearch) can spider this site correctly, so I have come to the conclusion that there is some bug in phpDig.
I also have another problem with the site http://www.rover-club-hessen.de/ . It is possible to index the first level, but the pages below "Mitglieder" (Members), for example, are not indexed. At the moment I have no idea where the bug is. I don't know if it is a bug in phpDig or bad markup.
I was able to index this site completely with the indexers of phpCMS and isearch.

Bernhard

Rolandks
10-10-2003, 10:07 AM
The first site has 34 BAD errors in the W3C Validator and is full of JAVA :rolleyes:

Line 25, column 17: "FRAMESET" not finished but containing element ended

Line 16, column 15: end tag for "HEAD" which is not finished

- a bit too much for phpDig ;)

Second Site works fine:


links found : 40
http://www.rover-club-hessen.de/
http://www.rover-club-hessen.de/NOMATCH
http://www.rover-club-hessen.de/burningbook/guestbook.php
http://www.rover-club-hessen.de/rchevo_2/html/about.htm
http://www.rover-club-hessen.de/burningbook/?page=2
http://www.rover-club-hessen.de/burningbook/?page=3
http://www.rover-club-hessen.de/burningbook/gbae.php
http://www.rover-club-hessen.de/rchevo_2/html/members.htm
http://www.rover-club-hessen.de/rchevo_2/html/meetings.htm
http://www.rover-club-hessen.de/rchevo_2/html/forum.htm
http://www.rover-club-hessen.de/rchevo_2/html/tutorials.htm
http://www.rover-club-hessen.de/rchevo_2/html/links.htm
http://www.rover-club-hessen.de/guestbook.php
http://www.rover-club-hessen.de/rchevo_2/
http://www.rover-club-hessen.de/burningbook/guestbook.php?a20198
http://www.rover-club-hessen.de/burningbook/?page=1
http://www.rover-club-hessen.de/burningbook/
http://www.rover-club-hessen.de/burningbook/help.php
http://www.rover-club-hessen.de/rchevo_2/html/spacecake.htm
http://www.rover-club-hessen.de/rchevo_2/html/geisterfahrer.htm
http://www.rover-club-hessen.de/rchevo_2/html/dermeister.htm
http://www.rover-club-hessen.de/rchevo_2/html/hometown.htm
http://www.rover-club-hessen.de/rchevo_2/html/disasterman.htm
http://www.rover-club-hessen.de/rchevo_2/html/dirty-t.htm
http://www.rover-club-hessen.de/rchevo_2/html/thunderdome.htm
http://www.rover-club-hessen.de/rchevo_2/html/thunderdine.htm
http://www.rover-club-hessen.de/rchevo_2/html/englischepatient.htm
http://www.rover-club-hessen.de/rchevo_2/html/fastrabbit.htm
http://www.rover-club-hessen.de/rchevo_2/html/butterflyel.htm
http://www.rover-club-hessen.de/rchevo_2/html/frosty.htm
http://www.rover-club-hessen.de/rchevo_2/html/joker.htm
http://www.rover-club-hessen.de/rchevo_2/html/dragon.htm
http://www.rover-club-hessen.de/rchevo_2/html/treffen011003.htm
http://www.rover-club-hessen.de/rchevo_2/html/oldtimershow.htm
http://www.rover-club-hessen.de/rchevo_2/html/treffen053103.htm
http://www.rover-club-hessen.de/rchevo_2/html/tutorial_1.htm
http://www.rover-club-hessen.de/rchevo_2/html/tutorial_2.htm
http://www.rover-club-hessen.de/rchevo_2/html/roverlinks.htm
http://www.rover-club-hessen.de/rchevo_2/html/clublinks.htm
http://www.rover-club-hessen.de/rchevo_2/html/tuninglinks.htm
Optimizing tables...
Indexing complete !

BernhardG
10-10-2003, 02:40 PM
Hi Roland,

I know that the first page has many errors, and I swear by God I did not write this page, but the problem is that other indexers could fetch the site and phpDig could not. I'll try to find the bug myself now.
By the way, I think it would be best if an indexer simply searched for anything that contains src="URL" or href="URL" enclosed by < and >. That should make every problem with invalid markup go away. The drawback is that there could be false positives in the generated URL table.
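
To make the idea concrete, here is a minimal sketch in Python. It is not phpDig's code and not how phpDig extracts links. The names TAG_RE, ATTR_RE and extract_links, as well as the example.com base URL, are made up for illustration. It ignores document structure completely and only looks for href or src attributes inside anything that starts with < and ends with >, so an unclosed FRAMESET or HEAD does not matter.

```python
import re
from urllib.parse import urljoin

# Anything between "<" and the next ">" counts as a tag, valid or not.
TAG_RE = re.compile(r"<[^<>]+>")

# href="..." or src="..." with single or double quotes, case-insensitive.
# Unquoted attribute values are deliberately not handled in this sketch.
ATTR_RE = re.compile(
    r"""\b(?:href|src)\s*=\s*(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html, base_url):
    """Return absolute URLs found in href/src attributes, in first-seen order."""
    links = []
    seen = set()
    for tag in TAG_RE.findall(html):
        for _quote, raw in ATTR_RE.findall(tag):
            # Resolve relative links against the page URL.
            url = urljoin(base_url, raw.strip())
            if url not in seen:
                seen.add(url)
                links.append(url)
    return links


if __name__ == "__main__":
    # Hypothetical broken markup: FRAMESET and HEAD are never closed properly.
    broken = (
        '<HTML><HEAD><FRAMESET cols="20%,80%">'
        '<FRAME SRC="menu.htm"><FRAME src=\'main.htm\'></HEAD>'
        '<a href="/members.htm">Mitglieder</a>'
    )
    for link in extract_links(broken, "http://example.com/site/"):
        print(link)
```

As said, this will produce false positives: it also picks up images, stylesheets, links inside HTML comments and attributes such as data-src, so the resulting URL table would still need filtering.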

The second site was updated recently, so it is possible that some errors have been corrected by now.

Anyway phpDig is a great project!

Bernhard

BernhardG
10-11-2003, 03:18 AM
Hi!

After some testing with various settings I managed to index http://www.rover-club-berlin.com/ . The only change I needed was setting PHPDIG_DEFAULT_INDEX to false.

Bernhard