10-28-2003, 09:46 PM   #5
druesome
Nope. Right now PHPdig cannot accept slashes; it only takes the domain part of a URL.

So, if I try to crawl:

http://www.geocities.com/Area51/Space/

it strips out Area51/Space and crawls:

http://www.geocities.com/

which is practically a nightmare.

BUT, I should make it clear that http://www.geocities.com/Area51/Space/ is also crawled. I think the rest of geocities gets crawled because the page contains links to other geocities sites or to the geocities homepage. This is a problem because you lose control over which sites get included, especially if they are hosted on free servers. I have had experience with another search engine software called xavatoria, written in Perl, that addresses this problem quite well. Whichever site I crawl, no matter what folder, it only includes pages within the given folder and the subfolders it links to. So:

http://www.geocities.com/Area51/Space/

is treated as the root URL, and pages like

http://www.geocities.com/Area51/Space/index.html
http://www.geocities.com/Area51/Space/aboutme/
http://www.geocities.com/Area51/Space/contactme/
http://www.geocities.com/Area51/Spac...es/movies.html
http://www.geocities.com/Area51/Spac...tes/books.html

are crawled exclusively.
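
To make the idea concrete, here is a rough sketch of the kind of check I mean. It is written in Python only for illustration; it is not PHPdig's or xavatoria's actual code, and the function name in_scope is something I made up. It assumes the URL you give the spider is treated as a folder, and a link is followed only if it is on the same host and its path starts with that folder.

```python
from urllib.parse import urljoin, urlsplit


def in_scope(root_url, candidate_url):
    """Return True if candidate_url is on the same host as root_url
    and sits at or below root_url's folder (hypothetical helper)."""
    root = urlsplit(root_url)
    # Resolve relative links (e.g. "aboutme/") against the root URL first.
    cand = urlsplit(urljoin(root_url, candidate_url))

    # Different scheme or host: never in scope.
    if (cand.scheme, cand.netloc.lower()) != (root.scheme, root.netloc.lower()):
        return False

    # Treat the root as a folder: keep everything up to the last slash.
    root_dir = root.path[: root.path.rfind("/") + 1] or "/"
    cand_path = cand.path or "/"
    return cand_path.startswith(root_dir)


root = "http://www.geocities.com/Area51/Space/"
for link in ["index.html", "aboutme/", "http://www.geocities.com/", "../Other/"]:
    print(link, in_scope(root, link))
```

With a check like this, links back to the geocities homepage or to other members' folders would simply be skipped instead of being added to the crawl.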

I hope I'm making some sense here.


Thanks.