What about sites on free hosting?
Hey, I wonder if PhpDig can accommodate sites such as those hosted by Geocities or Tripod. I tried to crawl one site on Geocities, but instead it tried to crawl the whole Geocities network! Is there any way around this?
It works for me. :confused:
SITE : http://dsyn2.tripod.com/
Exclude paths :
- @NONE@
1:http://dsyn2.tripod.com/main.html
links found : 8
http://dsyn2.tripod.com/main.html
http://dsyn2.tripod.com/link.html
http://dsyn2.tripod.com/rev.html
http://dsyn2.tripod.com/
http://dsyn2.tripod.com/fic.html
http://dsyn2.tripod.com/rant.html
http://dsyn2.tripod.com/more.html
http://dsyn2.tripod.com/rmore.html
Optimizing tables...
Indexing complete !

-Roland-
Hey,
What I meant are URLs like www.geocities.com/Area51/Space/1109 (non-existent) or www.brinkster.com/druesome (non-existent), because that is how their URLs are formatted. I totally forgot that Tripod gave out subdomains; that's why yours worked, Roland. :)
Hi. Does setting PHPDIG_DEFAULT_INDEX to false have any effect?
Nope. Right now PhpDig cannot accept slashes; it only takes the domain part of a URL.

So if I try to crawl http://www.geocities.com/Area51/Space/, it strips out Area51/Space and crawls http://www.geocities.com/, which is practically a nightmare. ;) But I should make it clear that http://www.geocities.com/Area51/Space/ is also crawled. I think the reason the rest of Geocities gets crawled is that the page contains links to other Geocities sites or to the Geocities homepage. This is a problem because you can lose control over which sites to include, especially if they are hosted on free servers.

I have had experience with another search engine, called Xavatoria and written in Perl, that addresses this problem quite well. Whichever site I crawl, no matter what folder, it only includes pages within the given folder and the subfolders it links to. So http://www.geocities.com/Area51/Space/ is treated as the root URL, and pages like

http://www.geocities.com/Area51/Space/index.html
http://www.geocities.com/Area51/Space/aboutme/
http://www.geocities.com/Area51/Space/contactme/
http://www.geocities.com/Area51/Spac...es/movies.html
http://www.geocities.com/Area51/Spac...tes/books.html

are crawled exclusively. I hope I'm making some sense here. :) Thanks.
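For reference, the folder-scoped behavior described above boils down to a prefix test on every discovered link. Here is a minimal illustrative sketch of the idea; the function name and logic are mine, not actual Xavatoria or PhpDig code:

PHP Code:

<?php
// Minimal sketch: accept a link only if it falls under the base folder.
function in_base_folder($base, $link)
{
    // Treat everything up to (and including) the last slash as the root,
    // e.g. http://www.geocities.com/Area51/Space/
    $prefix = substr($base, 0, strrpos($base, '/') + 1);
    return strncmp($link, $prefix, strlen($prefix)) == 0;
}

$base = 'http://www.geocities.com/Area51/Space/';
var_dump(in_base_folder($base, $base . 'aboutme/'));          // bool(true)
var_dump(in_base_folder($base, 'http://www.geocities.com/')); // bool(false)
?>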
Hi. Yep, that makes sense. As a temporary solution, how about setting the index levels to one when crawling those links via the browser or a text file?
Hmm.. can you show me how to do that through a text file?
Hi. Perhaps try the following.
Set the following in the config file:
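A sketch, assuming the depth-limit constants found in a typical PhpDig config.php (SPIDER_MAX_LIMIT, SPIDER_DEFAULT_LIMIT, RESPIDER_LIMIT); verify the names against your own copy:

PHP Code:

// config.php — force a crawl depth of one so the spider
// stays on the start page and the pages it links to directly
define('SPIDER_MAX_LIMIT', 1);     // maximum depth offered when spidering
define('SPIDER_DEFAULT_LIMIT', 1); // depth used by default
define('RESPIDER_LIMIT', 1);       // depth used when respidering a site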
Then put the links you want indexed in a text file and feed that file to the spider from the shell:
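A sketch of what that might look like, assuming spider.php accepts a path to a text file of URLs when run from the command line (check the PhpDig documentation for the exact invocation); the file name and paths here are just examples:

Code:

# links.txt — one start URL per line
http://www.geocities.com/Area51/Space/
http://www.brinkster.com/druesome

Code:

php -f spider.php /full/path/to/links.txt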