PhpDig.net (http://www.phpdig.net/forum/index.php)
-   Troubleshooting (http://www.phpdig.net/forum/forumdisplay.php?f=22)
-   -   What about sites on free hosting? (http://www.phpdig.net/forum/showthread.php?t=179)

druesome 10-27-2003 05:08 AM

What about sites on free hosting?
 
Hey, I wonder if PhpDig can accommodate sites such as those hosted by Geocities or Tripod. I tried to crawl one site on Geocities, but instead it tried to crawl the whole Geocities network!! Is there any way around this?

Rolandks 10-27-2003 08:22 AM

It works for me. :confused:

SITE : http://dsyn2.tripod.com/
Exclude paths :
- @NONE@
1:http://dsyn2.tripod.com/main.html

links found : 8
http://dsyn2.tripod.com/main.html
http://dsyn2.tripod.com/link.html
http://dsyn2.tripod.com/rev.html
http://dsyn2.tripod.com/
http://dsyn2.tripod.com/fic.html
http://dsyn2.tripod.com/rant.html
http://dsyn2.tripod.com/more.html
http://dsyn2.tripod.com/rmore.html
Optimizing tables...
Indexing complete !

-Roland-

druesome 10-27-2003 08:41 AM

Hey,

What I meant were URLs like:

www.geocities.com/Area51/Space/1109 (non-existent)

or perhaps like

www.brinkster.com/druesome (non-existent)

because that is how their URLs are formatted. I totally forgot that Tripod gives out subdomains; that's why yours worked, Roland. :)

Charter 10-28-2003 07:18 PM

Hi. Does setting PHPDIG_DEFAULT_INDEX to false have any effect?
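That is, something like this in the config file (just to illustrate the setting; defaults and placement may differ by version):

PHP Code:

define('PHPDIG_DEFAULT_INDEX',false);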

druesome 10-28-2003 09:46 PM

Nope. Right now PhpDig cannot accept slashes; it only takes the domain part of a URL.

So, if I try to crawl:

http://www.geocities.com/Area51/Space/

it strips out Area51/Space and crawls:

http://www.geocities.com/

which is practically a nightmare.. ;)

BUT, I should make it clear that http://www.geocities.com/Area51/Space/ is also crawled. I think the reason the rest of Geocities gets crawled is that the page contains links to other Geocities sites or to the Geocities homepage. This is a problem because you may lose control over which sites to include, especially if they are hosted on free servers.

I have had experience with another search engine, Xavatoria, which is written in Perl and addresses this problem quite well. Whichever site I crawl, no matter what folder, it only includes pages within the given folder and the subfolders it links to (a rough sketch of the idea follows the list below). So:

http://www.geocities.com/Area51/Space/

is treated as the root URL, and pages like

http://www.geocities.com/Area51/Space/index.html
http://www.geocities.com/Area51/Space/aboutme/
http://www.geocities.com/Area51/Space/contactme/
http://www.geocities.com/Area51/Spac...es/movies.html
http://www.geocities.com/Area51/Spac...tes/books.html

are crawled exclusively.
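Something like this folder-prefix check is what I mean (a hypothetical PHP sketch of the idea, not actual Xavatoria or PhpDig code):

PHP Code:

// Treat the starting folder as the crawl root: a discovered link is
// kept only when it begins with that folder's URL prefix.
function within_base($url, $base)
{
    return strncmp($url, $base, strlen($base)) === 0;
}

$base = 'http://www.geocities.com/Area51/Space/';
var_dump(within_base('http://www.geocities.com/Area51/Space/aboutme/', $base)); // true
var_dump(within_base('http://www.geocities.com/', $base));                      // false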

I hope I'm making some sense here.
:)

Thanks.

Charter 10-29-2003 03:44 AM

Hi. Yep, that makes sense. As a temporary solution, how about setting the index levels to one when crawling those links via the browser or a text file?

druesome 11-05-2003 11:45 AM

Hmm.. can you show me how to do that through a text file?

Charter 11-07-2003 11:35 AM

Hi. Perhaps try the following.

Set the following in the config file:
PHP Code:

define('SPIDER_MAX_LIMIT',1);       // maximum crawl depth; default = 20
define('SPIDER_DEFAULT_LIMIT',1);   // default crawl depth; default = 3
define('RESPIDER_LIMIT',1);         // crawl depth on respider; default = 4

Also, set the following in the config file:
PHP Code:

define('LIMIT_DAYS',0);             // days before a page is respidered; default = 7

Then make a text file with a list of full URLs, one per line, and try indexing from shell.
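For example, using the (non-existent) URLs from earlier, the text file and shell run might look like this (assuming your install's spider.php accepts a file path when run from the admin directory; adjust paths to your setup):

Code:

# urls.txt -- one full URL per line
http://www.geocities.com/Area51/Space/
http://www.brinkster.com/druesome/

# then, from the phpdig admin directory:
php -f spider.php /path/to/urls.txt

With the limits above set to one, each URL in the file should be indexed without following links deeper into the host.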

