PhpDig.net

Old 10-27-2003, 05:08 AM   #1
druesome
Orange Mole
 
Join Date: Oct 2003
Posts: 30
What about sites on free hosting?

Hey, I wonder whether PhpDig can handle sites hosted by Geocities, Tripod, etc. I tried to crawl one site on Geocities, but it tried to crawl the whole Geocities network instead! Is there any way around this?
Old 10-27-2003, 08:22 AM   #2
Rolandks
Purple Mole
 
Join Date: Sep 2003
Location: Kassel, Germany
Posts: 119
It works for me.

SITE : http://dsyn2.tripod.com/
Exclude paths :
- @NONE@
1:http://dsyn2.tripod.com/main.html

links found : 8
http://dsyn2.tripod.com/main.html
http://dsyn2.tripod.com/link.html
http://dsyn2.tripod.com/rev.html
http://dsyn2.tripod.com/
http://dsyn2.tripod.com/fic.html
http://dsyn2.tripod.com/rant.html
http://dsyn2.tripod.com/more.html
http://dsyn2.tripod.com/rmore.html
Optimizing tables...
Indexing complete !

-Roland-
Old 10-27-2003, 08:41 AM   #3
druesome
Hey,

What I meant was URLs like:

www.geocities.com/Area51/Space/1109 (non-existent)

or perhaps like

www.brinkster.com/druesome (non-existent)

because that is how their URLs are formatted. I totally forgot that Tripod gives out subdomains, which is why yours worked, Roland.
Old 10-28-2003, 07:18 PM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Does setting PHPDIG_DEFAULT_INDEX to false have any effect?
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 10-28-2003, 09:46 PM   #5
druesome
Nope. Right now PhpDig does not accept a path after the domain as the site root; it only takes the domain part of a URL.

So, if I try to crawl:

http://www.geocities.com/Area51/Space/

it strips out Area51/Space and crawls:

http://www.geocities.com/

which is practically a nightmare.

BUT, I should make it clear that http://www.geocities.com/Area51/Space/ is also crawled. I think the rest of Geocities gets crawled because the page contains links to other Geocities sites or to the Geocities homepage. This is a problem because you lose control over which sites are included, especially when they are hosted on free servers. I have used another search engine package called xavatoria, written in Perl, that handles this problem quite well. Whichever site I crawl, whatever the folder, it only includes pages within the given folder and the subfolders it links to. So:

http://www.geocities.com/Area51/Space/

is treated as the root URL, and pages like

http://www.geocities.com/Area51/Space/index.html
http://www.geocities.com/Area51/Space/aboutme/
http://www.geocities.com/Area51/Space/contactme/
http://www.geocities.com/Area51/Spac...es/movies.html
http://www.geocities.com/Area51/Spac...tes/books.html

are crawled exclusively.
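
The folder-scoped behaviour described above can be pictured as a prefix check on each discovered link. The following is only a minimal sketch of that idea, not PhpDig's or xavatoria's actual code: `isWithinRoot` is a hypothetical helper name, it assumes PHP 7 or later, and it expects full URLs including the `http://` scheme.

```php
<?php
// Hypothetical helper (not part of PhpDig): returns true only when $url
// lives on the same host as $rootUrl and at or below its folder.
function isWithinRoot(string $url, string $rootUrl): bool
{
    $root = parse_url($rootUrl);
    $link = parse_url($url);

    // Reject malformed URLs and URLs without a host (e.g. a missing scheme).
    if (!is_array($root) || !is_array($link)
        || !isset($root['host'], $link['host'])) {
        return false;
    }

    // Host names are case-insensitive; paths are compared case-sensitively.
    if (strtolower($root['host']) !== strtolower($link['host'])) {
        return false;
    }

    // Normalise the root to end in "/" so that /Area51/Space does not
    // accidentally match a sibling such as /Area51/SpaceStation.
    $rootPath = rtrim($root['path'] ?? '/', '/') . '/';
    $linkPath = $link['path'] ?? '/';

    // Accept the folder itself (without trailing slash) or anything below it.
    return $linkPath === rtrim($rootPath, '/')
        || strncmp($linkPath, $rootPath, strlen($rootPath)) === 0;
}
```

A crawler using a check like this would simply skip any link for which it returns false, so links back to the Geocities homepage or to other members' folders would never be queued.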

I hope I'm making some sense here.


Thanks.
Old 10-29-2003, 03:44 AM   #6
Charter
Hi. Yep, that makes sense. As a temporary solution, how about setting the index levels to one when crawling those links via browser or text file?
Old 11-05-2003, 11:45 AM   #7
druesome
Hmm.. can you show me how to do that through a text file?
Old 11-07-2003, 11:35 AM   #8
Charter
Hi. Perhaps try the following.

Set the following in the config file:
PHP Code:
define('SPIDER_MAX_LIMIT',1);       // default = 20
define('SPIDER_DEFAULT_LIMIT',1);   // default = 3
define('RESPIDER_LIMIT',1);         // default = 4 
Also, set the following in the config file:
PHP Code:
define('LIMIT_DAYS',0);             // default = 7 
Then make a text file with a list of full URLs, one per line, and try indexing from shell.
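
As a rough illustration of the text-file step, the list could be generated with a few lines of PHP. This is a sketch under assumptions: `buildUrlList` and the file name `urls.txt` are made up for the example, the two URLs are the placeholder addresses from earlier in this thread, and PHP 7 or later is assumed.

```php
<?php
// Hypothetical helper: turns a list of addresses into the
// "one full URL per line" format described above.
function buildUrlList(array $urls): string
{
    $lines = [];
    foreach ($urls as $url) {
        $url = trim($url);
        if ($url === '') {
            continue; // skip blank entries
        }
        // Add a scheme when it is missing, so every line is a full URL.
        if (!preg_match('#^https?://#i', $url)) {
            $url = 'http://' . $url;
        }
        $lines[] = $url;
    }
    $lines = array_unique($lines); // drop exact duplicates
    return $lines ? implode("\n", $lines) . "\n" : '';
}

// Placeholder addresses taken from this thread (non-existent sites).
$list = buildUrlList([
    'www.geocities.com/Area51/Space/1109',
    'www.brinkster.com/druesome',
]);

// Write the list next to the script; the file name is an assumption.
file_put_contents('urls.txt', $list);
```

The resulting file would then be passed to the spider from the shell, for example with something like `php -f spider.php urls.txt` run from the PhpDig admin directory; check the PhpDig documentation for the exact invocation on your install.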
Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright © 2001 - 2005, ThinkDing LLC. All Rights Reserved.