Choosy about domains? [Archive]

druesome

10-19-2003, 08:40 AM

Hi, for the last few days I've been spidering without a single hitch, until today. The last website I tried to spider is on the .ph domain, and I wonder if that could be the reason it could not be spidered. If you could try it out for me, the URL is http://www.birdwatch.ph

Lastly, I also noticed that when I spider a site hosted under Geocities, the site_url becomes www.geocities.com without the folder where the site really lives (e.g. www.geocities.com/mysite). Is there a way around this? It may seem like a weird request, but I really need it to work this way because I'm working on a hack that will benefit from it. Thanks in advance!

Charter

10-19-2003, 11:31 AM

Hi. What message did you get when you tried to crawl birdwatch.ph? Does setting PHPDIG_DEFAULT_INDEX to false in the config file have any effect?

druesome

10-19-2003, 09:26 PM

I already tried that yesterday, but it didn't work. When I try to spider the site, it times out and it looks as if nothing has happened. When I refresh the admin page, the URL has been added to the list, but no page has been crawled.

Any ideas about my other question? Thanks.

bloodjelly

04-19-2004, 05:47 PM

I'm actually curious about druesome's second question as well. I found this thread while searching for the answer, but there's no answer yet. Why does phpDig strip the folder name from a site's URL when it stores it? I just spidered http://gino.go-gaia.com/forum and it worked well, sticking to that directory, but in the admin panel the link has the forum directory removed. Sorry if this is an easy question, but can I make phpDig leave the URL I spidered as it is, so that if I spider http://gino.go-gaia.com/forum then that exact URL ends up in the sites table? Thanks.

Charter

04-20-2004, 12:49 PM

Hi. As for birdwatch.ph, what do you get onscreen when you uncomment //print $answer."<br>\n"; in the robot_functions.php file?

WRT the admin index page, it shows only the site, i.e. the domain or subdomain as the case may be. This is based on parse_url (http://www.php.net/manual/en/function.parse-url.php) (see the code below). To view the directories/branches for a specific (sub)domain, just click the site and then click the update button.

<?php

$link = "http://foo.domain.com/dir1/dir2/dir3/file.php?a=b&c=d#anchor";
print_r(parse_url($link));

/* start output
Array
(
    [scheme] => http
    [host] => foo.domain.com
    [path] => /dir1/dir2/dir3/file.php
    [query] => a=b&c=d
    [fragment] => anchor
)
end output */

// foo.domain.com gets stored as http://foo.domain.com/

?>
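
To make the split concrete, here is a minimal sketch of separating a URL into the part shown as the "site" and the remaining path. The function name split_site_and_path is hypothetical and is not taken from phpDig's source; it only illustrates the parse_url behaviour described above, and it assumes an absolute URL.

```php
<?php
// Hypothetical helper (not part of phpDig): splits an absolute URL into
// a site root such as "http://host/" and the path below that root.
function split_site_and_path(string $url): array
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }

    // Site root: scheme and host (plus port if given), with a trailing slash.
    $site = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $site .= ':' . $parts['port'];
    }
    $site .= '/';

    // Everything after the leading slash; query and fragment are ignored here.
    $path = ltrim($parts['path'] ?? '', '/');

    return ['site' => $site, 'path' => $path];
}

$result = split_site_and_path('http://www.geocities.com/mysite/index.html');
print_r($result);
```

With a Geocities-style URL, the folder ends up in the 'path' element rather than in 'site', which matches what the admin index page displays.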

bloodjelly

04-20-2004, 04:52 PM

How about if I wanted to store the directory information exactly as entered in the spider script in the spider's "sites" table? Or am I missing something...

Charter

04-20-2004, 05:28 PM

Hi. To get a feel for how it works, look through the tables and see how the domain is stored in the sites table and path/file info is stored in the spider/tempspider/excludes tables, and then search the robot_functions.php file for the parse_url function.
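
For anyone experimenting with keeping the directory, the following is a rough sketch of how a URL could be reduced to "host plus directory" instead of host only. It is an assumption-laden illustration, not phpDig code: the function name site_with_directory is made up, and the rule "a last segment containing a dot is a file" is a simple heuristic. Any real change would also have to account for how the spider, tempspider and excludes tables store path and file info.

```php
<?php
// Hypothetical helper (not part of phpDig): keeps the directory part of
// an absolute URL, dropping any file name, query string and fragment.
// The port, if any, is ignored to keep the sketch short.
function site_with_directory(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }

    $path = $parts['path'] ?? '/';
    if (substr($path, -1) !== '/') {
        $last = basename($path);
        if (strpos($last, '.') !== false) {
            // Heuristic: a last segment with a dot is treated as a file name.
            $path = substr($path, 0, strlen($path) - strlen($last));
        } else {
            // Otherwise treat the last segment as a directory.
            $path .= '/';
        }
    }

    return $parts['scheme'] . '://' . $parts['host'] . $path;
}

echo site_with_directory('http://gino.go-gaia.com/forum'), "\n";
```

The heuristic will misjudge directories whose names contain a dot, so treat it as a starting point only.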

bloodjelly

04-30-2004, 12:39 AM

Thanks Charter - my host lost all MySQL for about a week (no explanation why) so I haven't been able to try this, but I will ASAP. Thanks for pointing me in the right direction.