index only HTML files


bigals
11-25-2003, 07:30 AM
I have indexed my site and it indexes .html and .swf files.
It also indexes the file directory, i.e. '-'.

But I just want the HTML files to be indexed. Is there a way of setting this? If so, how and where? I can't find it anywhere.

The '-' index links are the biggest problem; the swf files don't really matter.

Can anyone please help me?

cheers,

alex

Charter
11-25-2003, 09:17 AM
Hi. You might try adding a robots.txt file in web root with the following, assuming it's the index.html to the main site that you don't want to crawl:

User-agent: PhpDig
Disallow: /index.html
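
As a rough way to sanity-check a rule like this before uploading it, here is a minimal sketch using Python's standard `urllib.robotparser`. It only shows how that parser reads the rule; PhpDig's own robots.txt handling may differ. The helper name `allowed` and the `norwich.html` URL are just for illustration. Note that the path is written with a leading slash, which is what the robots.txt convention expects.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse in-memory lines, no network access needed
    return parser.can_fetch(agent, url)


rules = [
    "User-agent: PhpDig",
    "Disallow: /index.html",
]

# The main index page is blocked for the PhpDig user agent...
print(allowed(rules, "PhpDig", "http://www.domain.com/index.html"))
# ...while other pages stay crawlable.
print(allowed(rules, "PhpDig", "http://www.domain.com/norwich.html"))
```

The rule only matches URLs whose path starts with `/index.html`, so it would not by itself stop a crawler from fetching a bare directory URL such as `http://www.domain.com/cars/`.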

To remove the '-' index links that were crawled, go to the admin panel, click a site, click the update button, click a blue arrow, and on the right side, click a red X for those links you want to delete.

Another option, if you have shell access, would be to crawl via command line using a text file, where only the links you want crawled are in the text file, one per line. There are three options in the config file (SPIDER_MAX_LIMIT, SPIDER_DEFAULT_LIMIT, RESPIDER_LIMIT) that can be set to limit the number of levels crawled when using shell to index.
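
If the pages change often, the text file could be regenerated by a small script rather than by hand. The following is only a sketch under assumptions: it presumes you have a local copy of the web root, and the function names, the base URL and the output filename are all made up for the example. It collects every .html file and writes one URL per line, leaving out folders and other file types.

```python
from pathlib import Path


def html_urls(web_root, base_url):
    """List the URL of every .html file under web_root, in sorted order."""
    root = Path(web_root)
    urls = []
    for path in sorted(root.rglob("*.html")):
        # Turn the local path into a URL path relative to the web root.
        rel = path.relative_to(root).as_posix()
        urls.append(base_url.rstrip("/") + "/" + rel)
    return urls


def write_url_list(web_root, base_url, out_file):
    """Write the URLs to out_file, one per line, and return them."""
    urls = html_urls(web_root, base_url)
    Path(out_file).write_text("\n".join(urls) + "\n", encoding="utf-8")
    return urls
```

For example, `write_url_list("/path/to/site", "http://www.domain.com/", "urls.txt")` would produce a file that lists only HTML pages, so directory URLs such as `cars/` never appear in it.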

bigals
11-26-2003, 12:41 AM
Cheers. I meant that within each folder it spiders there are three results, for example:

-
hello.txt
hello.swf

The top result is just a '-', but it links to the folder itself, so you get a kind of FTP-style page, not an HTML page. The swf doesn't really matter because I don't think it appears as a result in any searches.

But when I said index, I meant the FTP-style listing of the folder in question. Does that make sense? Surely there is a way of telling PhpDig to ONLY index HTML files, and no folders or files without the .html file type.

Hope this makes more sense,

alex.

Charter
11-26-2003, 08:16 AM
Hi. Is there a link from dir/filename.html to just dir/ in the filename.html files? What are the filenames of the html files? You might try setting a .htaccess file in web root with the following as the first line:

Options -Indexes

For the swf files, try adding swf to the FORBIDDEN_EXTENSIONS list in the config file.
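
To check whether a given page is one of these auto-generated directory listings (for example after fetching dir/ in a browser and saving the source), a small test on the page title can help. This is a sketch that assumes Apache's default listing format, where the title starts with "Index of /"; other servers or customised listings will look different, and the function name is made up for the example.

```python
import re

# Apache's default auto-generated listings use a title like "Index of /cars".
TITLE_RE = re.compile(r"<title>\s*Index of\s+/", re.IGNORECASE)


def looks_like_directory_listing(html):
    """Return True if the HTML source looks like a default Apache directory listing."""
    return bool(TITLE_RE.search(html))


print(looks_like_directory_listing("<html><head><title>Index of /cars</title></head></html>"))
print(looks_like_directory_listing("<html><head><title>Cars in Norwich</title></head></html>"))
```

If dir/ no longer returns a page like this once directory listings are switched off, the crawler has nothing to index at that URL.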

bigals
11-26-2003, 02:50 PM
I don't think there are links from filenames.html to dir/.
Couldn't I add '-' to the forbidden extensions list, or will that just mess it all up?

The HTML files are named by regions and towns in England, e.g. 'norwich.html'. They are not called 'index.html', if that's what you were thinking.

Do you get these directory indexes in your spider results?

alex

Charter
11-26-2003, 03:32 PM
Hi. I wouldn't add '-' to the forbidden extensions because it isn't an extension; it's just a representation for domain.com. Yes, I do get '-' in my results. Did using the .htaccess file work?

bigals
11-26-2003, 03:38 PM
I don't know how to get .htaccess files made or added to my site root. Do you get the 'index of blah blah blah' in your search results?

If I type 'index of' into my search field and click go, I get a huge list of search results made up of the pages I don't want listed. Do you get the same?

Charter
11-26-2003, 03:58 PM
Hi. No, I don't get that because I don't allow directory listings. The attached zip file contains a .htaccess file. Just FTP the .htaccess file to your web root in ASCII mode.

bigals
11-26-2003, 04:01 PM
Cheers. OK, I'll have to ask my domain hosts to tell me what my root is, because they have set it up and may have changed things round a bit. I'll reply and tell you how it goes, cheers!

alex

Charter
11-26-2003, 04:04 PM
Hi. The place to FTP the file is the same place the main index.html file would go for your site. For instance, if your main site page is domain.com/index.html, then the web root is where this index.html file resides.

bigals
11-27-2003, 01:32 AM
Aargghh!!! If I place the .htaccess file on the server, it restricts access to the PhpDig administration page. Even if I rename the index.php page, it still won't allow access.

If that had worked it would've been cool, sorry.
Any other ideas on how to avoid this problem? I'd deal with it normally, but the index directories get in the way of the actual relevant results of the search, you see.

cheers,

alex

Charter
11-27-2003, 08:03 AM
Hi. Another option would be to make one filename.html that links to the files that you want crawled and index filename.html at level one. After the index is done, just go to the admin panel, click a site, click the update button, click a blue arrow, and delete the '-' on the right hand side.
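
Since that link page would need updating whenever pages are added, it could be rebuilt by a script. Here is a rough sketch, assuming a local copy of the web root; the function name, the page title and the default name filename.html are only placeholders. It writes one page that links to every .html file and to nothing else, so crawling it at level one should reach only those pages.

```python
from html import escape
from pathlib import Path


def build_link_page(web_root, page_name="filename.html"):
    """Write a page linking to every .html file under web_root; return the link count."""
    root = Path(web_root)
    items = []
    for path in sorted(root.rglob("*.html")):
        rel = path.relative_to(root).as_posix()
        if rel == page_name:
            continue  # do not link the page to itself
        items.append(f'<li><a href="{escape(rel, quote=True)}">{escape(rel)}</a></li>')
    page = (
        "<html><head><title>Crawl list</title></head><body><ul>\n"
        + "\n".join(items)
        + "\n</ul></body></html>\n"
    )
    (root / page_name).write_text(page, encoding="utf-8")
    return len(items)
```

Running it again after new pages are created overwrites the old link page, so it can be scheduled or run by hand before each reindex.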

bigals
12-01-2003, 10:34 AM
Sorry for the long wait,

but I have constantly changing HTML pages, with new ones created regularly, so I need to somehow stop the index directories from being indexed. There must be something in PhpDig that tells the engine to search and display these pages, or else they wouldn't be indexed.

Who may know how to find and disable such a function?

Thank you,

alex

Charter
12-01-2003, 12:13 PM
Do your HTML pages link to the index directories?

bigals
12-01-2003, 12:25 PM
The pages I have made do not link to the index directory. I was wondering the same thing.

The link created by the spider/indexer is a link to the directory, not an HTML file. PhpDig is finding the index of a folder and displaying it as a link.

Here is an example I have taken from the site:

http://www.robotstxt.org/wc/

The '-' I get links to pages like the one above.

So if I search for my 'cars' HTML page on my site, I get results that link to addresses like:
(these are made up examples)

'cars/cars.html' and 'cars/'

Charter
12-01-2003, 02:55 PM
Hi. The '-' just indicates an index or default filename with an html, htm, php, asp, or phtml extension. The PHPDIG_DEFAULT_INDEX can be set to false in the config file.

I set up a directory structure like so to test:

http://www.domain.com/
---- index.html
---- test/
-------- index.html
-------- test.html
-------- test2.html

The test.html file linked to the test2.html file. No other links were present. When I crawled http://www.domain.com/test/test.html, only text from the test.html and test2.html files was found in the search results. There was no '-' or text from the index.html files in the search results.

Depending on the number of levels crawled, maybe there is a link to a link to a link to the directory listings page? If you have shell access, crawling from the command line is also an option.
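
To illustrate the "link to a link to a link" idea, here is a small sketch that works out how many levels deep each URL is from the starting page, given a table of which page links to which. The link table is entirely made up for the example (only 'cars/cars.html' and 'cars/' echo the examples earlier in the thread), and this is not how PhpDig itself is implemented; it is just a breadth-first walk over links.

```python
from collections import deque


def link_depths(links, start):
    """Return the smallest number of link hops from start to each reachable URL."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest route
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical link table: page -> pages it links to.
links = {
    "cars/cars.html": ["cars/ford.html"],
    "cars/ford.html": ["cars/parts.html"],
    "cars/parts.html": ["cars/"],
}

depths = link_depths(links, "cars/cars.html")
print(depths["cars/"])  # 3
```

In this made-up case the directory URL is three hops away, so a crawl limited to two levels would never reach it, while a crawl at 20 levels would.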

bigals
12-01-2003, 03:33 PM
Fantastic, I'll try that and get back to you soon, probably tomorrow morning if I get a chance. Thanks!

bigals
12-02-2003, 03:21 AM
I deleted the site index and reindexed the whole thing with the setting set to false in the config file, but it still returned all the '-' links.

I'll try again, just in case the config.php file was not refreshed properly on the server.

I have been indexing at the full 20 levels, so do you think that if I index fewer levels it may only pick up the normal links, based on the assumption that there is a link to the index pages some 3 links deep or something?

What do you reckon?

cheers,

ALEX