Index update question [Archive]

Gecko

10-03-2003, 04:56 AM

Hi, I recently found PhpDig while searching for a good site search engine for my remotely hosted website, and I am currently configuring it to suit my needs. I still have a couple of questions, so I hope someone here can help me along:

- A full index takes hours, and besides spidering and indexing all links, it also finds and indexes all files in the subdirectories. I just want it to spider the links, because the files themselves are not complete. Served through the website, the files are embedded in PHP- and CSS-driven templates in order to serve complete HTML pages. Is there any way that I can tell PhpDig not to find and index files, but just to spider and index all links in HTML pages?

- What happens when I embed links between the <!-- phpdigExclude --> and <!-- phpdigInclude --> tags? Are hyperlinks placed between those tags ignored for spidering? Hope so!

- Right now, PhpDig also indexes words and picks up links that are embedded in HTML remark tags (<!-- ... -->). Too bad.

- Wouldn't it be an idea if you could configure PhpDig with a list of files and directories to ignore? Then the spider does not have to spider everything in order to find out that certain pages are not to be indexed when the META ROBOTS tag tells it.

- Is there a possibility to add separate files to the index through the web interface? I have a news service on my site which is driven by a single PHP file. Right now it looks like I have to spider the entire news directory whenever I want to add new files to the index. This causes PhpDig to spider 900+ pages right now, and over 1200 next year etc.

Rolandks

10-03-2003, 06:50 AM

Originally posted by Gecko
- Is there any way that I can tell PhpDig not to find and index files, but just to spider and index all links in html pages?

Yes, exclude these parts with <!-- phpdigExclude --> and <!-- phpdigInclude -->. PhpDig is a search engine; how should it know which parts you want indexed and which not? Only by excluding!

- What happens when I embed links between the <!-- phpdigExclude --> and <!-- phpdigInclude --> tags? Are hyperlinks placed between those tags ignored for spidering? Hope so!

Yes, it works IMHO, try it with one page.

- Right now, PhpDig also indexes words and picks up links that are embedded in HTML remark tags (<!-- ... -->). Too bad.

You are using PHP > 4.3.2 :D It is a bug, see: Indexing HTML-Comments (http://www.phpdig.net/showthread.php?s=&threadid=85)

- Wouldn't it be an idea if you could configure PhpDig with a list of files and directories to ignore? Then the spider does not have to spider everything in order to find out that certain pages are not to be indexed when the META ROBOTS tag tells it.

It is a feature request :rolleyes:. But: PhpDig tries to read a robots.txt file at the server root. It looks for META ROBOTS tags too. Another workaround: create a robots.txt with all directories to ignore (Disallow: /my_dir/), dig the site, then delete robots.txt :)

- Is there a possibility to add separate files to the index through the web interface? I have a news service on my site which is driven by a single PHP file. Right now it looks like I have to spider the entire news directory whenever I want to add new files to the index. This causes PhpDig to spider 900+ pages right now, and over 1200 next year etc.
Create an index file containing all the links and index this file; after that, delete this index file in the Update form.
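
The index file suggested above could be generated rather than written by hand. The following is only a minimal sketch in Python, not part of PhpDig: the function name build_link_page and the example.com news URLs are hypothetical, and it simply produces a bare HTML page with one hyperlink per new page so the spider has something to follow.

```python
from html import escape


def build_link_page(urls, title="New pages to index"):
    """Return a minimal HTML page containing one hyperlink per URL."""
    items = "\n".join(
        # escape() keeps '&' in query strings valid inside the href attribute
        f'<li><a href="{escape(u, quote=True)}">{escape(u)}</a></li>'
        for u in urls
    )
    return (
        "<html><head><title>" + escape(title) + "</title></head>\n"
        "<body>\n<ul>\n" + items + "\n</ul>\n</body></html>\n"
    )


if __name__ == "__main__":
    # Hypothetical URLs of the newly added news items
    new_pages = [
        "http://www.example.com/news.php?id=901",
        "http://www.example.com/news.php?id=902",
    ]
    print(build_link_page(new_pages))
```

Upload the resulting page, index it once, and remove it again afterwards as described above.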

Gecko

10-03-2003, 08:37 AM

Roland, thank you for your advice! :D I think this might just be the trick to speed up the spider process and avoid having to remove hundreds of files by hand every time.

I have one question about your answer on my first question, though:

Originally posted by Rolandks
Yes, exclude these parts with <!-- phpdigExclude --> and <!-- phpdigInclude -->. PhpDig is a search engine; how should it know which parts you want indexed and which not? Only by excluding!

I think this will not work. :( Let me give you an example. In one of my subdirs I have a PHP script called show.php. This script calls the files in that subdir and merges them with my template files (show.php?link=a etc.) in order to produce complete HTML output. The dir and its deeper subdirs also contain files called a.htm, b.htm etc. These files are only called by the PHP script; there are no direct links to them from other HTML pages on my site. Yet they ARE found and indexed by PhpDig (as is show.php).

In other words: I just want PhpDig to index the URL
.../show.php?link=a (which incorporates a.htm)
but I do not want PhpDig to index the a.htm file itself as it is no web page but just a part of it.

Your suggestion to put Phpdig exclude and include brackets into a.htm would not work, because then the contents are also not indexed when the spider is trying to index show.php?link=a!

If PhpDig spiders the site from the root URL, it should never encounter a.htm, just show.php?link=a. But that is not what happens. It not only spiders the links and indexes the pages found that way, it also seems to read the remote filesystem and index every single file it finds. And that is not what I want it to do.

Charter

10-05-2003, 09:56 AM

Not sure if I understand completely, but you might try setting the following in the config file:

define('SPIDER_MAX_LIMIT',1); //max recurse levels in spider - default = 20
define('SPIDER_DEFAULT_LIMIT',1); //default value - default = 3
define('RESPIDER_LIMIT',1); //recurse limit for update - default = 4

so that the number of levels crawled is one.
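
To show what limiting the number of levels means in general, here is a toy breadth-first crawl over an in-memory link graph, written in Python. It is only an illustration of a depth limit under my own assumptions, not PhpDig's actual code; the crawl function and the sample pages are made up.

```python
from collections import deque


def crawl(links, start, max_depth):
    """Breadth-first walk over a link graph, stopping max_depth links from start.

    links maps a page to the list of pages it links to.
    Returns a dict mapping each reached page to its depth.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        depth = seen[page]
        if depth >= max_depth:
            continue  # do not follow links found on pages at the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen[target] = depth + 1
                queue.append(target)
    return seen


if __name__ == "__main__":
    # Hypothetical site structure
    site = {"/": ["/a", "/b"], "/a": ["/a1"], "/a1": ["/a2"]}
    for limit in (1, 2, 3):
        print(limit, sorted(crawl(site, "/", limit)))
```

With a limit of one, only the start page and the pages it links to directly are reached; anything deeper in the link tree needs a higher limit.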

Gecko

10-06-2003, 10:41 AM

I tried this, but unfortunately also without the result I hoped for. Setting the spider depth to 1 causes only my index page and the links found there to be spidered (which is only 5% of the entire site, the rest of the links are found in the next two levels in the site's link tree structure).

Increasing the spider depth to 2 allowed more of my site to be spidered, and more importantly: none of the files I mentioned in the original posting were found. But a number of sub-pages still weren't spidered.

Increasing the spider depth one step further to 3 results in the entire site being spidered, but also in indexing all the files in the subdirectories involved which are no part of the link tree.

Seems to be a bug in PhpDig?

Charter

10-06-2003, 03:00 PM

Hi. PhpDig is set to crawl any links it encounters at the given level. Not sure if "called by the php script" means that the PHP script is feeding the a.htm files via a.htm links. Does setting up a robots.txt file in web root so PhpDig doesn't crawl a.htm type files work?

Gecko

10-06-2003, 10:27 PM

Originally posted by Charter
Hi. PhpDig is set to crawl any links it encounters at the given level. Not sure if "called by the php script" means that the PHP script is feeding the a.htm files via a.htm links. Does setting up a robots.txt file in web root so PhpDig doesn't crawl a.htm type files work?

I have over 800 *.htm snippets in several subdirs. There is no link to any of them in HTML, all are handled by PHP-scripts where they are merged with HTML- and CSS- style templates via the PHP include function.

I already excluded the major part of these files by putting their directory as disallowed in robots.txt when the php-script is not located in the same directory as the *.htm snippets. This solves about 60% of the problem. But disallowing the remaining files by naming them directly in robots.txt is still a lot of work, and it is a workaround for a problem that should not exist in the first place.
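
Naming the remaining files in robots.txt need not be done by hand. A rough sketch in Python, assuming you have a list of the snippet paths (for example from a local copy of the site): the function robots_txt and the /articles/ paths are hypothetical, and "User-agent: *" is my assumption about which record the spider honours.

```python
def robots_txt(paths, user_agent="*"):
    """Build robots.txt content with one Disallow line per path."""
    # Normalise to root-relative paths and drop duplicates
    normalized = sorted({p if p.startswith("/") else "/" + p for p in paths})
    lines = [f"User-agent: {user_agent}"]
    lines.extend(f"Disallow: {p}" for p in normalized)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical snippet files that share a directory with their PHP script
    snippets = ["articles/a.htm", "articles/b.htm", "/articles/deeper/c.htm"]
    print(robots_txt(snippets), end="")
```

Directories that contain only snippets are still better excluded with a single Disallow line for the directory, as described above.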

I still find it curious that, when PhpDig is supposed to spider only HTML links, files are found that no HTML link points to. Previously I used search services like Atomz and Freefind (but they became too limited for my rapidly expanding site); they just spidered the links and nothing else.

rayvd

10-08-2003, 10:37 AM

I'm no phpDig expert (yet!), but I find it highly unlikely that phpDig is reading the remote filesystem. It should ONLY be able to find files (.html, .php or whatever) that are explicitly linked to by a visible index page. This could include a subdirectory that doesn't have an index page, but your webserver has directory listing enabled...

You must have a link somewhere that is pointing to these a, b, c.html files. I can't understand how phpDig would find them otherwise... I don't think it has a module for hacking into servers and reading filesystems :)
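
To illustrate the directory-listing case: an auto-generated listing is itself an HTML page full of hyperlinks, so any link-following spider that lands on it will pick up every file in that directory. A small sketch in Python using the standard html.parser module; the sample listing is a simplified stand-in I made up, not real server output.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all hyperlink targets found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Simplified, made-up example of what a directory listing page might contain
SAMPLE_LISTING = """<html><head><title>Index of /docs</title></head>
<body><h1>Index of /docs</h1>
<a href="a.htm">a.htm</a>
<a href="b.htm">b.htm</a>
<a href="show.php">show.php</a>
</body></html>"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_LISTING))
```

So a single link that resolves to a directory without an index page would be enough for the snippets in it to be found.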

Gecko

10-09-2003, 10:16 PM

Originally posted by rayvd
I'm no phpDig expert (yet!), but I find it highly unlikely that phpDig is reading the remote filesystem. It should ONLY be able to find files (.html, .php or whatever) that are explicitly linked to by a visible index page. This could include a subdirectory that doesn't have an index page, but your webserver has directory listing enabled...

I guess that must be it. It only happens in subdirs where no index page is present and where the spider is sent to an HTML page by at least one link.
Yesterday I ran a little experiment: I put a dummy file in one of those subdirs, built a new index, and ...... it appeared in the index. The next thing I will try is to upload a blank index page; perhaps that will do. Or does anyone know how to disable directory listing when no index file is present?

Charter

10-11-2003, 04:35 AM

Hi. Perhaps make a .htaccess file with the following line

Options -Indexes

and stick the file in the directory to prevent directory listings when no index page is present.