PhpDig.net

What's PhpDig
PhpDig is a web spider and search engine written in PHP that uses a MySQL database with flat file support. PhpDig builds a glossary from the words found in indexed pages. On a search query, it displays a result page containing the search keys, ranked by occurrence.
Demo PhpDig
Fill the form with some words and click on the Go button. Note that only a portion of PhpDig.net was indexed for this demo. If you wish to perform a complete search of the forums, please use this link.
 
PhpDig can perform "and operator", "exact phrase", and "or operator" searches. Prior to version 1.8.0, PhpDig had different options. You can exclude a word by putting the "-" character before it. Search for "apache sirvir" to demo the "did you mean" fuzzy matching.
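
The three search modes and the "-" exclusion can be illustrated with a short Python sketch. It is not PhpDig's code (PhpDig is written in PHP and searches its MySQL tables); the function name `matches` and the mode strings are hypothetical, and a page is modelled as a plain string of words.

```python
def matches(query, text, mode="and"):
    """Return True if `text` satisfies `query` under the given mode.

    `mode` is "and", "or" or "exact" (hypothetical names for the three
    search options). Words prefixed with "-" must not appear in the text.
    """
    terms = query.lower().split()
    excluded = [t[1:] for t in terms if t.startswith("-") and len(t) > 1]
    wanted = [t for t in terms if not t.startswith("-")]
    words = text.lower().split()

    # An excluded word anywhere in the page rejects the page outright.
    if any(word in words for word in excluded):
        return False
    if not wanted:
        return False

    if mode == "and":
        return all(t in words for t in wanted)
    if mode == "or":
        return any(t in words for t in wanted)
    if mode == "exact":
        # The wanted words must appear consecutively and in order.
        n = len(wanted)
        return any(words[i:i + n] == wanted for i in range(len(words) - n + 1))
    raise ValueError(f"unknown mode: {mode}")
```

For example, `matches("apache -web", "the apache web server")` is rejected because the page contains the excluded word "web".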
PhpDig Features
HTTP Spidering : PhpDig follows HREF links, as a web browser would, to find the pages to index. Links can also appear in image maps (AREA tags), frames, or simple JavaScript. PhpDig follows redirections. Because it indexes only by following links, PhpDig does not traverse directories or database tables to index content.

By default, PhpDig does not go outside of the domain you define for the indexing. Various index options are chosen by the user, including a parameter to extend indexing to subdomains and a parameter to limit the indexing to a specific directory.
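
As a rough illustration of link following with a domain restriction, here is a minimal Python sketch using only the standard library. It is not PhpDig's implementation: `LinkCollector`, `links_to_follow`, `allow_subdomains` and `path_prefix` are hypothetical names, and links embedded in JavaScript are not handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect link targets from <a>, <area>, <frame> and <iframe> tags."""

    # Which attribute carries the link for each tag we care about.
    LINK_ATTRS = {"a": "href", "area": "href", "frame": "src", "iframe": "src"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_name = self.LINK_ATTRS.get(tag)
        if attr_name is None:
            return
        for name, value in attrs:
            if name == attr_name and value:
                self.links.append(value)


def links_to_follow(page_url, html, allow_subdomains=False, path_prefix="/"):
    """Return absolute URLs found in `html` that stay inside the start domain."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(page_url).hostname
    result = []
    for href in collector.links:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        host = parts.hostname or ""
        same_host = host == base_host
        sub_host = allow_subdomains and host.endswith("." + base_host)
        if not (same_host or sub_host):
            continue
        if not (parts.path or "/").startswith(path_prefix):
            continue  # outside the directory the index is limited to
        if absolute not in result:
            result.append(absolute)
    return result
```

Passing `allow_subdomains=True` or a narrower `path_prefix` such as `"/docs/"` mimics the two indexing parameters described above.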

You can limit indexing so that the maximum links found is ((X * Y) + 1) where X is links and Y is depth. Alternatively, you can index just one page, or you can set options to index a greater number of pages.
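
The limit formula is simple arithmetic, shown here as a tiny Python sketch. The function name `max_links` is hypothetical; the extra 1 presumably accounts for the starting page, although the description above does not say so.

```python
def max_links(links, depth):
    """Upper bound on links found: (links * depth) + 1."""
    if links < 0 or depth < 0:
        raise ValueError("links and depth must be non-negative")
    return links * depth + 1
```

With 5 links and a depth of 3, the bound is (5 * 3) + 1 = 16. With a depth of 0 it is 1, which corresponds to indexing a single page.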

Any HTML content is indexed, from static HTML pages to dynamic HTML pages produced by, say, PHP scripts. PhpDig checks the MIME type of each document, and can be set to auto-index via a cron job.

Full-Text Indexing : PhpDig indexes all words of a document, but you can avoid common words by defining such words in a text file. Underscores and other characters can be part of a word. Words in the title can have a more important weight in ranking results.
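
The idea of skipping common words and giving title words extra weight can be sketched in Python as follows. This is an illustration only, not PhpDig's algorithm: `index_page` is a hypothetical name, the weight of 3 for title words is an arbitrary assumption, and the common-word list is passed in as a set rather than read from a text file.

```python
import re
from collections import Counter

# \w matches letters, digits and the underscore, so "search_engine"
# is kept as a single word.
WORD_RE = re.compile(r"\w+")


def index_page(title, body, stop_words, title_weight=3):
    """Return a Counter mapping each indexed word to its weight."""
    weights = Counter()
    for word in WORD_RE.findall(body.lower()):
        if word not in stop_words:
            weights[word] += 1  # one point per occurrence in the body
    for word in WORD_RE.findall(title.lower()):
        if word not in stop_words:
            weights[word] += title_weight  # title words count for more
    return weights
```

A word that appears once in the body and once in the title therefore ends up with a weight of 1 + 3 = 4.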

Note that the MySQL FULLTEXT index is different from the PhpDig full-text indexing. The MySQL FULLTEXT index is a table index used with MyISAM tables. PhpDig does full-text indexing of page content but does not use the MySQL FULLTEXT index for searches.

Indexed File Types : PhpDig indexes HTML and text files by itself. PhpDig can also index PDF, MS-Word, MS-Excel, and MS-PowerPoint files if you install external binaries on the server for this purpose.

To demonstrate the external binaries feature, you can search Hamlet (tragedy, Shakespeare, from MS-Word format) or L'Avare (comedy, Molière, from PDF format).

Other Features : PhpDig tries to read a robots.txt file at the server web root, and considers META robots tags too. The last-modified header value is stored in the database to avoid redundant indexing. Also, the meta revisit-after tag is considered.
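
The robots.txt check and the last-modified comparison can be sketched with Python's standard library. This is a hedged illustration rather than PhpDig's code: `allowed_by_robots` and `needs_reindex` are hypothetical names, the robots.txt content is passed in as a string instead of being fetched, and META robots and revisit-after tags are not covered.

```python
from email.utils import parsedate_to_datetime
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def needs_reindex(stored_last_modified, current_last_modified):
    """Compare two HTTP Last-Modified header values.

    Re-index when either value is missing or the page is newer than
    the copy recorded at the previous indexing run.
    """
    if not stored_last_modified or not current_last_modified:
        return True
    stored = parsedate_to_datetime(stored_last_modified)
    current = parsedate_to_datetime(current_last_modified)
    return current > stored
```

A spider would call `allowed_by_robots` before requesting a page and `needs_reindex` after reading the response headers, skipping pages that have not changed.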

Limits : Because indexing is time consuming, PHP must not run in safe_mode and the server that performs the indexing must not time out. Also, the PHP allow_url_fopen option must be enabled. These restrictions do not apply to search queries.

Spidering and indexing are a bit slow, as a decent amount of processing is needed to index pages. On the other hand, search queries are fast enough, even on a somewhat larger index.


Copyright © 2001 - 2005, ThinkDing LLC. All Rights Reserved.