PhpDig.net
What is PhpDig?
PhpDig is a PHP MySQL based
Web Spider & Search Engine.
What's PhpDig
PhpDig is a web spider and search engine written in PHP, using a MySQL database and
flat file support. PhpDig builds a glossary with words found in indexed pages. On a
search query, it displays a results page listing the pages that contain the search keys, ranked by occurrence.
Demo PhpDig
Fill the form with some words and click on the Go button. Note that only a portion of PhpDig.net was
indexed for this demo. If you wish to perform a complete search of the forums, please use
this link.
PhpDig can perform "and operator, exact phrase, or operator" searches. Prior to version 1.8.0,
PhpDig had different options. You can exclude a word by putting the "-" character before it.
Search on
apache sirvir
to demo the "did you mean" fuzzy matching.
PhpDig Features
HTTP Spidering : PhpDig follows HREF links as shown by any web browser to find the pages to index.
Links can also be in AreaMap, frames, or simple JavaScript. PhpDig supports redirections and indexes by
following links. PhpDig does not traverse directories or database tables to index content.
By default, PhpDig does not go outside of the domain you define for the indexing. Various index options
are chosen by the user, including a parameter to extend indexing to subdomains and a parameter to
limit the indexing to a specific directory.
You can limit indexing so that the maximum number of links found is ((X * Y) + 1), where X is the links limit and Y is the depth limit.
Alternatively, you can index just one page, or you can set options to index a greater number of pages.
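As a quick check of the formula above, the following is a minimal sketch in Python (not PhpDig code). The function name max_links and the sample values are assumptions chosen for illustration only.

```python
def max_links(links: int, depth: int) -> int:
    """Upper bound on links found, using the (X * Y) + 1 formula quoted above."""
    return links * depth + 1


# Hypothetical settings: links = 5, depth = 3
print(max_links(5, 3))  # (5 * 3) + 1 = 16
```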
Any HTML content is indexed, from static HTML pages to dynamic HTML pages produced by, say, PHP
scripts. PhpDig checks the MIME type of the document, and can be set to auto-index via a cron job.
Full-Text Indexing : PhpDig indexes all words of a document, but you can avoid common words by defining
such words in a text file. Underscores and other characters can be part of a word. Words in the title can be given
greater weight in ranking results.
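The idea of skipping common words and weighting title words can be sketched roughly as follows. This is an illustrative Python sketch, not PhpDig's actual implementation: the common-word list, the TITLE_WEIGHT value, and the function names are all assumptions.

```python
import re
from collections import Counter

# Assumed values for illustration; PhpDig reads its common words from a text file.
COMMON_WORDS = {"the", "a", "of", "and", "is"}
TITLE_WEIGHT = 3  # hypothetical multiplier, not a PhpDig default

# \w matches letters, digits, and underscores, so "search_engine" stays one word.
WORD_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text, split it into words, and drop common words."""
    return [w for w in WORD_RE.findall(text.lower()) if w not in COMMON_WORDS]


def index_page(title, body):
    """Count every body word once per occurrence; title words get extra weight."""
    weights = Counter(tokenize(body))
    for word in tokenize(title):
        weights[word] += TITLE_WEIGHT
    return weights


page = index_page(
    "Search engine",
    "The spider feeds the search_engine index; search is fast.",
)
print(page.most_common(3))
```

In this sketch "search" ends up with the highest weight because it appears both in the body and in the title, while "the" and "is" never reach the index.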
Note that the MySQL FULLTEXT index is different from the PhpDig full-text indexing. The MySQL FULLTEXT index
is a table index used with MyISAM tables. PhpDig does full-text indexing of page content but does not use the
MySQL FULLTEXT index for searches.
Indexed File Types :
PhpDig indexes HTML and text files by itself. PhpDig can also index PDF, MS-Word, MS-Excel, and MS-PowerPoint files
if you install external binaries on the server for this purpose.
To demonstrate the external binaries feature, you can search
Hamlet
(tragedy, Shakespeare, from MS-Word format) or
L'Avare
(comedy, Molière, from PDF format).
Other Features : PhpDig tries to read a robots.txt file at the server web root, and considers
META robots tags too. The last-modified header value is stored in the database to avoid redundant
indexing. Also, the meta revisit-after tag is considered.
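To show what honoring a robots.txt file means for a spider, here is a minimal Python sketch using the standard library's urllib.robotparser. It is not how PhpDig itself is written; the robots.txt content, the agent name ExampleSpider, and the example.com URLs are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as it might be served from the web root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(url, robots_txt=ROBOTS_TXT, agent="ExampleSpider"):
    """Return True if the given robots.txt rules let the agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(allowed("http://www.example.com/index.html"))
print(allowed("http://www.example.com/private/notes.html"))
```

A spider would run a check like this before requesting each page, and skip any URL for which it returns False.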
Limits : Because indexing is time consuming, PHP must not run in safe_mode and the
server that performs the indexing must not time out. Also, the PHP allow_url_fopen option must be enabled. These limits
do not apply to search queries.
Spidering and indexing are somewhat slow, since each page requires a fair amount of processing.
On the other hand, search queries are fast enough, even in a somewhat extended context.