Howdy Folks,
I just installed PhpDig today and I'm impressed with what I've seen so far.
I want to use PhpDig to index specialized game development blogs. I am only interested in indexing the blog articles themselves and wish to ignore all other content on the blog website. You can view an example blog (mine) at this location:
http://www.gametableonline.com/blogs/wizwar/index.php
I need the spider to explore all documents on a website, but only index documents with a URL that contains "article.php". While I can modify my blogs, I cannot modify the blog software GTO uses, and even if I could, I'd have to modify several installations since every GTO project has a blog.
I can identify whether a URL is an actual blog article because it will contain the pattern "article.php?story=<story id>". The only way I can get links to the available blog articles is by extracting them from the index.php document (which paginates). So, in order to get JUST article links, I need to crawl any URLs containing index.php to extract the links, and I need to index only documents whose URL contains the pattern "article.php".
I've managed to modify the phpdigRewriteUrl function to return -1 (ignore, discard?) for URLs that don't contain article.php or index.php:
Code:
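// Return -1 for any URL that contains neither article.php nor index.php.
// Dots are escaped so they match a literal "." (case-insensitive match).
if (!preg_match('/article\.php|index\.php/i', $eval)) {
    return -1;
}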
The filter itself works very well: with this change the spider only indexes URLs containing index.php or article.php. However, due to the dynamic nature of the blog software, the search results aren't very helpful.
Unfortunately, each index.php page shows a brief summary of every blog article it lists, in addition to a direct link. Because each index.php page carries summaries of 10 blog articles, it usually gets a higher result score than the articles themselves for almost any search. So, before I get any results pointing directly to blog articles that contain my keyword, I usually get several links to index.php documents.
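To make the behaviour I'm after concrete, here is a rough sketch of the rule in plain PHP. The function name classifyBlogUrl and its three return values are my own invention for illustration, not anything in PhpDig, and I don't know whether PhpDig has a hook for the "follow but don't index" case:

```php
<?php
// Hypothetical helper (NOT part of PhpDig): decides what the spider
// should do with a URL on one of these blogs.
function classifyBlogUrl($url)
{
    // Blog articles are the only pages whose content should be indexed.
    if (preg_match('/article\.php\?story=/i', $url)) {
        return 'index';
    }
    // Paginated listing pages: read them for links, but keep their
    // text (the article summaries) out of the search index.
    if (stripos($url, 'index.php') !== false) {
        return 'follow';
    }
    // Everything else on the site is ignored.
    return 'skip';
}
```

With this rule, http://www.gametableonline.com/blogs/wizwar/index.php would be classified as 'follow', and any URL containing "article.php?story=" as 'index'.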
Given how the PhpDig system works, what do you think is the best way for me to modify the system for selective indexing?
Thanks for your time.
Michael McIntosh