PhpDig.net

Old 10-10-2004, 03:53 PM   #1
Xavian
Green Mole
 
Join Date: Oct 2004
Location: Western Massachusetts
Posts: 6
Full Link Exploration with Selective Content Indexing

Howdy Folks,

I just installed PhpDig today and am impressed with what I've seen so far.

I want to use PhpDig to index specialized game development blogs. I am only interested in indexing the blog articles themselves and wish to ignore all other content on the blog website. You can view an example blog (mine) at this location: http://www.gametableonline.com/blogs/wizwar/index.php

I need the spider to explore all documents on a website, but only index documents with a URL that contains "article.php". While I can modify my blogs, I cannot modify the blog software GTO uses, and even if I could, I'd have to modify several installations, since every GTO project has a blog.

I can identify whether a URL is an actual blog article because it will contain the pattern "article.php?story=<story id>". The only way I can get links to the available blog articles is by extracting them from the index.php document (which paginates). So, in order to get JUST article links, I need to fetch any URLs containing index.php to extract the links, and index only documents whose URL contains the pattern "article.php".

I've managed to modify the phpdigRewriteUrl function to return -1 (ignore, discard?) for Urls that don't contain article.php or index.php:

Code:
// Dots are escaped so "." matches a literal dot rather than any character.
if (!eregi("article\.php|index\.php", $eval)) {
   return -1;
}
It works very well: with this method the spider only indexes URLs containing index.php or article.php. However, due to the dynamic nature of the blog software, the search results still aren't very helpful.
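The distinction described above (fetch index.php pages only for their links, index only article pages, discard everything else) can be sketched as a small classifier. This is a minimal sketch, not PhpDig code: the function name classifyBlogUrl and its three return values are hypothetical, and preg_match is used instead of eregi because the ereg family was removed in PHP 7.0.

```php
<?php
// Hypothetical helper, not part of PhpDig: decide what to do with a URL.
//   'index'  = article page, its text should be indexed
//   'follow' = paginated index page, fetched only to harvest links
//   'skip'   = anything else, discarded from the crawl
function classifyBlogUrl($url)
{
    // Article pages carry the pattern "article.php?story=<story id>".
    if (preg_match('/article\.php\?story=/i', $url)) {
        return 'index';
    }
    // Index pages are the only source of links to articles.
    if (preg_match('/index\.php/i', $url)) {
        return 'follow';
    }
    return 'skip';
}
```

The URL filter shown in the post only distinguishes "keep" from "discard"; the extra 'follow' state is what the rest of this thread is trying to get out of the spider.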

Unfortunately, the index.php document returns a brief summary for each available blog in addition to a direct link. When I search for anything, index.php will usually have a higher result score because each index.php page has summaries of 10 blog articles per page. So, usually before I get any results directly to blog articles that contain my keyword, I get several links to index.php documents.

Given how the PhpDig system works, what do you think is the best way for me to modify the system for selective indexing?

Thanks for your time.

Michael McIntosh

Last edited by Xavian; 10-10-2004 at 03:59 PM.
Old 10-10-2004, 08:00 PM   #2
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
Welcome to the forum, Xavian. We're glad to have you here.

Looks like you've gone a long way already toward solving your problem. I'm not sure just how well this will fit into what you're trying to do, but you can exclude certain text from being indexed (see this thread).

If you don't want to see summaries or page descriptions in the search results, make sure you have the following values in config.php:
Code:
define('DISPLAY_SNIPPETS',false);
define('DISPLAY_SUMMARY',false);
Also, and again I don't know how your site is structured, if you have items that you don't want indexed that are restricted to a specific directory, have a look at this thread.

Hope this helps.
Old 10-10-2004, 08:32 PM   #3
Xavian
Thanks for the welcome

Quote:
Looks like you've gone a long way already toward solving your problem. I'm not sure just how well this will fit into what you're trying to do, but you can exclude certain text from being indexed (see this thread).
I've already found those tags and unfortunately, I do not have direct control over the content. I have access to *my* blog, but I am trying to index all game development blogs associated with the gametableonline.com website. My blog is one of something like 10 other blogs. We discuss relevant topics that come up during our projects, and sometimes a problem one developer runs into is one another developer has already discussed. The individual blog sites are searchable, but it involves going to each and every blog and searching for the keyword you want.

I suppose that, as a severe hack, I could run a filter on a document after it has been fetched by the spider so that an exclude tag is embedded immediately after the <body> tag of the document. I really want a more graceful method of doing it if possible.

Quote:
If you don't want to see summaries or page descriptions in the search results, make sure you have the following values in config.php:
Ah, I do want to see the summaries; the problem is that the search relevancy algorithm (term frequency inverse document frequency based?) returns results I'd like to filter out. I love the summary of each doc and am impressed at the keyword highlighting. Unfortunately, since the index.php pages contain summaries of full documents, they get a really large relevancy boost, because most document summaries consist of the really important keywords. I need some way to extract links from any index.php docs and ignore the text content of those index.php docs from the spider side, since I normally cannot modify the websites themselves. I want the spider to traverse as many links as possible, but drop all text content except that from URL paths containing "article.php".

Quote:
Also, and again I don't know how your site is structured, if you have items that you don't want indexed that are restricted to a specific directory, have a look at this thread.
I'm familiar with robots.txt, but I lack that level of access to the websites in question. On top of that, it won't really help me with the problem I am having. The only programmatic method I have for extracting articles from the blogs is by getting URLs generated from the index.php script. I could manually go to the websites and add each and every article to my list by hand, but that's ultimately unworkable for me and is what I'm trying to avoid.

Leave it to me to have a "Square Peg, Round Hole" problem. ;P I better go get a hammer... ;P

I work with industrial strength search engine solutions by day, but heaven knows I cannot afford the licensing required to use them for my small personal projects. Something like PhpDig is really nice and I am impressed with the quality. A lot of other projects seem to have very little documentation, but you guys even have forums. Woot!

-Michael
Old 10-10-2004, 08:57 PM   #4
vinyl-junkie
Since you don't have any control over the server, I'm afraid your only option is more custom code. Wish I could be of more help.
Old 10-10-2004, 09:34 PM   #5
Xavian
Thanks for all your help anyways, vinyl-junkie. I suspected I'd have to roll my own, I'm just trying to avoid re-inventing the wheel if there is an easy way to introduce this functionality.

Last edited by Xavian; 10-10-2004 at 09:49 PM.
Old 10-11-2004, 09:08 AM   #6
Xavian
Article Comment Tags...

Aha! I think I see a way it's possible, and I'm curious if you guys have any suggestions on where I should look to implement it...

I've examined the HTML source generated by the blog software, and I can see that it uses the comments <!-- ARTICLE START --> and <!-- ARTICLE END --> to mark where the article begins and ends in the HTML code.

So, a revised filter will explore all documents on the blog website, but only index text contained between the <!-- ARTICLE START --> and <!-- ARTICLE END --> tags.
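A minimal sketch of that idea, assuming the two comments appear exactly as quoted above and at most once per page. The function name extractArticle is hypothetical and this is not how PhpDig itself processes a page; it only shows the "keep what lies between the markers" step in isolation.

```php
<?php
// Hypothetical helper: return only the text between the blog software's
// article markers, or an empty string when either marker is missing.
function extractArticle($html)
{
    $start = '<!-- ARTICLE START -->';
    $end   = '<!-- ARTICLE END -->';

    $from = strpos($html, $start);
    if ($from === false) {
        return '';               // no article on this page (e.g. index.php)
    }
    $from += strlen($start);     // skip past the start marker itself

    $to = strpos($html, $end, $from);
    if ($to === false) {
        return '';               // unterminated article section
    }
    return trim(substr($html, $from, $to - $from));
}
```

Pages without the markers yield an empty string, so they could still be crawled for links while contributing no text to the index.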

I'll look at the spider mechanism for text exclusion and see if I can kludge my own.

Another alternative is for me to add results filtering. The template engine code seems complicated, but if I can intercept the query results list before they are rendered, I could iterate the list and remove certain documents based upon URL so that only urls containing "article.php" are output in the search results. That seems the easiest solution actually.
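As a rough sketch of that results-filtering alternative: the shape of PhpDig's result list is not shown in this thread, so the code below assumes a plain array of rows, each with a 'url' key. Both that structure and the function name keepArticleResults are hypothetical.

```php
<?php
// Hypothetical post-query filter: keep only rows whose URL contains
// "article.php". Assumes each row is an array with a 'url' key.
function keepArticleResults(array $results)
{
    $kept = array();
    foreach ($results as $row) {
        if (isset($row['url']) && strpos($row['url'], 'article.php') !== false) {
            $kept[] = $row;
        }
    }
    return $kept;
}
```

One trade-off with this approach: any result count or paging computed before the filter runs would probably no longer match what is displayed.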

Do you guys have any diagrams of how the system works, what the various tables are used for and stuff like that? I'd be happy to submit my mods if I can get this to work.

-Michael
Old 10-11-2004, 09:48 AM   #7
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
An idea to try...

In robot_functions.php find:
PHP Code:
foreach ($file_content as $num => $line) {
    if (trim($line)) {
        if ($content_type == 'HTML' && trim($line) == PHPDIG_EXCLUDE_COMMENT) {
            $exclude = true;
        }
        else if (trim($line) == PHPDIG_INCLUDE_COMMENT) {
            $exclude = false;
            continue;
        }
and replace with:
PHP Code:
if ($file && eregi("index.php",$file)) {
    // tags must be on their own lines
    $the_exclude_comment = "<html>";
    $the_include_comment = "</html>";
}
else {
    $the_exclude_comment = PHPDIG_EXCLUDE_COMMENT;
    $the_include_comment = PHPDIG_INCLUDE_COMMENT;
}

foreach ($file_content as $num => $line) {
    if (trim($line)) {
        if ($content_type == 'HTML' && trim($line) == $the_exclude_comment) {
            $exclude = true;
        }
        else if (trim($line) == $the_include_comment) {
            $exclude = false;
            continue;
        }
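To see why swapping the markers to <html> and </html> drops all text from index.php pages, here is a stand-alone sketch of the same line scan. It is an illustration only: the function name keepIncludedLines is hypothetical, the $content_type check is omitted, and because the fragment above does not show what the real loop does with each line, kept lines are simply collected into an array.

```php
<?php
// Hypothetical stand-alone version of the exclude/include line scan.
// Each marker must sit on its own line to be recognised.
function keepIncludedLines(array $lines, $excludeMarker, $includeMarker)
{
    $exclude = false;
    $kept = array();
    foreach ($lines as $line) {
        $trimmed = trim($line);
        if ($trimmed === '') {
            continue;                 // ignore blank lines
        }
        if ($trimmed == $excludeMarker) {
            $exclude = true;          // stop keeping text from here on
        } elseif ($trimmed == $includeMarker) {
            $exclude = false;         // resume after this marker line
            continue;
        }
        if (!$exclude) {
            $kept[] = $trimmed;
        }
    }
    return $kept;
}

// An index.php-style page: everything sits between <html> and </html>,
// so nothing is kept.
$indexPage = array('<html>', '<body>', 'Summary of article one', '</body>', '</html>');
$keptFromIndex = keepIncludedLines($indexPage, '<html>', '</html>');
```

Because the scan only flags lines for text exclusion, the page can still be fetched and its links followed, which is the behaviour the original poster asked for.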
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 10-11-2004, 12:03 PM   #8
Xavian
That's an awesome idea! I'm glad I checked here first! I'll try that when I get a chance this evening... Thanks!

Last edited by Xavian; 10-11-2004 at 12:06 PM.
Old 10-11-2004, 08:37 PM   #9
Xavian
It's... It's... ALIVE!

Thanks to your great pointers I got the spider and the engine working the way I needed it. The results are great. You can check it out at:

http://michael.nervestaple.com/gto/blogsearch/

Some good example keywords would be "wiz-war" or "game"...

I will be indexing more blogs tomorrow, and later on this week I'll post what I modified to show how I did it. I ended up having to modify spider.php as well, right before the spider calls the phpdigIndexFile function.

For now, I gotta catch some Zzzzzs...

-Michael