Thanks for the welcome
Quote:
Looks like you've gone a long way already toward solving your problem. I'm not sure just how well this will fit into what you're trying to do, but you can exclude certain text from being indexed (see this thread).
|
I've already found those tags, but unfortunately I do not have direct control over the content. I have access to *my* blog, but I am trying to index all game development blogs associated with the gametableonline.com website; mine is one of ten or so. We discuss relevant topics that come up during our projects, and sometimes a problem one developer runs into is one that another developer has already discussed. The individual blog sites are searchable, but that involves going to each and every blog and searching for the keyword you want.
I suppose that, as a severe hack, I could run a filter on each document after it has been fetched by the spider, so that an exclude tag is embedded immediately after the <body> tag. I'd really like a more graceful method of doing it if possible.
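For what it's worth, the post-fetch hack could be as small as this. This is just a sketch, not a PhpDig feature: it assumes you can patch the spider to call a filter on the fetched HTML, and it uses PhpDig's `<!-- phpdigExclude -->` / `<!-- phpdigInclude -->` comments to fence off the body text (Python here purely for illustration; the real patch would be PHP inside the spider):

```python
import re

# Hypothetical post-fetch filter: wrap everything inside <body> in
# PhpDig's exclude/include comments so the page text is skipped at
# index time. The hook point is an assumption -- PhpDig itself would
# need a small patch to run this on the fetched HTML.
def exclude_body_text(html: str) -> str:
    # Insert the exclude marker right after the opening <body ...> tag...
    html = re.sub(r'(<body[^>]*>)', r'\1<!-- phpdigExclude -->', html,
                  count=1, flags=re.IGNORECASE)
    # ...and the matching include marker just before </body>.
    html = re.sub(r'(</body>)', r'<!-- phpdigInclude -->\1', html,
                  count=1, flags=re.IGNORECASE)
    return html
```

The obvious downside, as with any such hack, is that it excludes everything on the page rather than just the summary text.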
Quote:
If you don't want to see summaries or page descriptions in the search results, make sure you have the following values in config.php:
|
Ah, I do want to see the summaries; the problem is that the search relevancy algorithm (term frequency / inverse document frequency based?) returns results I'd like to filter out. I love the summary of each doc and am impressed by the keyword highlighting.

Unfortunately, since index.php contains summaries of full documents, those pages get a really large relevancy boost, because most document summaries consist of exactly the important keywords. I need some way, on the spider side, to extract the links from any index.php page while ignoring its text content, since I normally cannot modify the websites themselves. I want the spider to traverse as many links as possible, but drop all text content except content from URL paths containing "article.php".
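The behaviour I'm after could be sketched like this. To be clear, this is not something PhpDig supports as far as I know; it's a Python mock-up of the crawl-vs-index split I'd want, using the standard library's HTML parser:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Harvest every link from a page, and separately collect its text.
class LinkHarvester(HTMLParser):
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    # Resolve relative hrefs against the page URL.
                    self.links.append(urljoin(self.base_url, value))

    def handle_data(self, data):
        self.text_parts.append(data)

def crawl_page(url: str, html: str):
    """Always return the links (so the crawl continues), but keep the
    page text only for article.php URLs -- index.php text is dropped."""
    parser = LinkHarvester(url)
    parser.feed(html)
    text = ' '.join(parser.text_parts) if 'article.php' in url else ''
    return parser.links, text
```

With something like that in the spider, index.php pages would still feed URLs into the crawl queue but contribute nothing to the index, so only the full articles would be scored.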
Quote:
Also, and again I don't know how your site is structured, if you have items that you don't want indexed that are restricted to a specific directory, have a look at this thread.
|
I'm familiar with robots.txt, but I lack that level of access to the websites in question. On top of that, it won't really help with the problem I'm having. The only programmatic method I have for extracting articles from the blogs is getting the URLs generated by the index.php script. I could manually go to each website and add every article to my list by hand, but that's ultimately unworkable for me and is exactly what I'm trying to avoid.
Leave it to me to have a "Square Peg, Round Hole" problem. ;P I'd better go get a hammer... ;P
I work with industrial-strength search engine solutions by day, but heaven knows I cannot afford the licensing required to use them for my small personal projects. Something like PhpDig is really nice, and I'm impressed with the quality. A lot of other projects seem to have very little documentation, but you guys even have forums.

Woot!
-Michael