limit search to contents of HTML tags?


beesman
12-14-2004, 02:31 AM
Hi all,

I'm testing PhpDig for the first time, & while this forum is a great resource, having trawled through all the messages I can't find a solution to my problem, so any help would be greatly appreciated.

Say I have a number of HTML files with the same structure, e.g. articles with a title in <h2></h2> tags, sub-heading in <h3></h3> tags & the main content in <p class="main"></p> paras. Is it possible to set up PhpDig so that, for example, users can query title text only? Or is there an indexing solution to this issue?

Thanks in advance.

vinyl-junkie
12-14-2004, 03:24 AM
Welcome to the forum, beesman. :D

Searching by title within a page is not something phpdig was designed to do. I don't know how much interest there would be in it. phpdig could probably be modified fairly easily to search by web page titles, but that's probably the only change of this kind that Charter would be willing to make.

Hope this helps.

beesman
12-14-2004, 06:24 AM
Hi, & thanks for the speedy reply :)

Just say no if it's a request too far, but could you point me in the direction of the relevant file &/or chunk of code that I'd have to play with?

Many thanks

vinyl-junkie
12-14-2004, 06:02 PM
Look at admin/spider.php. That is probably what you'd need to modify to make the kind of search index you want.

Hope this helps. :)

Spider
12-15-2004, 02:41 AM
@beesman: you are one day ahead of me in posting this question. I will look into it, but I'm a PHP newbie. If you or somebody else writes a solution, I'd like to use it too.

Charter
12-15-2004, 02:53 AM
Look at the phpdigCleanHtml function in the robot_functions.php file.
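
For the original question (querying only the text inside <h2> tags), one possible approach is to pull the heading text out of the page before the tags are stripped. The following is a minimal sketch only, not PhpDig's code: the helper name extractTagText and the sample page are hypothetical, and how the result would be wired into phpdigCleanHtml is left open.

```php
<?php
// Hypothetical helper, not part of PhpDig: collect the plain text found
// inside every occurrence of one tag (for example h2) in an HTML string.
function extractTagText(string $html, string $tag): array
{
    $name = preg_quote($tag, '/');
    // Non-greedy match so each element is captured separately;
    // "i" ignores case, "s" lets "." span line breaks.
    $pattern = '/<' . $name . '\b[^>]*>(.*?)<\/' . $name . '>/is';
    preg_match_all($pattern, $html, $matches);

    $texts = [];
    foreach ($matches[1] as $inner) {
        // Drop nested markup and collapse runs of whitespace.
        $plain = trim(preg_replace('/\s+/', ' ', strip_tags($inner)));
        if ($plain !== '') {
            $texts[] = $plain;
        }
    }
    return $texts;
}

// Sample page (made up for illustration) with the structure from the question.
$page = '<h2>First <em>title</em></h2><h3>Sub</h3><p class="main">Body</p><H2>Second</H2>';
$titles = extractTagText($page, 'h2');
```

Text collected this way could then be indexed on its own, separately from the body text.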

Spider
12-15-2004, 07:05 AM
I placed this in robot_functions.php at line 161

$text = eregi_replace("<td[^>]*>.*</td>"," ",$text);

Because all my content is between td tags, I expected phpdig to show me nothing after this change. But phpdig still finds everything. Did I make a mistake?

Charter
12-15-2004, 07:23 AM
Did you reindex?

Spider
12-15-2004, 08:06 AM
Yes, emptied the database and reindexed.

Charter
12-15-2004, 08:17 AM
So you have the following?

$text = eregi_replace("<td[^>]*>.*</td>"," ",$text);
$text = preg_replace("/<[\/\!]*?[^<>]*?>/is"," ",$text);

The first removes stuff between <td...> and </td> (according to CHUNK_SIZE) and the second removes other tag-like things, so you don't really need the first one. If you want to exclude part of a page, look at this (http://www.phpdig.net/forum/showthread.php?t=1430) thread or look at how $title is set in the phpdigCleanHtml function in the robot_functions.php file.
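
To see what each of the two replacements does on its own, here is a small standalone sketch. It makes some assumptions: eregi_replace() was removed in PHP 7.0, so the first pattern is rewritten as a case-insensitive preg_replace() with a non-greedy ".*?" (the posted pattern is greedy), the sample HTML is made up, and PhpDig's CHUNK_SIZE handling is not reproduced.

```php
<?php
// Made-up sample input; in PhpDig the text would come from the spidered page.
$html = '<p>Intro</p><table><tr><td class="a">Cell text</td></tr></table><p>Outro</p>';

// Step 1: remove each <td ...>...</td> element together with its contents.
// This is a preg_replace() stand-in for the eregi_replace() line above.
$withoutCells = preg_replace('/<td[^>]*>.*?<\/td>/is', ' ', $html);

// Step 2: replace anything tag-like with a space; text between tags stays.
$tagPattern = "/<[\/\!]*?[^<>]*?>/is";
$textOnly = preg_replace($tagPattern, ' ', $withoutCells);

// Applying only step 2 to the original keeps the cell text in the output.
$tagsStrippedOnly = preg_replace($tagPattern, ' ', $html);
```

In this sketch $textOnly no longer contains the cell text, while $tagsStrippedOnly still does, because the tag-stripping pattern removes only the tags and not what sits between them.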

Spider
12-15-2004, 11:41 AM
Thanks Charter, phpdigExclude and phpdigInclude do it for me! I didn't see that function till now.

:santa:
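
The thread does not show how phpdigExclude and phpdigInclude are written in a page. Assuming they are placed as HTML comments around the part to skip, the idea can be sketched as below. The function removeExcludedRegions is hypothetical and is not PhpDig's own implementation.

```php
<?php
// Hypothetical sketch, not PhpDig's own code: drop everything between an
// exclude marker and the next include marker. The comment-style markers
// are an assumption about how the page is annotated.
function removeExcludedRegions(string $html): string
{
    // If no include marker follows, the excluded region runs to the end.
    $pattern = '/<!--\s*phpdigExclude\s*-->.*?(<!--\s*phpdigInclude\s*-->|$)/is';
    return preg_replace($pattern, ' ', $html);
}

// Made-up sample: a menu and a footer that should stay out of the index.
$sample = 'Keep A <!-- phpdigExclude -->menu<!-- phpdigInclude --> Keep B <!-- phpdigExclude -->footer';
$indexable = removeExcludedRegions($sample);
```

Only the text outside the marked regions would then reach the indexer.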