Indexing "<word>-<word>"?


FaberFedor
01-28-2005, 02:12 PM
I haven't found this in the docs or the FAQs (or anywhere else for that matter) so I'm asking here.

How do I get PhpDig to index two (or more) words joined by a hyphen as one search item (as opposed to two search items)?

For example: the web page contains "foo-bar". After indexing, I can search for "foo", "bar", "foo bar" but NOT "foo-bar". I'd like to be able to search for "foo-bar" as well.

Suggestions?

FaberFedor
01-31-2005, 08:30 AM
Here's what I've found so far:

According to the docs, dashes (and many other special characters) have been allowed in indexes and searches since v1.8. Yet in phpdig_functions.php there is a function called phpdigEpureText() that seems to remove the special characters that the docs say are allowed.

Ho, ho! There is also an entry in search_function.php that removes various characters from the search functionality! If you also remove the dash from $what_query_chars in this file and reindex, you can now search for words with dashes in them!

At least it worked for me.

Charter
02-02-2005, 07:42 PM
The $what_query_chars variable negates a class of characters; the same goes for the phpdigEpureText function:

$what_query_chars = "[^".$phpdig_words_chars[PHPDIG_ENCODING]." \'.\_~@#$:&\%/;,=-]+";

if (eregi($what_query_chars,$query_to_parse)) {
$query_to_parse = eregi_replace($what_query_chars," ",$query_to_parse);
}

$text = ereg_replace('[^'.$phpdig_words_chars[$encoding].' \'._~@#$:&%/;,=-]+',' ',$text);

When you removed the dash from the class of characters, you essentially replaced the dash in a word with a space, so if you search on foo-bar, PhpDig will then search on foo and/or bar, not the whole word foo-bar.
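To make the effect of the negated class concrete, here is a minimal sketch, not PhpDig's actual code. It assumes a simplified word-character list ('a-z0-9' as a hypothetical stand-in for $phpdig_words_chars[PHPDIG_ENCODING]) and uses preg_replace() in place of eregi_replace(), since the ereg functions were removed in PHP 7. The filter_query() helper is made up for illustration.

```php
<?php
// Hypothetical stand-in for $phpdig_words_chars[PHPDIG_ENCODING];
// the real list depends on the configured encoding.
$words_chars = 'a-z0-9';

// Replace every run of characters that is NOT in the allowed set with a space.
// preg_replace() is used here as a stand-in for the thread's eregi_replace().
function filter_query($query, $words_chars, $allow_dash)
{
    // A trailing "-" inside [...] is a literal dash, not a range.
    $extras = ' \'._~@#$:&%/;,=' . ($allow_dash ? '-' : '');
    $pattern = '![^' . $words_chars . $extras . ']+!i';
    return preg_replace($pattern, ' ', $query);
}

echo filter_query('foo-bar', $words_chars, true), "\n";  // dash is in the allowed set
echo filter_query('foo-bar', $words_chars, false), "\n"; // dash falls outside it
```

With the dash in the allowed set the query stays "foo-bar"; with the dash taken out it becomes "foo bar", which is why the search then runs on foo and/or bar.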

Try searching on t-shirts (http://www.phpdig.net/demo/search.php?query_string=t-shirts) in the online demo. When PhpDig finds a word containing a dash in the chunk it's trying to process, it will try to highlight it. Also, try running the following query, and then search on some of the resultant words:

# add your table prefix if needed
SELECT keyword FROM keywords WHERE keyword LIKE '%-%';

What version of PhpDig are you running, and what kinds of search results do you get?

Note that when processing search requests, PhpDig displays the DISPLAY_SNIPPETS_NUM number of snippets, so if you are searching on several words, as soon as PhpDig hits DISPLAY_SNIPPETS_NUM, it quits looking for things to highlight.

Also, if you set DISPLAY_SNIPPETS to false and DISPLAY_SUMMARY to true, PhpDig will not consider DISPLAY_SNIPPETS_NUM and just display the first words of a page, highlighting only if the search words are within the first words of a page.

FaberFedor
02-04-2005, 06:18 AM
Quote:
When you removed the dash from the class of characters, you essentially replaced the dash in a word with a space, so if you search on foo-bar, PhpDig will then search on foo and/or bar, not the whole word foo-bar.
/Quote

Okay, that explains why the results highlight "foo bar" and not "foo-bar". There is no "foo-bar" in the database tables.

So how do I get phpDig to index "foo-bar"?

I'm running phpDig 1.8.7.

Charter
02-05-2005, 01:39 AM
PhpDig v.1.8.7 should index foo-bar as a word, assuming that the dash is a literal dash and the word foo-bar isn't caught up in some JavaScript. Also, if the hyphened word is longer than MAX_WORDS_SIZE, then it won't get inserted into the database table as a keyword. Try making a demo page with some hyphened words, and after you index it, see if you can search and find the hyphened words.

FaberFedor
02-08-2005, 03:31 PM
Quote:
Try making a demo page with some hyphened words, and after you index it, see if you can search and find the hyphened words.
/Quote

Done. Go to http://www.linuxnj.com and click on the "Ignore me" link. That will take you to the page with "Omni-Kuff" in it.

Go to http://www.linuxnj.com/search/search.php and search for "omni-kuff". No go. "omni", "kuff" and "omni kuff" work fine.

I'm going to start whittling the page in question to see if it's something in the page...

Charter
02-08-2005, 04:01 PM
Okay, thanks, I see what you mean. I was probably testing using a modified version by mistake. Anyway, in PhpDig v.1.8.7, find the phpdigCleanHtml function in robot_functions.php, look for the following line, and try removing the dash in the character class.

$text = eregi_replace("[*{}()\"\r\n\t-]+"," ",$text);
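As a rough illustration of what that one character does at indexing time, the sketch below applies the same character class with and without the dash. It is only an approximation under stated assumptions: clean_chars() is a hypothetical helper, not a PhpDig function, and preg_replace() stands in for eregi_replace().

```php
<?php
// Hypothetical helper mirroring the single line quoted above from phpdigCleanHtml;
// preg_replace() stands in for eregi_replace().
function clean_chars($text, $strip_dash)
{
    // With the trailing "-" in the class, dashes are replaced by spaces
    // along with the other listed characters.
    $class = '*{}()"\r\n\t' . ($strip_dash ? '-' : '');
    return preg_replace('/[' . $class . ']+/', ' ', $text);
}

echo clean_chars('Omni-Kuff', true), "\n";  // class as shipped: dash becomes a space
echo clean_chars('Omni-Kuff', false), "\n"; // dash removed from the class: word kept whole
```

With the dash in the class, "Omni-Kuff" reaches the indexer as "Omni Kuff", so no hyphenated keyword is ever stored; without it, the word passes through unchanged.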

FaberFedor
02-08-2005, 04:41 PM
Perfect!

Thanks loads!

Now let's see if I fixed the cron problem and everybody will be happy! :-)

djavet
02-22-2005, 05:09 AM
Hello,

Same problem, but not resolved.
I'm on 1.8.7 too.

Try a search for "0-26-110318-0" (an ISBN number):
http://www.john-howe.com/search/search.php?template_demo=phpdig.html&result_page=search.php&browse=1&query_string=0-26-110318-0+&limite=10&option=start

The indexed page:
http://www.john-howe.com/portfolio/gallery/details.php?image_id=76
The ISBN number is under the picture.

I can find it, but it is not displayed with the hyphens...
How can I correct the displayed results?

Regards, Dom

PS: I dropped the DB and reindexed the site to be sure, but I don't see that it had anything to do with the hyphen case...

Charter
02-22-2005, 04:38 PM
http://www.phpdig.net/forum/showpost.php?p=7713&postcount=7 :confused:

djavet
02-22-2005, 11:01 PM
Hello,

I'm confused too...
In my robot_functions.php, around line 147, I have:

function phpdigCleanHtml($text) {
//htmlentities
global $spec;

//replace blank characters by spaces
$text = eregi_replace("[*{}()\"\r\n\t]+"," ",$text);
//$text = ereg_replace("[\r\n\t]+"," ",$text); // original


with no "-" after the "\t"!

and around line 138 in search_function.php:

$what_query_chars = "[^".$phpdig_words_chars[PHPDIG_ENCODING]." \'.\_~@#$:&\%/;,=]+"; // epure chars \'._~@#$:&%/;,=-

if (eregi($what_query_chars,$query_to_parse)) {
$query_to_parse = eregi_replace($what_query_chars," ",$query_to_parse);
}

$query_to_parse = ereg_replace('(['.$phpdig_words_chars[PHPDIG_ENCODING].'])[\'.\_~@#$:&\%/;,=-]+($|[[:space:]]$|[[:space:]]['.$phpdig_words_chars[PHPDIG_ENCODING].'])','\1 \2',$query_to_parse);

$query_to_parse = trim(ereg_replace(" +"," ",$query_to_parse)); // no more than 1 blank


And nothing!

What I can't understand is that when I look at the temp file in the "text_content" folder, the number is written without the "-":

...SIBLEY HarperCollinsPublishers
ISBN 0 26 110318 0 September 2, 1994 R****m House Audio: The Two Towers CD...



I can't understand it...
Many thanks for your help and time.

Regards, Dominique

Charter
02-22-2005, 11:53 PM
This part is correct, no "-" after the "\t".

Quote:
I've on my robot_functions.php around line 147:

function phpdigCleanHtml($text) {
//htmlentities
global $spec;

//replace blank characters by spaces
$text = eregi_replace("[*{}()\"\r\n\t]+"," ",$text);
//$text = ereg_replace("[\r\n\t]+"," ",$text); // original


without "-" after "\t"!
/Quote

This part is not correct, change it back to the original code.

Quote:
and around line 138 in search_function.php:

$what_query_chars = "[^".$phpdig_words_chars[PHPDIG_ENCODING]." \'.\_~@#$:&\%/;,=]+"; // epure chars \'._~@#$:&%/;,=-

if (eregi($what_query_chars,$query_to_parse)) {
$query_to_parse = eregi_replace($what_query_chars," ",$query_to_parse);
}

$query_to_parse = ereg_replace('(['.$phpdig_words_chars[PHPDIG_ENCODING].'])[\'.\_~@#$:&\%/;,=-]+($|[[:space:]]$|[[:space:]]['.$phpdig_words_chars[PHPDIG_ENCODING].'])','\1 \2',$query_to_parse);

$query_to_parse = trim(ereg_replace(" +"," ",$query_to_parse)); // no more than 1 blank


And nothing!
/Quote

djavet
02-23-2005, 12:10 AM
Hello,

Thanks... but I still have the problem.
I replaced search_functions.php with the original 1.8.7 file
and kept "robot_functions.php" without the "-".

I deleted and reindexed the page, and in my temp file I again get the ISBN code without the "-":

...SIBLEY HarperCollinsPublishers
ISBN 0 26 110318 0 September 2, 199...

Indexed page:
http://www.john-howe.com/portfolio/gallery/details.php?image_id=76

Sorry, but I'm really confused...

Dom

Charter
02-23-2005, 12:28 AM
Spidering in progress... [Stop spider]
SITE : http://www.john-howe.com/
Exclude paths :
- ads/
- cgi-bin/
- fataneh/gallery/admin/
- flash/
- forum/
- guestbook/
- linkchecker/
- links/
- links/admin/
- mailinglist/
- news/pm/
- portfolio/gallery/admin/
- search/
- stuff/gallery/admin/
- webmail/
1:http://www.john-howe.com/portfolio/gallery/details.php?image_id=76
(time : 00:00:13)
No link in temporary table
links found : 1
http://www.john-howe.com/portfolio/gallery/details.php?image_id=76
Optimizing tables...
Indexing complete ! [Back] to admin interface.


Results 1-1, 1 total, on "ISBN" (0.05 seconds)

1. [100.00 %] :// John Howe :: Illustrator [ Portfolio ] / From Hobbiton to Mordor / Gandalf Before the Walls of Minas Tirith
limit to http://www.john-howe.com/, this path : portfolio/gallery/

...994 The Map of Tolkien's Middle-Earth Brian SIBLEY HarperCollinsPublishers ISBN - 0-26-110318-0 September 2, 1994 R****m House Audio: The Two Towers -...


Results 1-1, 1 total, on "0-26-110318-0" (0.02 seconds)

1. [100.00 %] :// John Howe :: Illustrator [ Portfolio ] / From Hobbiton to Mordor / Gandalf Before the Walls of Minas Tirith
limit to http://www.john-howe.com/, this path : portfolio/gallery/

... Map of Tolkien's Middle-Earth Brian SIBLEY HarperCollinsPublishers ISBN - 0-26-110318-0 September 2, 1994 R****m House Audio: The Two Towers - CD fro...

The only thing changed was:

//replace foo characters by space
$text = eregi_replace("[*{}()\"\r\n\t-]+"," ",$text);

to the following:

//replace foo characters by space
$text = eregi_replace("[*{}()\"\r\n\t]+"," ",$text);

Put back the original robot_functions.php and then make sure that you only take out the "-" and then delete the page, run the cleans, and then reindex.

djavet
02-23-2005, 12:46 AM
:bang: I dropped the database, the folder, everything, and did a fresh install with only the change to robot_functions.php, and... nothing.
Still the same damn thing!

Your version is 1.8.8 rc1, no? Mine is 1.8.7; maybe that's the point...
I don't know. Am I the only one with this problem on this version?
I can't upgrade to 1.8.8 rc1 because of my host's DB version...

A bug in 1.8.7?

Regards, Dom
PS: I'm really sorry to bother you with that.

Charter
02-23-2005, 12:53 AM
I'm using v.1.8.7 when testing with you.

djavet
02-23-2005, 01:38 AM
:confused: Uh, I just downloaded a fresh version and reinstalled the whole thing; still the same...

May I ask you to send me a zip file of your 1.8.7 version, so I can reinstall with your working version and compare it with mine?

Many thanks for your help.
Kindest regards, Dominique

Charter
02-23-2005, 02:15 AM
No need, you have what I have. I don't understand why removing the dash works for FaberFedor but not for you. :confused:

djavet
02-23-2005, 09:01 PM
I don't understand either.
Is it possible that you use a PHP feature that my host doesn't have?
-> http://www.john-howe.com/phpinfo.php


Regards, Dom

Charter
02-26-2005, 07:29 PM
I don't think PHP is an issue. When I get a chance, I'll do another fresh install of PhpDig v.1.8.7 and see how it works.

djavet
02-27-2005, 03:12 AM
Hmm, I'm back on this "-" case. I installed PhpDig 1.8.7 locally with the EasyPHP 1.7 package (http://www.easyphp.org/). Still the same...
I can't understand what's happening!

Any info, tricks, or similar experience?

Regards, Dom

Charter
02-27-2005, 10:51 PM
Okay, so I did a fresh install of PhpDig v.1.8.7 and the only change I made to the package was this:

In robot_functions.php find the phpdigCleanHtml function, and in this function find:

//replace foo characters by space
$text = eregi_replace("[*{}()\"\r\n\t-]+"," ",$text);

And replace that with the following:

//replace foo characters by space
$text = eregi_replace("[*{}()\"\r\n\t]+"," ",$text);

I then indexed http://www.john-howe.com/portfolio/gallery/details.php?image_id=76 from the admin panel textbox and afterwards did a search on ISBN and 0-26-110318-0 with the following search results being shown:

Results 1-1, 1 total, on "ISBN" (0.01 seconds)

1. [100.00 %] :// John Howe :: Illustrator [ Portfolio ] From Hobbiton to Mordor / Gandalf Before the Walls of Minas Tirith
limit to http://www.john-howe.com/, this path : portfolio/gallery/

...994 The Map of Tolkien's Middle-Earth Brian SIBLEY HarperCollinsPublishers ISBN - 0-26-110318-0 September 2, 1994 R****m House Audio: The Two Towers -...


Results 1-1, 1 total, on "0-26-110318-0" (0.01 seconds)

1. [100.00 %] :// John Howe :: Illustrator [ Portfolio ] From Hobbiton to Mordor / Gandalf Before the Walls of Minas Tirith
limit to http://www.john-howe.com/, this path : portfolio/gallery/

... Map of Tolkien's Middle-Earth Brian SIBLEY HarperCollinsPublishers ISBN - 0-26-110318-0 September 2, 1994 R****m House Audio: The Two Towers - CD fro...

Now the PHP eregi_replace function has been around for a while, so I don't think that the issue has anything to do with PHP, and because with the above change, I can get "0-26-110318-0" in the search results, I don't think it's a PhpDig issue either. Maybe it's a permissions issue, but without further information, I cannot be sure why you are having an issue with dashes.

djavet
02-27-2005, 11:03 PM
Hello Charter,

I can't see your reply. It's empty...
Is it a new feature? :)
Last week I could see your posts, but not any more. Did you change something in the forum?

I will pay in the future, but I don't like the way you force payment for the knowledge.
Why did you change this?
I had planned to make a donation once I finish the site search with a working script.

Regards, Dom

Admin Edit: See http://www.phpdig.net/forum/showthread.php?t=1745

djavet
02-28-2005, 03:35 AM
Thanks a lot.
I've found the little bug in my case (PhpDig 1.8.7) :banana: .

Find in robot_functions.php, line 190:
//replace foo characters by space
$text = eregi_replace("[*{}()\"\r\n\t-]+"," ",$text);


and replace with (without "-"):
//replace foo characters by space
$text = eregi_replace("[*{}()\"\r\n\t]+"," ",$text);


And the working case:
http://www.john-howe.com/search/...&query_string=barad-d%FBr (http://www.john-howe.com/search/search.php?template_demo=phpdig.html&result_page=search.php&browse=1&query_string=barad-d%FBr)

Many thanks for your help and time.
Regards, Dom