PhpDig.net

How-to Forum: Indexing "<word>-<word>"? (http://www.phpdig.net/forum/showthread.php?t=1788)

FaberFedor 01-28-2005 02:12 PM

Indexing "<word>-<word>"?
 
I haven't found this in the docs or the FAQs (or anywhere else for that matter) so I'm asking here.

How do I get PHPDig to index two (or more) words with a hyphen in them as one search-item (as opposed to two search-items)?

For example: the web page contains "foo-bar". After indexing, I can search for "foo", "bar", "foo bar" but NOT "foo-bar". I'd like to be able to search for "foo-bar" as well.

Suggestions?

FaberFedor 01-31-2005 08:30 AM

Here's what I've found so far:

According to the docs, dashes (and many other special characters) are allowed in indexes and searches since v1.8. Yet, in phpdig_functions.php there is a function called phpdigEpureText() that seems to be removing the special characters that the docs say are allowed.

Ho, ho! There is also an entry in search_function.php that removes various characters from the search functionality! If you also remove the dash from $what_query_chars in this file and reindex, you can now search for words with dashes in them!

At least it worked for me.

Charter 02-02-2005 07:42 PM

The $what_query_chars variable negates a class of characters; the same goes for the phpdigEpureText function:
Code:

$what_query_chars = "[^".$phpdig_words_chars[PHPDIG_ENCODING]." \'.\_~@#$:&\%/;,=-]+";

if (eregi($what_query_chars,$query_to_parse)) {
        $query_to_parse = eregi_replace($what_query_chars," ",$query_to_parse);
}

$text = ereg_replace('[^'.$phpdig_words_chars[$encoding].' \'._~@#$:&%/;,=-]+',' ',$text);

When you removed the dash from the class of characters, you essentially replaced the dash in a word with a space, so if you search on foo-bar, PhpDig will then search on foo and/or bar, not the whole word foo-bar.
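
The effect of the negated class can be sketched outside PhpDig. The following is a minimal illustration in Python (not PHP, since the ereg functions used above were removed in PHP 7). It assumes plain ASCII letters and digits in place of $phpdig_words_chars[PHPDIG_ENCODING], and clean_query is a made-up helper name, not a PhpDig function:

```python
import re

# Assumption: ASCII letters and digits stand in for
# $phpdig_words_chars[PHPDIG_ENCODING], which varies by encoding.
WORD_CHARS = "a-zA-Z0-9"
EXTRA_WITH_DASH = " '._~@#$:&%/;,=-"   # trailing dash is a literal dash
EXTRA_WITHOUT_DASH = " '._~@#$:&%/;,="


def clean_query(query, extra):
    """Replace every run of characters outside the allowed class with a space."""
    pattern = "[^" + WORD_CHARS + extra + "]+"
    return re.sub(pattern, " ", query)


print(clean_query("foo-bar", EXTRA_WITH_DASH))     # foo-bar
print(clean_query("foo-bar", EXTRA_WITHOUT_DASH))  # foo bar
```

With the dash inside the class it is an allowed character and survives; once it is taken out, it falls into the negated set and is replaced by a space.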

Try searching on t-shirts in the online demo. When PhpDig finds a word containing a dash in the chunk it's trying to process, it will try to highlight it. Also, try running the following query, and then search on some of the resultant words:
Code:

# add your table prefix if needed
SELECT keyword FROM keywords WHERE keyword LIKE '%-%';
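
To see what that query returns without touching a live index, here is a small sketch in Python using an in-memory SQLite database. The one-column keywords table and the sample rows are assumptions for illustration only; the real PhpDig table lives in MySQL and may have other columns and a table prefix:

```python
import sqlite3

# Assumption: a simplified, one-column stand-in for the keywords table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keywords (keyword TEXT)")
conn.executemany(
    "INSERT INTO keywords (keyword) VALUES (?)",
    [("foo",), ("bar",), ("t-shirts",), ("foo-bar",)],
)

# Same LIKE pattern as above: any keyword containing a dash.
rows = conn.execute(
    "SELECT keyword FROM keywords WHERE keyword LIKE '%-%' ORDER BY keyword"
).fetchall()
hyphenated = [row[0] for row in rows]
print(hyphenated)  # ['foo-bar', 't-shirts']
```

If the query comes back empty on a real index, no hyphenated words were stored as keywords, which points at the indexing side rather than the search side.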

What version of PhpDig are you running, and what kinds of search results do you get?

Note that when processing search requests, PhpDig displays the DISPLAY_SNIPPETS_NUM number of snippets, so if you are searching on several words, as soon as PhpDig hits DISPLAY_SNIPPETS_NUM, it quits looking for things to highlight.

Also, if you set DISPLAY_SNIPPETS to false and DISPLAY_SUMMARY to true, PhpDig will not consider DISPLAY_SNIPPETS_NUM and just display the first words of a page, highlighting only if the search words are within the first words of a page.

FaberFedor 02-04-2005 06:18 AM

Quote:
When you removed the dash from the class of characters, you essentially replaced the dash in a word with a space, so if you search on foo-bar, PhpDig will then search on foo and/or bar, not the whole word foo-bar.
/Quote

Okay, that explains why the results highlight "foo bar" and not "foo-bar". There is no "foo-bar" in the database tables.

So how do I get phpDig to index "foo-bar"?

I'm running phpDig 1.8.7.

Charter 02-05-2005 01:39 AM

PhpDig v.1.8.7 should index foo-bar as a word, assuming that the dash is a literal dash and the word foo-bar isn't caught up in some JavaScript. Also, if the hyphened word is longer than MAX_WORDS_SIZE, then it won't get inserted into the database table as a keyword. Try making a demo page with some hyphened words, and after you index it, see if you can search and find the hyphened words.
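
The length rule can be pictured with a short Python sketch. This is not PhpDig code: keywords_to_insert is a hypothetical helper, and the value given to MAX_WORDS_SIZE is a placeholder rather than the configured value in any particular install:

```python
MAX_WORDS_SIZE = 30  # placeholder; check your own config for the real value


def keywords_to_insert(words, max_size=MAX_WORDS_SIZE):
    """Keep only the words short enough to be stored as keywords."""
    return [word for word in words if len(word) <= max_size]


print(keywords_to_insert(["foo-bar", "foo", "bar"], max_size=5))  # ['foo', 'bar']
print(keywords_to_insert(["foo-bar", "foo", "bar"], max_size=7))  # ['foo-bar', 'foo', 'bar']
```

A hyphenated word counts the dash and both halves toward its length, so it can be dropped even when each half on its own would be kept.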

FaberFedor 02-08-2005 03:31 PM

Quote:

Originally Posted by Charter
Try making a demo page with some hyphened words, and after you index it, see if you can search and find the hyphened words.

Done. Go to http://www.linuxnj.com and click on the "Ignore me" link. That will take you to the page with "Omni-Kuff" in it.

Go to http://www.linuxnj.com/search/search.php and search for "omni-kuff". No go. "omni","kuff" and "omni kuff" will work fine.

I'm going to start whittling the page in question to see if it's something in the page...

Charter 02-08-2005 04:01 PM

Okay, thanks, I see what you mean. I was probably testing using a modified version by mistake. Anyway, in PhpDig v.1.8.7, find the phpdigCleanHtml function in robot_functions.php, look for the following line, and try removing the dash in the character class.
Code:

$text = eregi_replace("[*{}()\"\r\n\t-]+"," ",$text);
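
Why that one dash matters can be shown with a minimal Python sketch (again not PHP, since eregi_replace is gone from current PHP). The helper name clean_html_text and the sample text are made up for illustration; only the two character classes come from the line above:

```python
import re

# A dash placed last in a character class is a literal dash, so the first
# pattern also treats "-" as a character to be replaced by a space.
WITH_DASH = r'[*{}()"\r\n\t-]+'
WITHOUT_DASH = r'[*{}()"\r\n\t]+'


def clean_html_text(text, pattern):
    """Replace every run of characters in the class with a single space."""
    return re.sub(pattern, " ", text)


sample = "Omni-Kuff\r\ndemo page"
print(clean_html_text(sample, WITH_DASH))     # Omni Kuff demo page
print(clean_html_text(sample, WITHOUT_DASH))  # Omni-Kuff demo page
```

With the dash in the class, the word is split before it ever reaches the keyword table, so no later search setting can bring "Omni-Kuff" back as a single keyword.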

FaberFedor 02-08-2005 04:41 PM

Perfect!

Thanks loads!

Now let's see if I fixed the cron problem and everybody will be happy! :-)

djavet 02-22-2005 05:09 AM

Hello

Same problem, but not resolved.
I'm on 1.8.7 too.

Search for "0-26-110318-0" (an ISBN number):
http://www.john-howe.com/search/search.php?template_demo=phpdig.html&result_page=search.php...


The indexed page:
http://www.john-howe.com/portfolio/g...hp?image_id=76
The ISBN number is under the picture.

I can find it, but it's not displayed with the hyphens...
How can I correct the displayed results?

Regards, Dom

PS: I dropped the DB and reindexed the site to be sure, but I don't see that it had anything to do with the hyphen case...

Charter 02-22-2005 04:38 PM

http://www.phpdig.net/forum/showpost...13&postcount=7 :confused:

djavet 02-22-2005 11:01 PM

Hello,

I'm confused too...
In my robot_functions.php, around line 147, I have:
Code:

function phpdigCleanHtml($text) {
//htmlentities
global $spec;

//replace blank characters by spaces
$text = eregi_replace("[*{}()\"\r\n\t]+"," ",$text);
//$text = ereg_replace("[\r\n\t]+"," ",$text); // original

without "-" after "\t"!

and around line 138 in search_function.php:
Code:

$what_query_chars = "[^".$phpdig_words_chars[PHPDIG_ENCODING]." \'.\_~@#$:&\%/;,=]+"; // epure chars \'._~@#$:&%/;,=-

if (eregi($what_query_chars,$query_to_parse)) {
        $query_to_parse = eregi_replace($what_query_chars," ",$query_to_parse);
}

$query_to_parse = ereg_replace('(['.$phpdig_words_chars[PHPDIG_ENCODING].'])[\'.\_~@#$:&\%/;,=-]+($|[[:space:]]$|[[:space:]]['.$phpdig_words_chars[PHPDIG_ENCODING].'])','\1 \2',$query_to_parse);

$query_to_parse = trim(ereg_replace(" +"," ",$query_to_parse)); // no more than 1 blank

And nothing!

What I can't understand is that when I look at the temp file in the "text_content" folder, the text is written without "-":
Code:

...SIBLEY HarperCollinsPublishers
ISBN 0 26 110318 0 September 2, 1994 R****m House Audio: The Two Towers CD...


I can't understand it...
Many thanks for your help and time.

Regards, Dominqiue

Charter 02-22-2005 11:53 PM

This part is correct, no "-" after the "\t".
Quote:

Originally Posted by djavet
In my robot_functions.php, around line 147, I have:
Code:

function phpdigCleanHtml($text) {
//htmlentities
global $spec;

//replace blank characters by spaces
$text = eregi_replace("[*{}()\"\r\n\t]+"," ",$text);
//$text = ereg_replace("[\r\n\t]+"," ",$text); // original

without "-" after "\t"!

This part is not correct, change it back to the original code.
Quote:

Originally Posted by djavet
and around line 138 in search_function.php:
Code:

$what_query_chars = "[^".$phpdig_words_chars[PHPDIG_ENCODING]." \'.\_~@#$:&\%/;,=]+"; // epure chars \'._~@#$:&%/;,=-

if (eregi($what_query_chars,$query_to_parse)) {
        $query_to_parse = eregi_replace($what_query_chars," ",$query_to_parse);
}

$query_to_parse = ereg_replace('(['.$phpdig_words_chars[PHPDIG_ENCODING].'])[\'.\_~@#$:&\%/;,=-]+($|[[:space:]]$|[[:space:]]['.$phpdig_words_chars[PHPDIG_ENCODING].'])','\1 \2',$query_to_parse);

$query_to_parse = trim(ereg_replace(" +"," ",$query_to_parse)); // no more than 1 blank

And nothing!


djavet 02-23-2005 12:10 AM

Hello,

Thanks... but I still have the problem.
I replaced search_functions.php with the original 1.8.7 file
and kept "robot_functions.php" without the "-".

I deleted and reindexed the page, and in my temp file I again get the ISBN code without "-":
Code:

...SIBLEY HarperCollinsPublishers
ISBN 0 26 110318 0 September 2, 199...

Indexed page:
http://www.john-howe.com/portfolio/g...hp?image_id=76

Sorry, but I'm really confused...

Dom

Charter 02-23-2005 12:28 AM


Spidering in progress... [Stop spider]
SITE : http://www.john-howe.com/
Exclude paths :
- ads/
- cgi-bin/
- fataneh/gallery/admin/
- flash/
- forum/
- guestbook/
- linkchecker/
- links/
- links/admin/
- mailinglist/
- news/pm/
- portfolio/gallery/admin/
- search/
- stuff/gallery/admin/
- webmail/
1:http://www.john-howe.com/portfolio/gallery/details.php?image_id=76
(time : 00:00:13)
No link in temporary table
links found : 1
http://www.john-howe.com/portfolio/gallery/details.php?image_id=76
Optimizing tables...
Indexing complete ! [Back] to admin interface.


Results 1-1, 1 total, on "ISBN" (0.05 seconds)

1. [100.00 %] :// John Howe :: Illustrator [ Portfolio ] / From Hobbiton to Mordor / Gandalf Before the Walls of Minas Tirith
limit to http://www.john-howe.com/, this path : portfolio/gallery/

...994 The Map of Tolkien's Middle-Earth Brian SIBLEY HarperCollinsPublishers ISBN - 0-26-110318-0 September 2, 1994 R****m House Audio: The Two Towers -...


Results 1-1, 1 total, on "0-26-110318-0" (0.02 seconds)

1. [100.00 %] :// John Howe :: Illustrator [ Portfolio ] / From Hobbiton to Mordor / Gandalf Before the Walls of Minas Tirith
limit to http://www.john-howe.com/, this path : portfolio/gallery/

... Map of Tolkien's Middle-Earth Brian SIBLEY HarperCollinsPublishers ISBN - 0-26-110318-0 September 2, 1994 R****m House Audio: The Two Towers - CD fro...

The only thing changed was:
Code:

//replace foo characters by space
$text = eregi_replace("[*{}()\"\r\n\t-]+"," ",$text);

to the following:
Code:

//replace foo characters by space
$text = eregi_replace("[*{}()\"\r\n\t]+"," ",$text);

Put back the original robot_functions.php and make sure that you take out only the "-". Then delete the page, run the cleans, and reindex.

djavet 02-23-2005 12:46 AM

:bang: I dropped the database, the folder, everything, and made a fresh install with only the change in robot_functions.php and... nothing.
Still the same damn thing!

Your version is 1.8.8 rc1, no? My version is 1.8.7, maybe that's the point...
I don't know. Am I the only one with this problem on my version?
I can't upgrade to 1.8.8 rc1 due to my host's DB version...

A bug in 1.8.7?

Regards, Dom
PS: I'm really sorry to bother you with that.

