PhpDig.net

01-28-2005, 02:12 PM   #1
FaberFedor
Green Mole
 
Join Date: Jan 2005
Location: New Jersey
Posts: 11
Indexing "<word>-<word>"?

I haven't found this in the docs or the FAQs (or anywhere else for that matter) so I'm asking here.

How do I get PhpDig to index two (or more) words joined by a hyphen as one search item (as opposed to two search items)?

For example: the web page contains "foo-bar". After indexing, I can search for "foo", "bar", "foo bar" but NOT "foo-bar". I'd like to be able to search for "foo-bar" as well.

Suggestions?

01-31-2005, 08:30 AM   #2
FaberFedor
Here's what I've found so far:

According to the docs, dashes (and many other special characters) are allowed in indexes and searches since v1.8. Yet, in phpdig_functions.php there is a function called phpdigEpureText() that seems to be removing the special characters that the docs say are allowed.

Ho, ho! There is also an entry in search_function.php that removes various characters from the search functionality! If you also remove the dash from $what_query_chars in this file and reindex, you can now search for words with dashes in them!

At least it worked for me.

02-02-2005, 07:42 PM   #3
Charter
Head Mole
Join Date: May 2003
Posts: 2,539
The $what_query_chars variable negates a class of characters; the same goes for the phpdigEpureText function:
Code:
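// search_function.php: every character outside the allowed class becomes a space
$what_query_chars = "[^".$phpdig_words_chars[PHPDIG_ENCODING]." \'.\_~@#$:&\%/;,=-]+";

if (eregi($what_query_chars,$query_to_parse)) {
	$query_to_parse = eregi_replace($what_query_chars," ",$query_to_parse);
}

// phpdigEpureText(): the same negated class applied to the text being indexed
$text = ereg_replace('[^'.$phpdig_words_chars[$encoding].' \'._~@#$:&%/;,=-]+',' ',$text);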
When you removed the dash from the class of characters, you essentially replaced the dash in a word with a space, so if you search on foo-bar, PhpDig will then search on foo and/or bar, not the whole word foo-bar.
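
A minimal sketch of that effect, not PhpDig's actual code. The ereg functions were removed in PHP 7, so this uses preg_replace instead. $words_chars is a hypothetical stand-in for $phpdig_words_chars[PHPDIG_ENCODING], and cleanQuery is a made-up helper name.

```php
<?php
// Hypothetical stand-in for $phpdig_words_chars[PHPDIG_ENCODING];
// the real value depends on the configured encoding.
$words_chars = 'a-z0-9';

// Punctuation listed in the class quoted above, without the dash.
$punct = ' \'._~@#$:&%/;,=';

// Negated classes: anything NOT listed is replaced by a space.
$keep_dash = '![^' . $words_chars . $punct . '-]+!i'; // dash allowed (as shipped)
$drop_dash = '![^' . $words_chars . $punct . ']+!i';  // dash removed from the class

// Made-up helper: apply one of the patterns to a search query.
function cleanQuery(string $pattern, string $query): string
{
    return preg_replace($pattern, ' ', $query);
}

echo cleanQuery($keep_dash, 'foo-bar'), "\n"; // foo-bar
echo cleanQuery($drop_dash, 'foo-bar'), "\n"; // foo bar
```

With the dash taken out of the class, the query itself is split into foo and bar before any lookup happens, which matches the behaviour described above.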

Try searching on t-shirts in the online demo. When PhpDig finds a word containing a dash in the chunk it's trying to process, it will try to highlight it. Also, try running the following query, and then search on some of the resultant words:
Code:
# add your table prefix if needed
SELECT keyword FROM keywords WHERE keyword LIKE '%-%';
What version of PhpDig are you running, and what kinds of search results do you get?

Note that when processing search requests, PhpDig displays up to DISPLAY_SNIPPETS_NUM snippets, so if you are searching on several words, it stops looking for things to highlight as soon as it reaches DISPLAY_SNIPPETS_NUM.

Also, if you set DISPLAY_SNIPPETS to false and DISPLAY_SUMMARY to true, PhpDig will not consider DISPLAY_SNIPPETS_NUM and just display the first words of a page, highlighting only if the search words are within the first words of a page.
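
As a sketch of that second setup, assuming these settings are ordinary PHP constants set with define() in PhpDig's configuration file. The file location and the value 4 are assumptions, not taken from this thread.

```php
<?php
// Assumed to live in PhpDig's configuration file; the values are illustrative.
define('DISPLAY_SNIPPETS', false);  // no highlighted snippets
define('DISPLAY_SUMMARY', true);    // show the first words of the page instead
define('DISPLAY_SNIPPETS_NUM', 4);  // not considered while DISPLAY_SNIPPETS is false
```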
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.

02-04-2005, 06:18 AM   #4
FaberFedor
Quote:
When you removed the dash from the class of characters, you essentially replaced the dash in a word with a space, so if you search on foo-bar, PhpDig will then search on foo and/or bar, not the whole word foo-bar.
/Quote

Okay, that explains why the results highlight "foo bar" and not "foo-bar". There is no "foo-bar" in the database tables.

So how do I get phpDig to index "foo-bar"?

I'm running phpDig 1.8.7.

02-05-2005, 01:39 AM   #5
Charter
PhpDig v.1.8.7 should index foo-bar as a word, assuming that the dash is a literal dash and the word foo-bar isn't caught up in some JavaScript. Also, if the hyphened word is longer than MAX_WORDS_SIZE, then it won't get inserted into the database table as a keyword. Try making a demo page with some hyphened words, and after you index it, see if you can search and find the hyphened words.
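
The length condition can be pictured with a small sketch. This is not PhpDig's code: the function name isIndexableWord and the limit of 30 are hypothetical. Only the idea that words longer than MAX_WORDS_SIZE are skipped comes from the post above.

```php
<?php
// Hypothetical limit; check the MAX_WORDS_SIZE value in your own configuration.
const DEMO_MAX_WORDS_SIZE = 30;

// Returns true when a word is short enough to be stored as a keyword.
function isIndexableWord(string $word, int $maxSize = DEMO_MAX_WORDS_SIZE): bool
{
    return strlen($word) <= $maxSize;
}

var_dump(isIndexableWord('foo-bar'));                      // bool(true)
var_dump(isIndexableWord(str_repeat('foo-', 10) . 'bar')); // bool(false), 43 characters
```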

02-08-2005, 03:31 PM   #6
FaberFedor
Quote:
Originally Posted by Charter
Try making a demo page with some hyphened words, and after you index it, see if you can search and find the hyphened words.
Done. Go to http://www.linuxnj.com and click on the "Ignore me" link. That will take you to the page with "Omni-Kuff" in it.

Go to http://www.linuxnj.com/search/search.php and search for "omni-kuff". No go. "omni", "kuff" and "omni kuff" work fine.

I'm going to start whittling the page in question to see if it's something in the page...

02-08-2005, 04:01 PM   #7
Charter
Okay, thanks, I see what you mean. I was probably testing using a modified version by mistake. Anyway, in PhpDig v.1.8.7, find the phpdigCleanHtml function in robot_functions.php, look for the following line, and try removing the dash in the character class.
Code:
$text = eregi_replace("[*{}()\"\r\n\t-]+"," ",$text);
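
To see why that dash matters, here is a small sketch with preg_replace standing in for eregi_replace (the ereg family was removed in PHP 7). The sample sentence is made up. The two character classes are the one from the line above and the same one with the dash removed.

```php
<?php
$text = 'The Omni-Kuff page';

// Class as shipped: the trailing dash makes "-" one of the characters to blank out.
$with_dash = '/[*{}()"\r\n\t-]+/';
// Same class with the dash removed: hyphens now survive the cleaning step.
$without_dash = '/[*{}()"\r\n\t]+/';

echo preg_replace($with_dash, ' ', $text), "\n";    // The Omni Kuff page
echo preg_replace($without_dash, ' ', $text), "\n"; // The Omni-Kuff page
```

Because this cleaning runs on the page text before keywords are extracted, pages need to be reindexed after the change for hyphenated words to reach the keyword table.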

02-08-2005, 04:41 PM   #8
FaberFedor
Perfect!

Thanks loads!

Now let's see if I fixed the cron problem and everybody will be happy! :-)

02-22-2005, 05:09 AM   #9
djavet
Orange Mole
 
Join Date: Jan 2005
Posts: 31
Hello,

Same problem, but not resolved.
I'm also on 1.8.7.

Search for "0-26-110318-0" (an ISBN):
http://www.john-howe.com/search/search.php?template_demo=phpdig.html&result_page=search.php...

The indexed page:
http://www.john-howe.com/portfolio/g...hp?image_id=76
The ISBN is under the picture.

I can find it, but it isn't displayed with the hyphens...
How can I correct the displayed results?

Regards, Dom

PS: I dropped the DB and reindexed the site to be sure, but I don't see that it had anything to do with the hyphen case...

02-22-2005, 04:38 PM   #10
Charter
http://www.phpdig.net/forum/showpost...13&postcount=7

02-22-2005, 11:01 PM   #11
djavet
Hello,

I'm confused too...
In my robot_functions.php, around line 147, I have:
Code:
function phpdigCleanHtml($text) {
//htmlentities
global $spec;

//replace blank characters by spaces
$text = eregi_replace("[*{}()\"\r\n\t]+"," ",$text);
//$text = ereg_replace("[\r\n\t]+"," ",$text); // original
without "-" after "\t"!

and around line 138 in search_function.php:
Code:
$what_query_chars = "[^".$phpdig_words_chars[PHPDIG_ENCODING]." \'.\_~@#$:&\%/;,=]+"; // epure chars \'._~@#$:&%/;,=-

if (eregi($what_query_chars,$query_to_parse)) {
	$query_to_parse = eregi_replace($what_query_chars," ",$query_to_parse);
}

$query_to_parse = ereg_replace('(['.$phpdig_words_chars[PHPDIG_ENCODING].'])[\'.\_~@#$:&\%/;,=-]+($|[[:space:]]$|[[:space:]]['.$phpdig_words_chars[PHPDIG_ENCODING].'])','\1 \2',$query_to_parse);

$query_to_parse = trim(ereg_replace(" +"," ",$query_to_parse)); // no more than 1 blank
And nothing!

What I can't understand is that when I look at the temp file in the "text_content" folder, it is also written without "-":
Code:
...SIBLEY HarperCollinsPublishers
ISBN 0 26 110318 0 September 2, 1994 R****m House Audio: The Two Towers CD...

I can't understand it...
Thanks a lot for your help and time.

Regards, Dominique

02-22-2005, 11:53 PM   #12
Charter
This part is correct, no "-" after the "\t".
Quote:
Originally Posted by djavet
In my robot_functions.php, around line 147, I have:
Code:
function phpdigCleanHtml($text) {
//htmlentities
global $spec;

//replace blank characters by spaces
$text = eregi_replace("[*{}()\"\r\n\t]+"," ",$text);
//$text = ereg_replace("[\r\n\t]+"," ",$text); // original
without "-" after "\t"!
This part is not correct, change it back to the original code.
Quote:
Originally Posted by djavet
and around line 138 in search_function.php:
Code:
$what_query_chars = "[^".$phpdig_words_chars[PHPDIG_ENCODING]." \'.\_~@#$:&\%/;,=]+"; // epure chars \'._~@#$:&%/;,=-

if (eregi($what_query_chars,$query_to_parse)) {
	$query_to_parse = eregi_replace($what_query_chars," ",$query_to_parse);
}

$query_to_parse = ereg_replace('(['.$phpdig_words_chars[PHPDIG_ENCODING].'])[\'.\_~@#$:&\%/;,=-]+($|[[:space:]]$|[[:space:]]['.$phpdig_words_chars[PHPDIG_ENCODING].'])','\1 \2',$query_to_parse);

$query_to_parse = trim(ereg_replace(" +"," ",$query_to_parse)); // no more than 1 blank
And nothing!

02-23-2005, 12:10 AM   #13
djavet
Hello,

Thanks... but I still have the problem.
I replaced search_function.php with the original 1.8.7 file.
And kept "robot_functions.php" without the "-".

I deleted and reindexed the page, and in my temp file I again get the ISBN without "-":
Code:
...SIBLEY HarperCollinsPublishers
ISBN 0 26 110318 0 September 2, 199...
Indexed page:
http://www.john-howe.com/portfolio/g...hp?image_id=76

Sorry, but I'm really confused...

Dom

02-23-2005, 12:28 AM   #14
Charter

Spidering in progress... [Stop spider]
SITE : http://www.john-howe.com/
Exclude paths :
- ads/
- cgi-bin/
- fataneh/gallery/admin/
- flash/
- forum/
- guestbook/
- linkchecker/
- links/
- links/admin/
- mailinglist/
- news/pm/
- portfolio/gallery/admin/
- search/
- stuff/gallery/admin/
- webmail/
1:http://www.john-howe.com/portfolio/gallery/details.php?image_id=76
(time : 00:00:13)
No link in temporary table
links found : 1
http://www.john-howe.com/portfolio/gallery/details.php?image_id=76
Optimizing tables...
Indexing complete ! [Back] to admin interface.


Results 1-1, 1 total, on "ISBN" (0.05 seconds)

1. [100.00 %] :// John Howe :: Illustrator [ Portfolio ] / From Hobbiton to Mordor / Gandalf Before the Walls of Minas Tirith
limit to http://www.john-howe.com/, this path : portfolio/gallery/

...994 The Map of Tolkien's Middle-Earth Brian SIBLEY HarperCollinsPublishers ISBN - 0-26-110318-0 September 2, 1994 R****m House Audio: The Two Towers -...


Results 1-1, 1 total, on "0-26-110318-0" (0.02 seconds)

1. [100.00 %] :// John Howe :: Illustrator [ Portfolio ] / From Hobbiton to Mordor / Gandalf Before the Walls of Minas Tirith
limit to http://www.john-howe.com/, this path : portfolio/gallery/

... Map of Tolkien's Middle-Earth Brian SIBLEY HarperCollinsPublishers ISBN - 0-26-110318-0 September 2, 1994 R****m House Audio: The Two Towers - CD fro...

The only thing changed was:
Code:
//replace foo characters by space
$text = eregi_replace("[*{}()\"\r\n\t-]+"," ",$text);
to the following:
Code:
//replace foo characters by space
$text = eregi_replace("[*{}()\"\r\n\t]+"," ",$text);
Put back the original robot_functions.php, make sure that you take out only the "-", then delete the page, run the cleans, and reindex.

02-23-2005, 12:46 AM   #15
djavet
I dropped the database, the folder, everything, and made a fresh install with only the change in robot_functions.php and... nothing.
Still the same damn result!

Your version is 1.8.8 rc1, no? Mine is 1.8.7; maybe that's the point...
I don't know. Am I the only one with this problem on my version?
I can't upgrade to 1.8.8 rc1 due to my host DB version...

A bug in 1.8.7?

Regards, Dom
PS: I'm really sorry to bother you with that.

All times are GMT -8.

Copyright © 2001 - 2005, ThinkDing LLC. All Rights Reserved.