PhpDig.net

04-19-2004, 09:10 PM   #1
jerrywin5
Orange Mole
 
Join Date: Mar 2004
Posts: 48
Reduce duplicates in keywords table through more intelligent indexing

When words are indexed, punctuation such as , . : ; ’ ’s and ? should be dropped from the end of the word. In addition, words separated by / or - should be indexed as separate words rather than as one word. This would reduce the number of duplicates in the keywords table and allow the spider to match indexed words against the common words list much more accurately. Depending on the type of search the user employs, search results would be more accurate as well.

When words are indexed, any punctuation following a word without a space in between is treated as part of the word. Therefore, the keywords table in the database is filled with many duplicates that are just variations of the same word. Examples:
following
following,
following:
following;
following.
following?
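
To illustrate the request, here is a minimal sketch of how trailing punctuation could be stripped from a single token before it reaches the keywords table. The function name strip_trailing_punctuation is hypothetical and not part of PhpDig, and preg_replace is used here rather than the ereg functions PhpDig relies on.

```php
<?php
// Hypothetical helper, not part of PhpDig: removes a trailing possessive
// ('s or ’s) and any run of , . : ; ? ' ’ from the end of one token.
function strip_trailing_punctuation(string $word): string
{
    // \u{2019} is the typographic apostrophe; the /u flag makes the
    // pattern treat the token as UTF-8.
    return preg_replace("/(?:['\u{2019}]s)?[,.:;?'\u{2019}]*$/u", '', $word);
}

$words = [
    'following', 'following,', 'following:', 'following;',
    'following.', 'following?', "bells\u{2019}", "bell\u{2019}s",
];

// Normalize every token, then keep one copy of each result.
$unique = array_values(array_unique(array_map('strip_trailing_punctuation', $words)));
// $unique is ['following', 'bells', 'bell']
```

Note that this sketch keeps bells and bell as separate keywords, since it only removes punctuation and the possessive marker, not a plain trailing s.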

Other kinds of duplicates arise as well.

Words separated with a / to indicate an option, such as and/or and boy/girl, are indexed as a single word.

Words that end with an apostrophe also create duplicates. Example:
bells
bells’

Words that include an apostrophe cause duplicates as well. Example:
bell
bell’s

Unfortunately, treating words that differ only by an s on the end as the same word could lead to indexing errors. Therefore, a certain number of duplicates will remain.

Words separated with a - also create duplicates. Examples:
blackberry
like
blackberry-like
bright
pink
bright-pink
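
A rough sketch of the splitting behaviour described above, so that compounds joined by a slash or hyphen contribute only their parts to the keyword list. The helper split_compound is hypothetical and not part of PhpDig; it also treats the en dash as a separator, which is an assumption.

```php
<?php
// Hypothetical helper, not part of PhpDig: splits one token into its
// parts on slash, hyphen, or en dash (U+2013).
function split_compound(string $token): array
{
    $parts = preg_split('~[/\x{2013}-]+~u', $token, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on error (for example invalid UTF-8).
    return $parts === false ? [] : $parts;
}

$tokens = ['and/or', 'boy/girl', 'blackberry-like', 'bright-pink'];

$keywords = [];
foreach ($tokens as $token) {
    foreach (split_compound($token) as $part) {
        $keywords[$part] = true; // array keys de-duplicate the parts
    }
}
$keywords = array_keys($keywords);
// $keywords is ['and', 'or', 'boy', 'girl', 'blackberry', 'like', 'bright', 'pink']
```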

It would also be helpful if regular expressions were supported in the common_words.txt file. This would let you, for example, allow phone numbers and dates but no other numbers, or exclude all numbers. There is no need to index numbers given for dimensions, mathematical equations, or chart data. They just bog down the keywords table with useless data and slow down searches.

The result should be a cleaner keywords table and faster, more accurate search results.
04-20-2004, 08:06 AM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. The punctuation is kept on the word so that exact matches work, but perhaps this exact match is too exact.

To relax the exact match and drop a lot, but not all, of the punctuation from the end of a word, do the following.

In phpdig_functions.php find:
Code:
$text = ereg_replace('[^'.$phpdig_words_chars[$encoding].' \'._~@#$:&%/;,=-]+',' ',$text);
and afterwards add:
Code:
// drop punctuation that trails a word character at the end of a word
$text = ereg_replace('(['.$phpdig_words_chars[$encoding].'])[\'._~@#$:&%/;,=-]+($|[[:space:]]$|[[:space:]]['.$phpdig_words_chars[$encoding].'])','\1\2',$text);
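
The ereg functions were deprecated in PHP 5.3 and removed in PHP 7.0, so on a newer PHP the same idea has to be written with preg_replace. The following is only a sketch of an equivalent pattern: the value 'a-z0-9' is an assumed stand-in for $phpdig_words_chars[$encoding], whose real contents depend on the PhpDig configuration.

```php
<?php
// Assumption: 'a-z0-9' stands in for $phpdig_words_chars[$encoding].
$word_chars = 'a-z0-9';

// Same punctuation set as the ereg_replace() call, escaped for PCRE.
// '!' is the delimiter because it does not occur in the set.
$punct = preg_quote('\'._~@#$:&%/;,=-', '!');

// word char, then a punctuation run, then one of:
// end of text | whitespace at end of text | whitespace + word char
$pattern = '!([' . $word_chars . '])[' . $punct . ']+($|\s$|\s[' . $word_chars . '])!';

$text = 'following, following: following; following.';
$text = preg_replace($pattern, '$1$2', $text);
// $text is now 'following following following following'

// Punctuation inside a word is left alone, as in the original pattern.
$inside = preg_replace($pattern, '$1$2', 'and/or');
// $inside is still 'and/or'
```

As with the ereg version, this drops punctuation only when it trails a word, which is why and/or passes through unchanged.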
In search_function.php find:
Code:
if (eregi($what_query_chars,$query_to_parse)) {
	$query_to_parse = eregi_replace($what_query_chars," ",$query_to_parse);
}
and afterwards add:
Code:
// apply the same trailing-punctuation rule to the search query
$query_to_parse = ereg_replace('(['.$phpdig_words_chars[PHPDIG_ENCODING].'])[\'.\_~@#$:&\%/;,=-]+($|[[:space:]]$|[[:space:]]['.$phpdig_words_chars[PHPDIG_ENCODING].'])','\1\2',$query_to_parse);
In search_function.php find:
Code:
if ($option == "exact") { // there are two instances of this
In both of these if statements, find:
Code:
$reg_strings = str_replace('@#@',' ',phpdigPregQuotes(str_replace('\\','',implode('@#@', $query_for_phrase_array))));
and replace with:
Code:
$reg_strings = str_replace('@#@','.* ',phpdigPregQuotes(str_replace('\\','',implode('@#@', $query_for_phrase_array))));
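
To see what changing the separator from ' ' to '.* ' does, here is a small stand-alone sketch. It does not use PhpDig's phpdigPregQuotes; preg_quote is substituted for it as an assumption, and $query_words is a made-up stand-in for $query_for_phrase_array.

```php
<?php
// Made-up stand-in for $query_for_phrase_array.
$query_words = ['following', 'words'];

// Assumption: preg_quote() plays the role of phpdigPregQuotes().
$quoted = array_map(function ($word) {
    return preg_quote($word, '/');
}, $query_words);

$strict  = '/' . implode(' ', $quoted) . '/i';   // words joined by one space
$relaxed = '/' . implode('.* ', $quoted) . '/i'; // anything, then a space

// Page text where punctuation sits between the two words.
$page = 'see the following, words in the text';

$strict_hit  = preg_match($strict, $page);  // 0: the comma blocks the match
$relaxed_hit = preg_match($relaxed, $page); // 1: '.*' absorbs the comma
```

The trade-off is that '.*' absorbs any text, not just punctuation, so the relaxed pattern also matches when other words sit between the two query words.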
For breaking on a / or - character, see this thread. To exclude certain numbers, add a regex to the following line in robot_functions.php:
Code:
if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^['.$phpdig_words_chars[PHPDIG_ENCODING].'#$]',$key))
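
One possible shape for such a regex test is sketched below. The helper is_unwanted_number is hypothetical, and the date and phone formats it keeps are assumptions chosen for illustration; adjust them to the numbers that actually matter on the indexed site.

```php
<?php
// Hypothetical helper, not part of PhpDig: true when $key looks like a
// number that should not be indexed.
function is_unwanted_number(string $key): bool
{
    // Keys that are not made up of digits and number punctuation are kept.
    if (!preg_match('/^\d[\d.,\/-]*$/', $key)) {
        return false;
    }
    // Assumed formats worth keeping even though they are numeric.
    $keep = [
        '/^\d{1,2}[\/-]\d{1,2}[\/-]\d{2,4}$/', // dates like 04-19-2004
        '/^\d{4}[\/-]\d{1,2}[\/-]\d{1,2}$/',   // dates like 2004-04-19
        '/^\d{3}-\d{3}-\d{4}$/',               // phone numbers like 555-123-4567
    ];
    foreach ($keep as $pattern) {
        if (preg_match($pattern, $key)) {
            return false;
        }
    }
    return true;
}

$keys = ['following', '42', '3.14', '04-19-2004', '555-123-4567'];
$kept = array_values(array_filter($keys, function ($key) {
    return !is_unwanted_number($key);
}));
// $kept is ['following', '04-19-2004', '555-123-4567']
```

With a helper like this defined, the condition above could gain one more clause, and !is_unwanted_number($key), assuming the keys still carry their hyphens and slashes at that point in the indexing code.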
For the changes to take effect, a new index would need to be built. If any of the code above shows up wrapped across lines, join it back into single statements before using it.