Reduce duplicates in keywords table through more intelligent indexing

jerrywin5

04-19-2004, 09:10 PM

When words are indexed, punctuation such as , . : ; ’ ’s and ? should be dropped from the end of the word. In addition, words separated with / and - should be indexed as separate words rather than as one word. This will reduce the number of duplicates in the keywords table in the database and allow the spider to match the words being indexed against the common words list much more accurately. Depending upon the type of search the user employs, search results will be more accurate as well.

Currently, when words are indexed, any punctuation following a word without a space in between is treated as part of the word. Therefore, the keywords table in the database is filled with many duplicates that are just variations of the same word. Examples:
following
following,
following:
following;
following.
following?
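
To make the idea concrete, here is a minimal sketch of the trailing-punctuation cleanup applied to the examples above. The function name strip_trailing_punct is hypothetical and not part of PhpDig, and the sketch uses preg_replace rather than whatever the spider does internally.

```php
<?php
// Hypothetical helper, not part of PhpDig: drop a run of trailing
// punctuation ( , . : ; ? ) from the end of a single token.
function strip_trailing_punct(string $word): string
{
    return preg_replace('/[,.:;?]+$/', '', $word);
}

$tokens = ['following', 'following,', 'following:', 'following;', 'following.', 'following?'];

// After cleanup, all six variants collapse into one keyword.
$unique = array_values(array_unique(array_map('strip_trailing_punct', $tokens)));
print_r($unique);
```

With this cleanup the six rows in the keywords table would become a single row.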

Other duplicates are created for other reasons.

Words separated with a / to indicate an option such as and/or and boy/girl are indexed as a single word.

Words that end with an apostrophe (’) also create duplicates. Example:
bells
bells’

Also, words that include an apostrophe cause duplicates. Example:
bell
bell’s

Unfortunately, treating words that differ only by an s on the end as the same word could lead to indexing errors. Therefore, a certain number of duplicates will always exist.

Words separated with a - also create duplicates. Examples:
blackberry
like
blackberry-like
bright
pink
bright-pink
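
A rough sketch of the splitting I have in mind, assuming it runs on one token at a time. The function name split_compound is hypothetical and not part of PhpDig. It treats both the plain hyphen and the en dash as separators, since word processors often swap one for the other.

```php
<?php
// Hypothetical helper, not part of PhpDig: split one token on "/" and on
// a hyphen or en dash (U+2013), dropping any empty pieces.
function split_compound(string $token): array
{
    return preg_split('/[\/\-\x{2013}]+/u', $token, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(split_compound('and/or'));
print_r(split_compound('blackberry-like'));
print_r(split_compound('bright-pink'));
```

Each part would then be indexed under its existing keyword instead of adding a new combined entry.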

It would also be helpful if regular expressions were supported in the common_words.txt file. This would let you, for example, allow phone numbers and dates but no other numbers, or exclude all numbers. There is no need to index numbers that give dimensions, mathematical equations, or chart info. They just bog down the keywords table with useless data and slow down searches.
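
As an illustration only, this is roughly how such patterns could behave. Nothing here exists in PhpDig: the keep_token function, the pattern lists, and the phone and date formats are all assumptions made up for the sketch.

```php
<?php
// Hypothetical patterns for illustration; common_words.txt does not
// support anything like this today.
$allow_patterns = [
    '/^\d{3}-\d{3}-\d{4}$/',   // assumed phone format, e.g. 555-123-4567
    '/^\d{2}-\d{2}-\d{4}$/',   // assumed date format, e.g. 04-19-2004
];

// Anything made up only of digits and numeric punctuation.
$number_like = '/^[0-9.,\/-]+$/';

// Returns true when the token should be kept in the index.
function keep_token(string $token, array $allow_patterns, string $number_like): bool
{
    foreach ($allow_patterns as $pattern) {
        if (preg_match($pattern, $token) === 1) {
            return true;    // explicitly allowed number format
        }
    }
    // Otherwise keep the token only if it is not number-like.
    return preg_match($number_like, $token) !== 1;
}

var_dump(keep_token('04-19-2004', $allow_patterns, $number_like));
var_dump(keep_token('3.14', $allow_patterns, $number_like));
```

Checking the allowed patterns first means a date or phone number survives even though it would otherwise match the number-like exclusion.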

The result should be a cleaner keywords table and faster, more accurate search results.

Charter

04-20-2004, 08:06 AM

Hi. The punctuation is on there for exact matches, but perhaps this exact match is too exact.

To relax the exact match and drop a lot, but not all, of the punctuation from the end of a word, do the following.

In phpdig_functions.php find:

$text = ereg_replace('[^'.$phpdig_words_chars[$encoding].' \'._~@#$:&%/;,=-]+',' ',$text);

and afterwards add:

$text = ereg_replace('(['.$phpdig_words_chars[$encoding].'])[\'._~@#$:&%/;,=-]+($|[[:space:]]$|[[:space:]]['.$phpdig_words_chars[$encoding].'])','\1\2',$text);
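
To see what the added line does, here is a small standalone sketch. It is a translation to preg_replace, since the ereg functions were removed in PHP 7, and it assumes a plain 'a-z0-9' character list in place of $phpdig_words_chars[$encoding]. The function name relax_punctuation is made up for the example.

```php
<?php
// Assumption: a plain ASCII stand-in for $phpdig_words_chars[$encoding].
$chars = 'a-z0-9';

// PCRE version of the ereg pattern: one word character, a run of trailing
// punctuation, then either end of text, whitespace at end of text, or
// whitespace followed by the next word character.
$pattern = '!([' . $chars . '])[\'._~@#$:&%/;,=-]+'
         . '($|[[:space:]]$|[[:space:]][' . $chars . '])!';

// Hypothetical wrapper: keep the word character and what follows,
// drop the punctuation in between.
function relax_punctuation(string $text, string $pattern): string
{
    return preg_replace($pattern, '\1\2', $text);
}

$cleaned = relax_punctuation("following, words following: and bells' ring bell's", $pattern);
echo $cleaned, "\n";
```

Punctuation inside a word, as in bell's, is left alone because it is not followed by whitespace or the end of the text.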

In search_function.php find:

if (eregi($what_query_chars,$query_to_parse)) {
    $query_to_parse = eregi_replace($what_query_chars," ",$query_to_parse);
}

and afterwards add:

$query_to_parse = ereg_replace('(['.$phpdig_words_chars[PHPDIG_ENCODING].'])[\'._~@#$:&%/;,=-]+($|[[:space:]]$|[[:space:]]['.$phpdig_words_chars[PHPDIG_ENCODING].'])','\1\2',$query_to_parse);

In search_function.php find:

if ($option == "exact") { // there are two instances of this

In both of these two if statements find:

$reg_strings = str_replace('@#@',' ',phpdigPregQuotes(str_replace('\\','',implode('@#@', $query_for_phrase_array))));

and replace with:

$reg_strings = str_replace('@#@','.* ',phpdigPregQuotes(str_replace('\\','',implode('@#@', $query_for_phrase_array))));
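
The effect of swapping ' ' for '.* ' can be seen in a small sketch. It uses PHP's preg_quote as a stand-in for phpdigPregQuotes and quotes each word separately instead of going through the '@#@' marker, so it only approximates the real code. The phrase_regex function is hypothetical.

```php
<?php
// Hypothetical helper: build a phrase regex from query words, joined
// by the given glue. preg_quote() stands in for phpdigPregQuotes().
function phrase_regex(array $words, string $glue): string
{
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words);
    return '/' . implode($glue, $quoted) . '/';
}

$query_for_phrase_array = ['bright', 'pink'];
$strict  = phrase_regex($query_for_phrase_array, ' ');    // words separated by one space
$relaxed = phrase_regex($query_for_phrase_array, '.* ');  // anything, then a space

// Page text that still has punctuation between the two words.
$text = 'a bright, pink flower';
var_dump(preg_match($strict, $text));
var_dump(preg_match($relaxed, $text));
```

The strict form misses the text because of the comma, while the relaxed form still matches, which seems to be the point of the change once punctuation is dropped from the query.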

For breaking on a / or - character, see this (http://www.phpdig.net/showthread.php?threadid=200) thread. To exclude certain numbers, add a regex to the following line in robot_functions.php:

if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^['.$phpdig_words_chars[PHPDIG_ENCODING].'#$]',$key))
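
As a sketch of what an added regex might look like, the condition is wrapped here in a hypothetical should_index function with one extra test that skips keys made up only of digits. The constant values and the 'a-z0-9' character list are assumptions for the example, and preg_match stands in for ereg.

```php
<?php
// Assumed values for illustration; the real constants come from the
// PhpDig configuration.
define('SMALL_WORDS_SIZE', 2);
define('MAX_WORDS_SIZE', 30);

// Hypothetical wrapper around the indexing condition.
function should_index(string $key, array $common_words): bool
{
    return strlen($key) > SMALL_WORDS_SIZE
        && strlen($key) <= MAX_WORDS_SIZE
        && !isset($common_words[$key])
        && preg_match('/^[a-z0-9#$]/', $key) === 1   // stand-in for the ereg() test
        && preg_match('/^[0-9]+$/', $key) !== 1;     // added: skip all-digit keys
}

var_dump(should_index('following', ['the' => 1]));
var_dump(should_index('12345', ['the' => 1]));
```

A looser or tighter pattern can be substituted in the last test, depending on which numbers should stay out of the keywords table.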

For the changes to take effect, the site needs to be re-indexed. Remember to remove any "word" wrapping from the above code.