04-19-2004, 09:10 PM   #1
jerrywin5
Orange Mole
 
Join Date: Mar 2004
Posts: 48
Reduce duplicates in keywords table through more intelligent indexing

When words are indexed, trailing punctuation such as , . : ; ’ ’s and ? should be dropped from the end of the word. In addition, words joined by / or - should be indexed as separate words rather than as one word. This will reduce the number of duplicates in the keywords table in the database and allow the spider to match words against the common words list much more accurately. Depending on the type of search the user employs, search results will be more accurate as well.

Currently, when words are indexed, any punctuation that follows a word with no space in between is treated as part of the word. As a result, the keywords table in the database fills up with many duplicates that are just variations of the same word. Examples:
following
following,
following:
following;
following.
following?
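
To illustrate the idea, here is a minimal sketch in Python of stripping trailing punctuation before a word reaches the keywords table. This is not the spider's actual code; the function name and the exact punctuation set are my own assumptions.

```python
# Hypothetical helper: the name and the punctuation set are assumptions,
# not part of the existing spider.
TRAILING_PUNCTUATION = ",.:;?"

def strip_trailing_punctuation(word: str) -> str:
    """Remove any run of the listed punctuation marks from the end of a token."""
    return word.rstrip(TRAILING_PUNCTUATION)

variants = ["following", "following,", "following:", "following;", "following.", "following?"]
keywords = {strip_trailing_punctuation(w) for w in variants}
print(keywords)  # {'following'}
```

All six variants collapse into a single keyword, so only one row would be needed instead of six.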

Other duplicates arise for different reasons.

Words separated by a / to indicate an option, such as and/or and boy/girl, are indexed as a single word.

Words that end with an apostrophe (’) also create duplicates. Example:
bells
bells’

Also, words that include an apostrophe cause duplicates. Example:
bell
bell’s
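
As a rough sketch of how the apostrophe cases could be handled, the Python below strips a trailing apostrophe or a trailing apostrophe-plus-s. The pattern and function name are assumptions of mine, and the sketch accepts straight and curly apostrophes since pages use both.

```python
import re

# Assumed pattern: a trailing apostrophe followed by "s", or a bare trailing
# apostrophe. Covers ' (straight) and the two curly single quotes.
POSSESSIVE_SUFFIX = re.compile(r"(?:['\u2018\u2019]s|['\u2018\u2019])$")

def strip_possessive(word: str) -> str:
    """Drop a trailing apostrophe or apostrophe-s; leave everything else alone."""
    return POSSESSIVE_SUFFIX.sub("", word)

print(strip_possessive("bells\u2019"))   # bells
print(strip_possessive("bell\u2019s"))   # bell
print(strip_possessive("don't"))         # don't
```

Note that this deliberately leaves bell and bells as two separate keywords, and that a contraction such as it's would be reduced to it, which may or may not be what you want.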

Unfortunately, treating words that differ only by an s on the end as the same word could lead to indexing errors. Therefore, a certain number of duplicates (such as bell and bells) will remain.

Words joined with a - also create duplicates. Examples:
Blackberry
like
blackberry-like
bright
pink
bright-pink
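
Splitting on / and - could look something like the Python sketch below. The separator set and the function name are assumptions on my part; I included the en dash as well because some pages use it in place of a plain hyphen.

```python
import re

# Assumed separators: slash, hyphen, and en dash (U+2013).
SEPARATORS = re.compile(r"[/\-\u2013]")

def split_compound(token: str) -> list[str]:
    """Split a token on the separators and discard empty pieces."""
    return [part for part in SEPARATORS.split(token) if part]

print(split_compound("and/or"))           # ['and', 'or']
print(split_compound("blackberry-like"))  # ['blackberry', 'like']
print(split_compound("bright"))           # ['bright']
```

With this in place, blackberry-like would be indexed under blackberry and like, which are already in the table, instead of adding a third keyword.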

It would also be helpful if regular expressions were supported in the common_words.txt file. This would let you, for example, allow phone numbers and dates but no other numbers, or exclude all numbers entirely. There is no need to index numbers given for dimensions, mathematical equations, or chart data. They just bog down the keywords table with useless data and slow down searches.
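
Here is one way the regular expression support might work, sketched in Python. The "re:" prefix for pattern lines is purely a syntax I made up for illustration; common_words.txt has no such feature today. Plain lines are treated as literal words, as they are now.

```python
import re

# Hypothetical file contents. The "re:" prefix is an invented convention.
SAMPLE_COMMON_WORDS = r"""
the
and
re:\d+(?:\.\d+)?
"""

def load_rules(text):
    """Split the file into literal words and compiled patterns."""
    literals, patterns = set(), []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("re:"):
            patterns.append(re.compile(line[3:]))
        else:
            literals.add(line.lower())
    return literals, patterns

def is_common(word, literals, patterns):
    """A word is skipped if it is a literal match or fully matches a pattern."""
    word = word.lower()
    return word in literals or any(p.fullmatch(word) for p in patterns)

literals, patterns = load_rules(SAMPLE_COMMON_WORDS)
tokens = ["The", "chart", "42", "3.5", "04-19-2004", "555-1234"]
indexed = [t for t in tokens if not is_common(t, literals, patterns)]
print(indexed)  # ['chart', '04-19-2004', '555-1234']
```

The single pattern drops plain integers and decimals, while tokens shaped like dates or phone numbers are kept because they do not fully match it. This assumes the check runs on the whole token, before any splitting on hyphens.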

The result should be a cleaner keywords table and faster, more accurate search results.