01-13-2004, 12:08 AM

I've noticed that PHPDig doesn't seem to be able to differentiate between nearly identical documents located on a website (I say nearly because they appear identical to my human eyes).

If one document is located in, say, /worldwide/ and another in /about_us/, they both come up in the search results with identical percentages.

Additionally, documents that are generated dynamically but are identical also give multiple duplicate results.

Both are listed as results (they differ by the region variable in the URL).

This behavior is understandable, since they are slightly different (from a machine's perspective).

However, is there a way to broaden the criteria used to judge duplicate documents so that highly similar documents are filtered out as well?

Say, if they share 90% of the same content?
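
To make the 90% idea concrete, here is a rough sketch of the kind of check I have in mind. It is plain Python, not PHPDig code; the function names, the three-word shingle size, and the 0.9 threshold are all my own assumptions for illustration. It measures how many overlapping word sequences two pages share (Jaccard similarity) and flags the pair if the score is at or above the threshold.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Very short text: treat the whole thing as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b, size=3):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # two empty documents count as identical
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.9):
    """True if the pages share at least `threshold` of their shingles."""
    return similarity(a, b) >= threshold
```

Two long pages that differ only in a region name would score close to 1.0 here and be treated as one result, while unrelated pages would score near 0.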

For reference, you can see this behavior for yourself at:


Search for "cleaning standards" as a good example.

Several pages into the search, you'll see some examples of pseudo-duplicates.

01-13-2004, 08:00 AM
Hi. You might try modifying the $md5 variable talked about in this (http://www.phpdig.net/showthread.php?threadid=242) thread.
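
For background on why near-identical pages slip through: assuming the duplicate check compares an MD5 digest of each page's content (my assumption about how $md5 is used, not something verified here), any difference at all, such as a region name, produces a completely different digest. A small sketch in Python rather than PHPDig's own code, with made-up page text and a hypothetical helper name:

```python
import hashlib


def content_digest(text):
    """Return the hex MD5 digest of the page text (hypothetical helper)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Made-up page bodies that differ only in the region name.
page_a = "Our cleaning standards apply in every region. Region: worldwide"
page_b = "Our cleaning standards apply in every region. Region: about_us"

# Byte-for-byte identical text always hashes to the same digest...
exact_copy_matches = content_digest(page_a) == content_digest(page_a)

# ...but a one-word difference yields an unrelated digest, so the
# two pages are not seen as duplicates by an exact-hash comparison.
near_copy_matches = content_digest(page_a) == content_digest(page_b)
```

So an exact-hash test can only catch true duplicates; catching "90% the same" pages needs the hash to be computed over a normalized or reduced version of the content, or a similarity measure instead.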