siliconkibou
01-13-2004, 12:08 AM
Hi,
I've noticed that PHPDig doesn't seem to be able to differentiate between nearly identical documents on a website (I say nearly because they look identical to my human eyes).
If one document is located in, say, /worldwide/ and another in /about_us/, they both come up in the search results with identical percentages.
Additionally, dynamically generated documents with identical content also show up as multiple duplicate results.
For example:
http://www.issa.com/worldwide/index.jsp?region=9&type=news&id=153
and
http://www.issa.com/worldwide/index.jsp?region=11&type=news&id=153
Both are listed as results (they differ only by the region variable in the URL).
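To make the URL case concrete, here is a minimal Python sketch (not PHPDig code) of one way such URLs could be collapsed before indexing. It assumes that the region parameter does not change the page content in any way that matters for search; the IGNORED_PARAMS set and the canonical_url helper are hypothetical names used only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical: query parameters assumed not to affect the indexed content.
IGNORED_PARAMS = {"region"}

def canonical_url(url, ignored=IGNORED_PARAMS):
    """Drop ignored query parameters and sort the remaining ones."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    )
    # Rebuild the URL without a fragment so equivalent links compare equal.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

a = canonical_url("http://www.issa.com/worldwide/index.jsp?region=9&type=news&id=153")
b = canonical_url("http://www.issa.com/worldwide/index.jsp?region=11&type=news&id=153")
print(a == b)  # both reduce to the same canonical form
```

Under that assumption, the two example URLs would be indexed only once.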
This behavior is understandable, since they are slightly different (from a machine's perspective).
However, is there a way to loosen the criteria used to detect duplicate documents so that highly similar documents are filtered out as well?
Say if they share 90% of the same content?
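To illustrate what a "90% the same" check could look like, here is a minimal Python sketch (again, not PHPDig code and not a description of how PHPDig compares documents). It assumes similarity is measured as the Jaccard overlap of three-word shingles; the function names, the shingle size and the 0.9 threshold are all illustrative choices.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word sequences found in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)

def is_near_duplicate(a, b, threshold=0.9, k=3):
    """True when the documents share at least `threshold` of their shingles."""
    return similarity(a, b, k) >= threshold
```

A page whose similarity to an already indexed page meets the threshold could then be skipped or hidden from the results. Note that a single changed word affects up to three shingles, so short pages drop below 90% quickly.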
Thanks in advance,
-Paul
For reference, you can see this behavior for yourself at:
http://search.custodialadvisorsnetwork.org
Search for "cleaning standards" as a good example.
Several pages into the search, you'll see some examples of pseudo-duplicates.