Reliability of Duplicate Document Detection Algorithms
First Claim
1. A method for use in detecting a duplicate document, the method comprising:
- generating a primary lexicon of attributes and a secondary lexicon of attributes;
- determining unique attributes in a document;
- determining an intersection between the unique attributes in the document and the primary lexicon;
- determining whether the intersection exceeds a threshold;
- when the intersection does not exceed the threshold, adding attributes from the secondary lexicon that intersect with the unique attributes in the document to the intersection to create an augmented intersection that exceeds the threshold; and
- calculating a signature for the document based on the augmented intersection.
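
The claimed steps can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions the claim does not specify: attributes are taken to be lowercase whitespace-separated tokens, the threshold is compared against the number of attributes in the intersection, the signature is a SHA-1 digest of the sorted attributes, and the hypothetical function `document_signature` returns `None` when even the augmented intersection fails to exceed the threshold.

```python
import hashlib


def document_signature(text, primary, secondary, threshold):
    """Sketch of the claimed method; names and details are illustrative."""
    # Assumption: an "attribute" is a lowercase whitespace-separated token.
    unique = set(text.lower().split())

    # Intersection of the document's unique attributes with the primary lexicon.
    intersection = unique & primary

    # Assumption: "exceeds a threshold" means the attribute count is greater.
    if len(intersection) <= threshold:
        # Add the secondary-lexicon attributes that also occur in the document.
        intersection = intersection | (unique & secondary)

    # Assumption: no signature is produced if the augmented intersection
    # still does not exceed the threshold.
    if len(intersection) <= threshold:
        return None

    # Assumption: the signature is a SHA-1 digest over the sorted attributes,
    # so it does not depend on word order in the document.
    canonical = " ".join(sorted(intersection))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```

Under these assumptions, two documents that differ only in words outside both lexicons yield the same signature, which is the property a single-signature duplicate detector relies on.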
Abstract
In a single-signature duplicate document system, a secondary set of attributes is used in addition to a primary set of attributes so as to improve the precision of the system. When the projection of a document onto the primary set of attributes is below a threshold, then a secondary set of attributes is used to supplement the primary lexicon so that the projection is above the threshold.
Specification