RELIABILITY OF DUPLICATE DOCUMENT DETECTION ALGORITHMS
Abstract
In a single-signature duplicate document system, a secondary set of attributes is used in addition to a primary set of attributes so as to improve the precision of the system. When the projection of a document onto the primary set of attributes is below a threshold, then a secondary set of attributes is used to supplement the primary lexicon so that the projection is above the threshold.
21 Claims
1. (canceled)
2. A method comprising:
accessing a first data item;
accessing a primary lexicon of attributes and a secondary lexicon of attributes;
determining attributes of the first data item;
determining an intersection between the determined attributes and the primary lexicon of attributes;
determining whether the intersection is sufficient to meet a target precision for duplicate detection;
if the intersection is sufficient to meet the target precision for duplicate detection, determining a signature for the first data item based on the intersection;
if the intersection is not sufficient to meet the target precision for duplicate detection, adding attributes from the secondary lexicon that intersect with the attributes of the first data item to create an augmented intersection that is sufficient to meet the target precision and determining a signature for the first data item based on the augmented intersection;
comparing the determined signature to a signature of a second data item; and
determining whether the first data item is a duplicate of the second data item based on the comparison.
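The steps of claim 2 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: attributes are taken to be lowercase word tokens, "sufficient to meet a target precision" is approximated by a hypothetical minimum attribute count (`MIN_ATTRIBUTES`), the signature is a SHA-1 digest of the sorted attributes, and the lexicons and sample texts are invented. The sketch adds every matching secondary attribute at once and does not handle the case where the augmented intersection is still too small.

```python
import hashlib
import re

# Hypothetical stand-in for "sufficient to meet a target precision".
MIN_ATTRIBUTES = 3


def attributes_of(text):
    """Assumption: the attributes of a data item are its lowercase word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def signature(text, primary, secondary, min_attributes=MIN_ATTRIBUTES):
    """Compute a single signature from the (possibly augmented) intersection."""
    attrs = attributes_of(text)
    intersection = attrs & primary
    if len(intersection) < min_attributes:
        # Too few primary attributes: add the secondary-lexicon
        # attributes that also occur in the data item.
        intersection = intersection | (attrs & secondary)
    # Hash the sorted attributes so the signature ignores word order.
    canonical = " ".join(sorted(intersection))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def is_duplicate(first, second, primary, secondary):
    """Two items are treated as duplicates when their signatures match."""
    return signature(first, primary, secondary) == signature(second, primary, secondary)


# Invented lexicons and data items, for illustration only.
primary = {"invoice", "payment", "account", "overdue"}
secondary = {"please", "your", "today", "meeting", "lunch"}

doc_a = "Your invoice: please pay today"
doc_b = "please pay your invoice today!"
doc_c = "Invoice for lunch meeting"

print(is_duplicate(doc_a, doc_b, primary, secondary))  # reordered copy
print(is_duplicate(doc_a, doc_c, primary, secondary))  # different content
print(is_duplicate(doc_a, doc_c, primary, set()))      # primary lexicon only
```

In this toy data, `doc_a` and `doc_c` each share only the attribute `invoice` with the primary lexicon, so a signature built from the primary lexicon alone would be identical for both. Adding the matching secondary attributes separates them, which is the precision improvement the claim is directed at.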
12. A computer-usable medium having a computer program embodied thereon, the computer program comprising instructions for causing a processor to perform the following operations:
access a first data item;
access a primary lexicon of attributes and a secondary lexicon of attributes;
determine attributes of the first data item;
determine an intersection between the determined attributes and the primary lexicon of attributes;
determine whether the intersection is sufficient to meet a target precision for duplicate detection;
if the intersection is sufficient to meet the target precision for duplicate detection, determine a signature for the first data item based on the intersection;
if the intersection is not sufficient to meet the target precision for duplicate detection, add attributes from the secondary lexicon that intersect with the attributes of the first data item to create an augmented intersection that is sufficient to meet the target precision and determine a signature for the first data item based on the augmented intersection;
compare the determined signature to a signature of a second data item; and
determine whether the first data item is a duplicate of the second data item based on the comparison.
Specification