Reliability of duplicate document detection algorithms

  • US 7,984,029 B2
  • Filed: 06/23/2008
  • Issued: 07/19/2011
  • Est. Priority Date: 02/11/2004
  • Status: Active Grant
First Claim

1. A method comprising:

    accessing a first data item;

    accessing a primary lexicon of attributes and a secondary lexicon of attributes;

    determining, by at least one processor, attributes of the first data item;

    determining, by at least one processor, an intersection between the determined attributes and the primary lexicon of attributes;

    determining, by at least one processor, whether the intersection is sufficient to meet a target precision for duplicate detection;

    if the intersection is sufficient to meet the target precision for duplicate detection, determining a signature for the first data item based on the intersection;

    if the intersection is not sufficient to meet the target precision for duplicate detection, adding attributes from the secondary lexicon that intersect with the attributes of the first data item to create an augmented intersection that is sufficient to meet the target precision and determining a signature for the first data item based on the augmented intersection;

    comparing the determined signature to a signature of a second data item; and

    determining whether the first data item is a duplicate of the second data item based on the comparison.
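The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several choices in it are assumptions the claim leaves open: attributes are taken to be lowercase word tokens, "sufficient to meet a target precision" is approximated by a hypothetical minimum attribute count (`min_attrs`), the signature is a SHA-1 digest of the sorted selected attributes, and an item that stays below the threshold after augmentation gets no signature. The names `attributes_of`, `signature`, and `is_duplicate` are illustrative only.

```python
import hashlib
import re


def attributes_of(text):
    """Return the item's attributes; here, its lowercase word tokens (an assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def signature(text, primary, secondary, min_attrs=3):
    """Compute a lexicon-based signature for one data item.

    min_attrs is a hypothetical stand-in for the claim's target-precision
    test: the intersection counts as sufficient when it holds at least this
    many attributes.
    """
    attrs = attributes_of(text)

    # Intersection between the item's attributes and the primary lexicon.
    selected = attrs & primary

    # If the intersection is too small, augment it with the attributes
    # that the item shares with the secondary lexicon.
    if len(selected) < min_attrs:
        selected = selected | (attrs & secondary)

    # Assumption: if the augmented intersection is still too small,
    # no signature is produced for this item.
    if len(selected) < min_attrs:
        return None

    # Sort so the signature does not depend on token order in the text.
    joined = "\n".join(sorted(selected))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def is_duplicate(first, second, primary, secondary, min_attrs=3):
    """Compare the signatures of two data items."""
    sig_first = signature(first, primary, secondary, min_attrs)
    sig_second = signature(second, primary, secondary, min_attrs)
    return sig_first is not None and sig_first == sig_second
```

In this sketch, two items are reported as duplicates only when both yield a signature and the signatures match exactly. Raising `min_attrs` makes short or lexicon-poor items fall back to the secondary lexicon more often, which is the role the claim assigns to the target-precision check.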

  • 8 Assignments