Reliability of Duplicate Document Detection Algorithms

  • US 20130007026A1
  • Filed: 09/13/2012
  • Published: 01/03/2013
  • Est. Priority Date: 02/11/2004
  • Status: Active Grant
First Claim

1. A method for use in detecting a duplicate document, the method comprising:

    generating a primary lexicon of attributes and a secondary lexicon of attributes;

    determining unique attributes in a document;

    determining an intersection between the unique attributes in the document and the primary lexicon;

    determining whether the intersection exceeds a threshold;

    when the intersection does not exceed the threshold, adding attributes from the secondary lexicon that intersect with the unique attributes in the document to the intersection to create an augmented intersection that exceeds the threshold; and

    calculating a signature for the document based on the augmented intersection.
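
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that attributes are lowercase word tokens, that both lexicons are already available as sets (the lexicon-generation step is not shown), that the threshold is a count of attributes, and that the signature is a SHA-1 digest of the sorted attributes. The names `unique_attributes`, `augmented_intersection`, and `document_signature` are hypothetical.

```python
import hashlib
import re


def unique_attributes(text):
    """Return the set of distinct attributes in a document.

    Assumption: an attribute is a lowercase alphanumeric token.
    """
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def augmented_intersection(text, primary_lexicon, secondary_lexicon, threshold):
    """Intersect the document with the primary lexicon, then fall back
    to the secondary lexicon when the result is too small."""
    attrs = unique_attributes(text)
    intersection = attrs & set(primary_lexicon)
    if len(intersection) <= threshold:
        # The intersection does not exceed the threshold, so add the
        # secondary-lexicon attributes that also occur in the document.
        # Note: this sketch does not verify that the augmented set
        # actually exceeds the threshold afterwards.
        intersection = intersection | (attrs & set(secondary_lexicon))
    return intersection


def document_signature(text, primary_lexicon, secondary_lexicon, threshold):
    """Hash the (possibly augmented) intersection in a canonical order.

    Assumption: SHA-1 over the sorted, newline-joined attributes.
    """
    kept = augmented_intersection(text, primary_lexicon, secondary_lexicon, threshold)
    canonical = "\n".join(sorted(kept))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    primary = {"invoice", "payment"}
    secondary = {"due", "total"}
    a = document_signature("Invoice total due Monday", primary, secondary, 1)
    b = document_signature("due TOTAL invoice, Friday", primary, secondary, 1)
    print(a == b)
```

In this example, both documents share only one attribute with the primary lexicon, so the secondary lexicon contributes `due` and `total`. The words `monday` and `friday` appear in neither lexicon, so the two documents reduce to the same attribute set and receive the same signature.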

  • 5 Assignments