
Decreasing the fragility of duplicate document detecting algorithms

  • US 7,624,274 B1
  • Filed: 12/21/2004
  • Issued: 11/24/2009
  • Est. Priority Date: 02/11/2004
  • Status: Active Grant
First Claim

1. A method used in detecting a duplicate document, the method comprising:

    generating at least a first and second lexicon of attributes, the second lexicon being generated by eliminating a random fraction of attributes from the first lexicon independently of a characteristic of the attributes in the first lexicon;

    determining unique attributes in a document;

    determining a first set of intersection attributes based on an intersection between the unique attributes in the document and the first lexicon;

    determining a second set of intersection attributes based on an intersection between the unique attributes in the document and the second lexicon;

    calculating a first sub-signature based on the first set of intersection attributes;

    calculating a second sub-signature based on the second set of intersection attributes;

    setting a signature of the document to include the first and second sub-signatures; and

    accessing a second signature associated with a second document, the second signature including a third sub-signature calculated based on an intersection between unique attributes in the second document and the first lexicon, and a fourth sub-signature calculated based on an intersection between unique attributes in the second document and the second lexicon; and

    if the first sub-signature matches the third sub-signature or if the second sub-signature matches the fourth sub-signature, determining that the document and the second document are duplicates.
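
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not say what an "attribute" is or how a sub-signature is calculated, so this sketch assumes attributes are lowercase whitespace-separated words and that a sub-signature is a SHA-1 hash of the sorted intersection. All function names (`derive_lexicon`, `sub_signature`, `signature`, `are_duplicates`) and the sample lexicon are hypothetical. Treating an empty intersection as "no sub-signature" is an added safeguard of this sketch, not something the claim recites.

```python
import hashlib
import random


def derive_lexicon(lexicon, drop_fraction, rng):
    """Build a second lexicon by dropping a random fraction of attributes.

    The choice ignores every characteristic of the attributes themselves.
    """
    ordered = sorted(lexicon)  # fixed order so a seeded rng is reproducible
    keep = len(ordered) - int(len(ordered) * drop_fraction)
    return frozenset(rng.sample(ordered, keep))


def sub_signature(attributes):
    """Hash a set of intersection attributes (assumed scheme: SHA-1)."""
    if not attributes:
        return None  # assumption: an empty intersection yields no sub-signature
    joined = "\n".join(sorted(attributes))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def signature(text, lexicons):
    """Return one sub-signature per lexicon for the given document text."""
    unique = set(text.lower().split())  # unique attributes in the document
    return tuple(sub_signature(unique & lexicon) for lexicon in lexicons)


def are_duplicates(sig_a, sig_b):
    """Documents are duplicates if sub-signatures for any one lexicon match."""
    return any(a is not None and a == b for a, b in zip(sig_a, sig_b))


# Hypothetical lexicon and documents, for illustration only.
first_lexicon = frozenset(
    {"cheap", "pills", "offer", "click", "free", "now", "meeting", "report"}
)
second_lexicon = derive_lexicon(first_lexicon, 0.25, random.Random(0))
lexicons = (first_lexicon, second_lexicon)

sig_a = signature("Cheap pills offer click now", lexicons)
sig_b = signature("click now offer cheap pills xyzzy", lexicons)
sig_c = signature("meeting report", lexicons)

print(are_duplicates(sig_a, sig_b))  # same lexicon words, different order
print(are_duplicates(sig_a, sig_c))  # no lexicon words in common
```

The point of the second lexicon is that a small change to a document, such as one inserted lexicon word, alters the first sub-signature but leaves the second unchanged whenever that word happens to have been dropped from the second lexicon. Matching on either sub-signature therefore makes detection less fragile than relying on a single signature.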
