Document near-duplicate detection

  • US 7,707,157 B1
  • Filed: 03/25/2004
  • Issued: 04/27/2010
  • Est. Priority Date: 03/25/2004
  • Status: Active Grant
First Claim

1. A method for generating a fingerprint of a document, performed by one or more server devices, the method comprising:

    obtaining, by a processor associated with the one or more server devices, a plurality of overlapping blocks by sampling the document;

    generating, by a processor associated with the one or more server devices, a set of checksum values from the plurality of overlapping blocks;

    choosing, by a processor associated with the one or more server devices, a subset of the set of checksum values, where the subset is less than an entirety of the set of checksum values;

    initializing, by a processor associated with the one or more server devices, the fingerprint of the document by setting all bits of the fingerprint to zero;

    addressing, by a processor associated with the one or more server devices, a particular bit of the fingerprint with a particular checksum value; and

    flipping, by a processor associated with the one or more server devices, the particular bit of the fingerprint a number of times corresponding to a number of times the particular checksum value occurs in the subset.
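
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not name a checksum function, a block size, a selection rule, or a fingerprint length, so the use of CRC-32, a byte-wise sliding window, the "checksum divisible by `keep_mod`" selection rule, the 128-bit fingerprint, and all function and parameter names below are assumptions made only for illustration.

```python
import zlib


def overlapping_blocks(document: bytes, block_size: int = 8) -> list[bytes]:
    """Sample the document with a sliding window (stride 1), so blocks overlap."""
    return [document[i:i + block_size]
            for i in range(len(document) - block_size + 1)]


def choose_subset(checksums: list[int], keep_mod: int = 4) -> list[int]:
    """Keep only checksums divisible by keep_mod (assumed selection rule).

    For typical inputs this keeps roughly 1/keep_mod of the values; it does
    not by itself guarantee a strict subset for every possible input.
    """
    return [c for c in checksums if c % keep_mod == 0]


def flip_bits(subset: list[int], fp_bits: int = 128, keep_mod: int = 4) -> int:
    """Start from an all-zero fingerprint and flip one bit per checksum.

    Each checksum addresses a bit; flipping once per occurrence means a bit
    ends up set only if it was addressed an odd number of times.
    """
    fingerprint = 0  # all bits initialized to zero
    for c in subset:
        # Divide out keep_mod first so the address is not biased by the
        # selection rule above (assumed addressing scheme).
        address = (c // keep_mod) % fp_bits
        fingerprint ^= 1 << address  # XOR flips the addressed bit
    return fingerprint


def fingerprint_document(document: bytes, block_size: int = 8,
                         fp_bits: int = 128, keep_mod: int = 4) -> int:
    """Run all steps: blocks, checksums, subset, then bit flipping."""
    blocks = overlapping_blocks(document, block_size)
    checksums = [zlib.crc32(block) for block in blocks]
    subset = choose_subset(checksums, keep_mod)
    return flip_bits(subset, fp_bits, keep_mod)
```

In this sketch, two documents that share most of their blocks share most of their selected checksums, so their fingerprints would be expected to differ in only a few bits; comparing fingerprints by Hamming distance is one plausible way to use the result, though the claim itself covers only fingerprint generation.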
