Document near-duplicate detection
First Claim
Patent Images
1. A method for generating a fingerprint of a document, performed by one or more server devices, the method comprising:
- obtaining, by a processor associated with the one or more server devices, a plurality of overlapping blocks by sampling the document;
generating, by a processor associated with the one or more server devices, a set of checksum values from the plurality of overlapping blocks;
choosing, by a processor associated with the one or more server devices, a subset of the set of checksum values, where the subset is less than an entirety of the set of checksum values;
initializing, by a processor associated with the one or more server devices, the fingerprint of the document by setting all bits of the fingerprint to zero;
addressing, by a processor associated with the one or more server devices, a particular bit of the fingerprint with a particular checksum value; and
flipping, by a processor associated with the one or more server devices, the particular bit of the fingerprint a number of times corresponding to a number of times the particular checksum value occurs in the subset.
2 Assignments
0 Petitions
Accused Products
Abstract
A near-duplicate component includes a fingerprint creation component and a similarity detection component. The fingerprint creation component receives a document of arbitrary size and generates a compact “fingerprint” that describes the contents of the document. The similarity detection component compares multiple fingerprints based on the hamming distance between the fingerprints. When the hamming distance is below a threshold, the documents can be said to be near-duplicates of one another.
59 Citations
18 Claims
-
1. A method for generating a fingerprint of a document, performed by one or more server devices, the method comprising:
-
obtaining, by a processor associated with the one or more server devices, a plurality of overlapping blocks by sampling the document; generating, by a processor associated with the one or more server devices, a set of checksum values from the plurality of overlapping blocks; choosing, by a processor associated with the one or more server devices, a subset of the set of checksum values, where the subset is less than an entirety of the set of checksum values; initializing, by a processor associated with the one or more server devices, the fingerprint of the document by setting all bits of the fingerprint to zero; addressing, by a processor associated with the one or more server devices, a particular bit of the fingerprint with a particular checksum value; and flipping, by a processor associated with the one or more server devices, the particular bit of the fingerprint a number of times corresponding to a number of times the particular checksum value occurs in the subset. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
-
-
10. A method for generating a fingerprint of a document, performed by one or more server devices, the method comprising:
-
sampling, by a processor associated with the one or more server devices, the document to obtain a plurality of overlapping samples; generating, by a processor associated with the one or more server devices, a set of checksum values from the plurality of overlapping samples; selecting, by a processor associated with the one or more server devices, a subset of the set of checksum values as those of the checksum values corresponding to a predetermined number of smallest checksum values or a predetermined number of largest checksum values; addressing, by a processor associated with the one or more server devices, a particular bit of the fingerprint with a particular checksum value; and flipping, by a processor associated with the one or more server devices, the particular bit of the fingerprint a number of times corresponding to a number of times the particular checksum value occurs in the subset. - View Dependent Claims (11, 12, 13, 14, 15)
-
-
16. A memory device containing program instructions that, when executed by a processor, cause the processor to:
-
sample a document to obtain a plurality of overlapping samples; generate a set of checksum values from the plurality of overlapping samples; select a subset of the set of checksum values as those of the checksum values corresponding to a predetermined number of smallest checksum values or a predetermined number of largest checksum values; addressing bits in the fingerprint with particular values; and flipping a particular bit in the fingerprint a number of times based on a number of checksum values in the subset that correspond to the particular value addressed to the particular bit. - View Dependent Claims (17, 18)
-
Specification