Document near-duplicate detection
Abstract
A near-duplicate component includes a fingerprint creation component and a similarity detection component. The fingerprint creation component receives a document of arbitrary size and generates a compact “fingerprint” that describes the contents of the document. The similarity detection component compares multiple fingerprints based on the Hamming distance between the fingerprints. When the Hamming distance between two fingerprints is below a threshold, the corresponding documents can be said to be near-duplicates of one another.
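The comparison step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: fingerprints are assumed to be held as Python integers, and the function names and the default threshold of 3 are hypothetical, since the abstract does not give a threshold value.

```python
def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Count the bit positions at which two fingerprints differ."""
    # XOR leaves a 1 in every position where the two fingerprints disagree.
    return bin(fp_a ^ fp_b).count("1")


def is_near_duplicate(fp_a: int, fp_b: int, threshold: int = 3) -> bool:
    """Treat two documents as near-duplicates when the distance is below the threshold.

    The threshold value is an assumption made for this sketch.
    """
    return hamming_distance(fp_a, fp_b) < threshold


if __name__ == "__main__":
    print(hamming_distance(0b1010, 0b0110))
    print(is_near_duplicate(0b1111, 0b1110))
```

Because the fingerprints have a fixed length, the comparison cost does not depend on the size of the original documents.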
16 Citations
36 Claims
1. A computer device comprising:
- a memory to store instructions for implementing:
  - a fingerprint creation component to generate a fingerprint of a particular length for an input document, where, to generate the fingerprint, the fingerprint creation component is to:
    - sample the input document to obtain a plurality of sampled blocks,
    - generate a set of checksum values from the plurality of sampled blocks, where each checksum value, in the set of checksum values, corresponds to an address of a respective one of a plurality of bits of the fingerprint, and
    - generate the fingerprint by flipping a particular bit, of the plurality of bits of the fingerprint, a quantity of times based on a quantity of checksum values, in the set of checksum values, that corresponds to the address of the particular bit; and
- a processor to execute the instructions in the memory.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
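The three steps recited in claim 1 (sample, checksum, flip) can be sketched as follows. This is only an illustration under assumptions the claim does not state: the fixed-size byte windows, the use of CRC-32 through `zlib.crc32`, the modulo mapping from checksum to bit address, the all-zero starting fingerprint, and the 128-bit default length are all hypothetical choices, as are the function and parameter names.

```python
import zlib


def sample_blocks(data: bytes, block_size: int = 8, stride: int = 1) -> list[bytes]:
    """Sample the document as fixed-size byte windows (an assumed sampling scheme)."""
    last_start = max(len(data) - block_size + 1, 1)
    return [data[i:i + block_size] for i in range(0, last_start, stride)]


def fingerprint(data: bytes, num_bits: int = 128,
                block_size: int = 8, stride: int = 1) -> int:
    """Build a fingerprint by flipping one bit per sampled block."""
    fp = 0  # assumed starting state: all bits cleared
    for block in sample_blocks(data, block_size, stride):
        # Map the checksum to the address of one fingerprint bit.
        address = zlib.crc32(block) % num_bits
        # Flip that bit; a bit addressed n times is flipped n times.
        fp ^= 1 << address
    return fp


if __name__ == "__main__":
    doc = b"the quick brown fox jumps over the lazy dog"
    print(f"{fingerprint(doc):032x}")
```

In this sketch, two documents that share most of their sampled blocks flip mostly the same bits, so their fingerprints differ in only a few positions.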
11. A system comprising:
- one or more computer devices configured to:
  - sample a document to obtain a plurality of sampled blocks,
  - generate a set of checksum values from the plurality of sampled blocks, where each checksum value, in the set of checksum values, identifies a respective one of a plurality of bits of a fingerprint for the document, and
  - generate the fingerprint, for the document, by flipping each bit, of the plurality of bits of the fingerprint, a quantity of times based on a quantity of checksum values, in the set of checksum values, that identify the bit.
Dependent claims: 12, 13, 14, 15, 16
17. A method, performed by one or more server devices, the method comprising:
- generating, by one or more processors associated with the one or more server devices, a plurality of checksum values based on a document;
- reducing, by one or more processors associated with the one or more server devices, lengths of the plurality of checksum values to form a plurality of reduced-length checksum values, where each reduced-length checksum value, of the plurality of reduced-length checksum values, matches an address of a respective one of a plurality of bits of a fingerprint corresponding to the document; and
- generating, by one or more processors associated with the one or more server devices, the fingerprint, corresponding to the document, by flipping each bit, of the plurality of bits of the fingerprint, based on a quantity of reduced-length checksum values, of the plurality of reduced-length checksum values, that matches the address of the bit.
Dependent claims: 18, 19, 20, 21, 22, 23, 24
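The length-reduction step of claim 17 can be illustrated with a short sketch. It assumes, hypothetically, that a checksum is reduced by keeping its low-order bits, so that a 7-bit reduced checksum addresses one of 128 fingerprint bits, and that the fingerprint starts with all bits cleared. The claim itself does not specify how lengths are reduced, and the names below are illustrative.

```python
from collections import Counter
from typing import Iterable


def reduce_checksum(checksum: int, address_bits: int) -> int:
    """Keep only the low-order bits so the value fits the fingerprint's address range."""
    return checksum & ((1 << address_bits) - 1)


def fingerprint_from_checksums(checksums: Iterable[int], address_bits: int = 7) -> int:
    """Set each bit according to how many reduced checksums match its address."""
    counts = Counter(reduce_checksum(c, address_bits) for c in checksums)
    fp = 0  # assumed starting state: all bits cleared
    for address, count in counts.items():
        # Flipping a cleared bit `count` times leaves it set only when count is odd.
        if count % 2 == 1:
            fp |= 1 << address
    return fp


if __name__ == "__main__":
    print(reduce_checksum(0xABCD, 7))
    print(fingerprint_from_checksums([0x01, 0x81, 0x02]))
```

Counting matches per address and then applying the parity gives the same result as flipping the bit once per matching checksum.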
25. A computer device, comprising:
- a memory to store instructions for implementing:
  - a fingerprint creation component to generate a fingerprint of a particular length for a document, where, to generate the fingerprint, the fingerprint creation component is to:
    - generate a plurality of checksum values based on the document,
    - reduce lengths of the plurality of checksum values to form a plurality of reduced-length checksum values, where each reduced-length checksum value, of the plurality of reduced-length checksum values, matches an address of a respective one of a plurality of bits of the fingerprint corresponding to the document, and
    - generate the fingerprint, corresponding to the document, by flipping each bit, of the plurality of bits of the fingerprint, based on a quantity of reduced-length checksum values, of the plurality of reduced-length checksum values, that matches the address of the bit; and
- a processor to execute the instructions in the memory.
Dependent claims: 26, 27
28. A system, comprising:
- one or more computer devices to:
  - generate a plurality of checksum values based on a document;
  - reduce lengths of the plurality of checksum values to form a plurality of reduced-length checksum values, where each reduced-length checksum value, of the plurality of reduced-length checksum values, matches an address of a respective one of a plurality of bits of a fingerprint corresponding to the document; and
  - generate the fingerprint, corresponding to the document, by flipping each bit, of the plurality of bits of the fingerprint, based on a quantity of reduced-length checksum values, of the plurality of reduced-length checksum values, that matches the address of the bit.
Dependent claims: 29, 30
31. A method performed by one or more server devices, the method comprising:
- sampling an input document to obtain a plurality of sampled blocks;
- generating a set of checksum values from the plurality of sampled blocks, where each checksum value, in the set of checksum values, corresponds to an address of a respective one of a plurality of bits of a fingerprint for the input document; and
- generating the fingerprint by flipping a particular bit, of the plurality of bits of the fingerprint, a quantity of times based on a quantity of checksum values, in the set of checksum values, that corresponds to the address of the particular bit.
Dependent claims: 32, 33
34. A method performed by one or more server devices, the method comprising:
- sampling a document to obtain a plurality of sampled blocks;
- generating a set of checksum values from the plurality of sampled blocks, where each checksum value, in the set of checksum values, identifies a respective one of a plurality of bits of a fingerprint for the document; and
- generating the fingerprint, for the document, by flipping each bit, of the plurality of bits of the fingerprint, a quantity of times based on a quantity of checksum values, in the set of checksum values, that identify the bit.
Dependent claims: 35, 36
Specification