Document near-duplicate detection
First Claim
Patent Images
1. A method performed by one or more computer devices, the method comprising:
- generating, by at least one of the one or more computer devices, a fingerprint for an input document, where generating the fingerprint includes;
sampling the input document to obtain a plurality of sampled blocks,generating a set of checksum values from the plurality of sampled blocks, where each checksum value, in the set of checksum values, corresponds to an address of a respective one of a plurality of bits of the fingerprint, andgenerating the fingerprint by flipping a particular bit, of the plurality of bits of the fingerprint, a quantity of times based on a quantity of checksum values, in the set of checksum values, that corresponds to the address of the particular bit; and
storing, by at least one of the one or more computing devices, the fingerprint in a memory.
1 Assignment
0 Petitions
Accused Products
Abstract
A near-duplicate component includes a fingerprint creation component and a similarity detection component. The fingerprint creation component receives a document of arbitrary size and generates a compact “fingerprint” that describes the contents of the document. The similarity detection component compares multiple fingerprints based on the hamming distance between the fingerprints. When the hamming distance is below a threshold, the documents can be said to be near-duplicates of one another.
-
Citations
61 Claims
-
1. A method performed by one or more computer devices, the method comprising:
-
generating, by at least one of the one or more computer devices, a fingerprint for an input document, where generating the fingerprint includes; sampling the input document to obtain a plurality of sampled blocks, generating a set of checksum values from the plurality of sampled blocks, where each checksum value, in the set of checksum values, corresponds to an address of a respective one of a plurality of bits of the fingerprint, and generating the fingerprint by flipping a particular bit, of the plurality of bits of the fingerprint, a quantity of times based on a quantity of checksum values, in the set of checksum values, that corresponds to the address of the particular bit; and storing, by at least one of the one or more computing devices, the fingerprint in a memory. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
-
-
10. A non-transitory computer-readable memory device that stores instructions executable by one or more computer devices, the computer-readable memory device comprising:
-
one or more instructions to generate a fingerprint for an input document, where the one or more instructions to generate the fingerprint include; one or more instructions to sample the input document to obtain a plurality of sampled blocks, one or more instructions to generate a set of checksum values from the plurality of sampled blocks, where each checksum value, in the set of checksum values, corresponds to an address of a respective one of a plurality of bits of the fingerprint, and one or more instructions to generate the fingerprint by flipping a particular bit, of the plurality of bits of the fingerprint, a quantity of times based on a quantity of checksum values, in the set of checksum values, that corresponds to the address of the particular bit; and one or more instructions to store the fingerprint. - View Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18)
-
-
19. A method performed by one or more computer devices, the method comprising:
-
sampling, by at least one of the one or more computer devices, a document to obtain a plurality of sampled blocks; generating, by at least one of the one or more computer devices, a set of checksum values from the plurality of sampled blocks, where each checksum value, in the set of checksum values, identifies a respective one of a plurality of bits of a fingerprint for the document; and generating, by at least one of the one or more computer devices, the fingerprint, for the document, by flipping each bit, of the plurality of bits of the fingerprint, a quantity of times based on a quantity of checksum values, in the set of checksum values, that identify the bit. - View Dependent Claims (20, 21, 22, 23, 24)
-
-
25. A non-transitory computer-readable memory device that stores instructions executable by one or more computer devices, the computer-readable memory device comprising:
-
one or more instructions to sample a document to obtain a plurality of sampled blocks; one or more instructions to generate a set of checksum values from the plurality of sampled blocks, where each checksum value, in the set of checksum values, identifies a respective one of a plurality of bits of a fingerprint for the document; and one or more instructions to generate the fingerprint, for the document, by flipping each bit, of the plurality of bits of the fingerprint, a quantity of times based on a quantity of checksum values, in the set of checksum values, that identify the bit. - View Dependent Claims (26, 27, 28, 29, 30)
-
-
31. A system, comprising:
one or more computer devices configured to; generate a plurality of checksum values based on a document; reduce lengths of the plurality of checksum values to form a plurality of reduced-length checksum values, where each reduced-length checksum value, of the plurality of reduced-length checksum values, matches an address of a respective one of a plurality of bits of a fingerprint corresponding to the document; and generate the fingerprint, corresponding to the document, by flipping each bit, of the plurality of bits of the fingerprint, based on a quantity of reduced-length checksum values, of the plurality of reduced-length checksum values, that matches the address of the bit. - View Dependent Claims (32, 33, 34, 35, 36, 37, 38)
-
39. A non-transitory computer-readable memory device that stores instructions executable by one or more computer devices, the computer-readable memory device comprising:
-
one or more instructions to generate a plurality of checksum values based on a document; one or more instructions to reduce lengths of the plurality of checksum values to form a plurality of reduced-length checksum values, where each reduced-length of a respective one of a plurality of bits of a fingerprint corresponding to the document; and one or more instructions to generate the fingerprint, corresponding to the document, by flipping each bit, of the plurality of bits of the fingerprint, based on a quantity of reduced-length checksum values, of the plurality of reduced-length checksum values, that matches the address of the bit. - View Dependent Claims (40, 41, 42, 43, 44, 45, 46)
-
-
47. A method performed by one or more computer devices, the method comprising:
-
sampling, by at least one of the one or more computer devices, an input document to obtain a plurality of sampled blocks; computing, by at least one of the one or more computer devices, a set of checksum values from the plurality of sampled blocks, where each checksum value, in the set of checksum values, corresponds to an address of a respective one of a plurality of bits of a fingerprint corresponding to the input document; setting, by at least one of the one or more computer devices, a particular bit, of the plurality of bits of the fingerprint, to a particular value to generate the fingerprint, the particular bit being set to the particular value based on a quantity of checksum values, in the set of checksum values, that corresponds to the address of the particular bit; and storing, in a memory, the fingerprint as a representation of the input document. - View Dependent Claims (48, 49, 50, 51)
-
-
52. A system, comprising:
-
one or more computer devices configured to; sample an input document to obtain a plurality of sampled blocks; compute a set of checksum values from the plurality of sampled blocks, where each checksum value, in the set of checksum values, corresponds to an address of a respective one of a plurality of bits of a fingerprint corresponding to the input document; set a particular bit, of the plurality of bits of the fingerprint, to a particular value to generate the fingerprint, the particular bit being set to the particular value based on a quantity of checksum values, in the set of checksum values, that corresponds to the address of the particular bit; and store, in a memory, the fingerprint as a representation of the input document. - View Dependent Claims (53, 54, 55, 56)
-
-
57. A non-transitory computer-readable memory device storing instructions executable by one or more computer devices, the instructions comprising:
-
one or more instructions to sample an input document to obtain a plurality of sampled blocks; one or more instructions to compute a set of checksum values from the plurality of sampled blocks, where each checksum value, in the set of checksum values, corresponds to an address of a respective one of a plurality of bits of a fingerprint corresponding to the input document; one or more instructions to set a particular bit, of the plurality of bits of the fingerprint, to a particular value to generate the fingerprint, the particular bit being set to the particular value based on a quantity of checksum values, in the set of checksum values, that corresponds to the address of the particular bit; and one or more instructions to store the fingerprint as a representation of the input document. - View Dependent Claims (58, 59, 60, 61)
-
Specification