Near-duplicate document detection for web crawling
First Claim
Patent Images
1. A method performed by one or more devices, the method comprising:
- determining, by at least one of the one or more devices, a first fingerprint of a first document;
identifying, by at least one of the one or more devices, a second fingerprint of a second document,the second fingerprint being different than the first fingerprint;
determining, by at least one of the one or more devices, whether the second fingerprint includes bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the first fingerprint,the sequence of bit positions being fewer than all of the bit positions in the second fingerprint;
identifying, by at least one of the one or more devices, the first document as a near-duplicate of the second document based on;
determining that the second fingerprint includes bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the first fingerprint, anddetermining that the second fingerprint differs from the first fingerprint in at most k bits,k being an integer greater than zero; and
processing, by at least one of the one or more devices, the first document or the second document based on identifying the first document as a near-duplicate of the second document.
1 Assignment
0 Petitions
Accused Products
Abstract
A system generates a hash value for a fetched document and compares the hash value with a set of stored hash values to identify ones of the stored hash values with a sequence of bit positions, less than all of the bit positions, that match a corresponding sequence of bit positions of the hash value. The system also determines whether any of the identified hash values are substantially similar to the hash value and identify the fetched document as a near-duplicate of another document when one of the identified hash values is substantially similar to the hash value.
-
Citations
20 Claims
-
1. A method performed by one or more devices, the method comprising:
-
determining, by at least one of the one or more devices, a first fingerprint of a first document; identifying, by at least one of the one or more devices, a second fingerprint of a second document, the second fingerprint being different than the first fingerprint; determining, by at least one of the one or more devices, whether the second fingerprint includes bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the first fingerprint, the sequence of bit positions being fewer than all of the bit positions in the second fingerprint; identifying, by at least one of the one or more devices, the first document as a near-duplicate of the second document based on; determining that the second fingerprint includes bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the first fingerprint, and determining that the second fingerprint differs from the first fingerprint in at most k bits, k being an integer greater than zero; and processing, by at least one of the one or more devices, the first document or the second document based on identifying the first document as a near-duplicate of the second document. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A device comprising:
-
a processor; and a memory including a plurality of instructions that, when executed by the processor, cause the processor to; determine a first fingerprint of a first document; identify a second fingerprint of a second document, the second fingerprint being different than the first fingerprint; determine whether the second fingerprint includes bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the first fingerprint, the sequence of bit positions being fewer than all of the bit positions in the second fingerprint; identify the first document as a near-duplicate of the second document, the processor, when identifying the first document as a near-duplicate of the second document, being to; determine that the second fingerprint includes bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the first fingerprint, and determine that the second fingerprint differs from the first fingerprint in at most k bits, k being an integer greater than zero; and process the first document or the second document based on identifying the first document as a near-duplicate of the second document. - View Dependent Claims (9, 10, 11, 12, 13, 14)
-
-
15. A non-transitory computer-readable medium storing instructions, the instructions comprising:
-
one or more instructions which, when executed by at least one processor, cause the at least one processor to determine a first fingerprint of a first document; one or more instructions which, when executed by the at least one processor, cause the at least one processor to identify a second fingerprint of a second document, the second fingerprint being different than the first fingerprint; one or more instructions which, when executed by the at least one processor, cause the at least one processor to identify the first document as a near-duplicate of the second document, the one or more instructions to identify the first document as a near-duplicate of the second document including; one or more instructions to determine that the second fingerprint includes bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the first fingerprint, and one or more instructions to determine that the second fingerprint differs from the first fingerprint in at most k bits, k being an integer greater than zero; and one or more instructions which, when executed by the at least one processor, cause the at least one processor to process the first document or the second document based on identifying the first document as a near-duplicate of the second document. - View Dependent Claims (16, 17, 18, 19, 20)
-
Specification