Near-duplicate document detection for web crawling
First Claim
Patent Images
1. A method performed by one or more devices, the method comprising:
- crawling, by at least one of the one or more devices, a document;
determining, by at least one of the one or more devices, a fingerprint for the crawled document;
identifying, by at least one of the one or more devices, fingerprints of a set of stored fingerprints with bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the fingerprint, where the sequence of bit positions is less than all of the bit positions in the identified fingerprints;
determining, by at least one of the one or more devices, whether one of the identified fingerprints is substantially similar to the fingerprint based on whether the one of the identified fingerprints differs from the fingerprint in at most k bits, where k is an integer greater than zero; and
identifying, by at least one of the one or more devices, the crawled document as a near-duplicate of another document when the one of the identified fingerprints is substantially similar to the fingerprint.
2 Assignments
0 Petitions
Accused Products
Abstract
A system generates a hash value for a fetched document and compares the hash value with a set of stored hash values to identify ones of the stored hash values with a sequence of bit positions, less than all of the bit positions, that match a corresponding sequence of bit positions of the hash value. The system also determines whether any of the identified hash values are substantially similar to the hash value and identify the fetched document as a near-duplicate of another document when one of the identified hash values is substantially similar to the hash value.
-
Citations
27 Claims
-
1. A method performed by one or more devices, the method comprising:
-
crawling, by at least one of the one or more devices, a document; determining, by at least one of the one or more devices, a fingerprint for the crawled document; identifying, by at least one of the one or more devices, fingerprints of a set of stored fingerprints with bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the fingerprint, where the sequence of bit positions is less than all of the bit positions in the identified fingerprints; determining, by at least one of the one or more devices, whether one of the identified fingerprints is substantially similar to the fingerprint based on whether the one of the identified fingerprints differs from the fingerprint in at most k bits, where k is an integer greater than zero; and identifying, by at least one of the one or more devices, the crawled document as a near-duplicate of another document when the one of the identified fingerprints is substantially similar to the fingerprint. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
-
-
11. A system implemented within one or more computer devices, comprising:
-
means for generating a hash value for a fetched document; means for comparing the hash value with a set of stored hash values to identify one or more of the stored hash values with bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the hash value, where the sequence of bit positions is less than all of the bit positions in the identified hash values; means for determining that one of the identified hash values is substantially similar to the hash value when the one of the identified hash values differs from the hash value in less than a threshold number of bits; and means for identifying the fetched document as a near-duplicate of another document when the one of the identified hash values is substantially similar to the hash value.
-
-
12. A system comprising:
-
a memory to store a plurality of fingerprints associated with a plurality of previously-crawled documents; and a processor to; crawl a document, generate a fingerprint for the crawled document, apply a number of permutations to the fingerprint to generate a plurality of permuted fingerprints, for each of the permuted fingerprints, determine whether bits associated with one of the fingerprints in the memory differ from bits associated with the permuted fingerprint in at most k bit positions, where k is an integer greater than zero, and identify the crawled document as a near-duplicate of one of the previously-crawled documents corresponding to the one of the fingerprints when the bits associated with the one of the fingerprints differ from the bits associated with one of the permuted fingerprints in at most the k bit positions. - View Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20, 21)
-
-
22. A system comprising:
-
a first table to store a plurality of fingerprints associated with a plurality of documents; a coordinator to; receive a plurality of new documents, determine fingerprints for the new documents, and broadcast the fingerprints; and a plurality of devices to receive the broadcast fingerprints and maintain the first table, where each of the devices is configured to; create a second table based on the broadcast fingerprints, determine whether a fingerprint in a set of the fingerprints from the first table is substantially similar to one of the broadcast fingerprints in the second table by; identifying ones of the broadcast fingerprints, in the second table, that include a sequence of bits that match a corresponding sequence of bits of the fingerprint, and determining whether bits, associated with one of the identified broadcast fingerprints in the second table, differ from bits associated with the fingerprint in at most k bit positions, where k is an integer greater than zero, and identify the one of the broadcast fingerprints as being associated with a near-duplicate document when a fingerprint in the set of the fingerprints from the first table is substantially similar to the one of the broadcast fingerprints in the second table. - View Dependent Claims (23, 24, 25, 26)
-
-
27. A computer-readable memory device that includes computer-executable instructions, comprising:
-
instructions for fetching a document; instructions for generating a fingerprint for the fetched document; instructions for applying a plurality of permutations to the fingerprint to generate a plurality of permuted fingerprints; instructions for comparing each of the permuted fingerprints to a set of stored fingerprints, corresponding to a plurality of other documents, to identify one or more of the stored fingerprints that include bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the permuted fingerprint, where the sequence of bit positions is less than all of the bit positions in the identified fingerprints; instructions for determining whether one of the identified fingerprints differs from one of the permuted fingerprints in less than a threshold number of bits; and instructions for identifying the fetched document as a near-duplicate of one of the other documents when the one of the identified fingerprints differs from the one of the permuted fingerprints in less than the threshold number of bits.
-
Specification