Near-duplicate document detection for web crawling

US 8,140,505 B1
Filed: 03/31/2005
Issued: 03/20/2012
Est. Priority Date: 03/31/2005
Status: Active Grant

Abstract

A system generates a hash value for a fetched document and compares the hash value with a set of stored hash values to identify ones of the stored hash values with a sequence of bit positions, less than all of the bit positions, that match a corresponding sequence of bit positions of the hash value. The system also determines whether any of the identified hash values are substantially similar to the hash value and identifies the fetched document as a near-duplicate of another document when one of the identified hash values is substantially similar to the hash value.
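
The two-stage comparison described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the 64-bit width, the 16-bit leading sequence, the value k = 3, and the function names are all assumptions chosen for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def find_near_duplicate(fp, stored, prefix_bits=16, k=3, width=64):
    """Return a stored hash value within k bits of fp, or None.

    Stage 1: keep only stored values whose leading prefix_bits match fp
             (a sequence of bit positions that is less than all of them).
    Stage 2: accept a candidate if it differs from fp in at most k bits.
    All parameter defaults are hypothetical.
    """
    shift = width - prefix_bits
    candidates = [s for s in stored if (s >> shift) == (fp >> shift)]
    for s in candidates:
        if hamming_distance(s, fp) <= k:
            return s
    return None


stored = [0xABCD000000000000, 0x1234000000000000]
match = find_near_duplicate(0xABCD000000000003, stored)    # differs in 2 low bits
no_match = find_near_duplicate(0xABCD0000000000FF, stored)  # differs in 8 bits
missed = find_near_duplicate(0x1235000000000000, stored)    # 1-bit change inside the prefix
```

The last call shows the limit of a single prefix: a hash value that differs only inside the matched bit sequence is never selected as a candidate, which is the gap the permuted tables in claims 7, 12, and 17 address.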

27 Claims

1. A method performed by one or more devices, the method comprising:
- crawling, by at least one of the one or more devices, a document;
  
  determining, by at least one of the one or more devices, a fingerprint for the crawled document;
  
  identifying, by at least one of the one or more devices, fingerprints of a set of stored fingerprints with bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the fingerprint, where the sequence of bit positions is less than all of the bit positions in the identified fingerprints;
  
  determining, by at least one of the one or more devices, whether one of the identified fingerprints is substantially similar to the fingerprint based on whether the one of the identified fingerprints differs from the fingerprint in at most k bits, where k is an integer greater than zero; and
  
  identifying, by at least one of the one or more devices, the crawled document as a near-duplicate of another document when the one of the identified fingerprints is substantially similar to the fingerprint.
  - 2. The method of claim 1, wherein determining a fingerprint for the crawled document includes:
    - extracting a set of features from the crawled document, assigning a weight to each of the features, and determining the fingerprint for the crawled document based on the set of features and the assigned weights.
  - 3. The method of claim 2, wherein the set of features do not include stop words in the crawled document.
  - 4. The method of claim 2, wherein a weight assigned to a more important one of the features is higher than a weight assigned to a less important one of the features.
  - 5. The method of claim 2, wherein a feature with a higher assigned weight influences the fingerprint more than a feature with a lower assigned weight.
  - 6. The method of claim 1, further comprising:
    - storing the set of stored fingerprints in each of a plurality of tables.
  - 7. The method of claim 6, wherein storing the set of stored fingerprints in each of a plurality of tables includes:
    - applying a plurality of permutations to each of the fingerprints in the set of stored fingerprints, and storing, in each of the tables, the set of stored fingerprints to which a corresponding one of the permutations has been applied.
  - 8. The method of claim 1, further comprising:
    - ignoring outgoing links in the crawled document when the crawled document is identified as a near-duplicate of another document.
  - 9. The method of claim 1, further comprising:
    - identifying the crawled document as not being a near-duplicate of another document when none of the identified fingerprints is substantially similar to the fingerprint; and
      
      extracting outgoing links from the crawled document for crawling.
  - 10. The method of claim 9, further comprising:
    - storing the fingerprint with the set of stored fingerprints when the crawled document is identified as not being a near-duplicate of another document.
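
Claims 2 through 5 describe a fingerprint built from weighted features without fixing a particular construction. The sketch below shows one well-known scheme that has these properties (a simhash-style signed bit vote); it is an assumption for illustration, not the method the claims mandate. The stop-word list, the use of term counts as weights, and the BLAKE2b feature hash are likewise hypothetical choices.

```python
import hashlib

STOP_WORDS = {"the", "a", "of", "and"}  # hypothetical stop-word list


def feature_hash(feature, width=64):
    """Deterministic width-bit hash of a feature string."""
    digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=width // 8).digest()
    return int.from_bytes(digest, "big")


def extract_features(text):
    """Map each non-stop-word token to a weight (here, its term count)."""
    weights = {}
    for token in text.lower().split():
        if token in STOP_WORDS:
            continue
        weights[token] = weights.get(token, 0.0) + 1.0
    return weights


def weighted_fingerprint(weights, width=64):
    """Each feature votes +weight or -weight on every bit; the sign decides the bit."""
    totals = [0.0] * width
    for feature, weight in weights.items():
        h = feature_hash(feature, width)
        for bit in range(width):
            if (h >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


features = extract_features("the cat and the hat")
dominated = weighted_fingerprint({"x": 10.0, "y": 1.0})
```

In this sketch a feature with a much larger weight decides every bit on its own, so `dominated` equals `feature_hash("x")`, which mirrors the behaviour recited in claim 5.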

11. A system implemented within one or more computer devices, comprising:
- means for generating a hash value for a fetched document;
  
  means for comparing the hash value with a set of stored hash values to identify one or more of the stored hash values with bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the hash value, where the sequence of bit positions is less than all of the bit positions in the identified hash values;
  
  means for determining that one of the identified hash values is substantially similar to the hash value when the one of the identified hash values differs from the hash value in less than a threshold number of bits; and
  
  means for identifying the fetched document as a near-duplicate of another document when the one of the identified hash values is substantially similar to the hash value.

12. A system comprising:
- a memory to store a plurality of fingerprints associated with a plurality of previously-crawled documents; and
  
  a processor to:
  
  crawl a document, generate a fingerprint for the crawled document, apply a number of permutations to the fingerprint to generate a plurality of permuted fingerprints, for each of the permuted fingerprints, determine whether bits associated with one of the fingerprints in the memory differ from bits associated with the permuted fingerprint in at most k bit positions, where k is an integer greater than zero, and identify the crawled document as a near-duplicate of one of the previously-crawled documents corresponding to the one of the fingerprints when the bits associated with the one of the fingerprints differ from the bits associated with one of the permuted fingerprints in at most the k bit positions.
  - 13. The system of claim 12, wherein when generating a fingerprint for the crawled document, the processor is configured to:
    - extract a set of features from the crawled document, assign a weight to each of the features, and determine the fingerprint for the crawled document based on the set of features and the assigned weights.
  - 14. The system of claim 13, wherein the set of features do not include stop words in the crawled document.
  - 15. The system of claim 13, wherein a weight assigned to a more important one of the features is higher than a weight assigned to a less important one of the features.
  - 16. The system of claim 13, wherein one of the features with a higher assigned weight influences the fingerprint more than another one of the features with a lower assigned weight.
  - 17. The system of claim 12, wherein the memory includes a plurality of tables; and
    - wherein the processor is further configured to:
      
      apply a plurality of permutations to each of the fingerprints associated with the plurality of previously-crawled documents, and store, in each of the tables, the fingerprints to which a corresponding one of the permutations has been applied.
  - 18. The system of claim 12, wherein the processor is further configured to:
    - identify one or more of the fingerprints in the memory that include a string of bits that matches a corresponding string of bits of each of the permuted fingerprints.
  - 19. The system of claim 12, wherein the processor is further configured to:
    - ignore outgoing links in the crawled document when the crawled document is identified as a near-duplicate of one of the previously-crawled documents.
  - 20. The system of claim 12, wherein the processor is further configured to:
    - identify the crawled document as not being a near-duplicate of one of the previously-crawled documents when none of the identified fingerprints is substantially similar to one of the permuted fingerprints, and extract outgoing links from the crawled document for crawling.
  - 21. The system of claim 20, wherein the processor is further configured to:
    - store the fingerprint in the memory when the crawled document is identified as not being a near-duplicate of one of the previously-crawled documents.
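
Claims 12 and 17 combine permuted fingerprints with a plurality of tables. A minimal sketch of that arrangement follows, under stated assumptions: bit rotations stand in for the permutations, each table is indexed by the leading 16 bits of the permuted 64-bit fingerprint, and k = 3. The class and method names are hypothetical.

```python
from collections import defaultdict

WIDTH = 64
MASK = (1 << WIDTH) - 1


def rotate_left(fp, r):
    """Rotate a WIDTH-bit value left by r positions (a permutation of bit positions)."""
    r %= WIDTH
    return ((fp << r) | (fp >> (WIDTH - r))) & MASK


class PermutedTables:
    def __init__(self, rotations=(0, 16, 32, 48), prefix_bits=16, k=3):
        self.rotations = rotations
        self.shift = WIDTH - prefix_bits
        self.k = k
        # One table per permutation, keyed by the leading bits of the permuted value.
        self.tables = [defaultdict(list) for _ in rotations]

    def add(self, fp):
        """Store the fingerprint, permuted, in each table."""
        for table, r in zip(self.tables, self.rotations):
            permuted = rotate_left(fp, r)
            table[permuted >> self.shift].append(permuted)

    def find(self, fp):
        """Return a stored fingerprint within k bits of fp, or None."""
        for table, r in zip(self.tables, self.rotations):
            permuted = rotate_left(fp, r)
            for candidate in table.get(permuted >> self.shift, ()):
                if bin(candidate ^ permuted).count("1") <= self.k:
                    return rotate_left(candidate, WIDTH - r)  # undo the rotation
        return None


tables = PermutedTables()
tables.add(0x1234000000000000)
hit = tables.find(0x1235000000000000)   # 1-bit difference inside the first block
miss = tables.find(0x12340000000000FF)  # 8-bit difference
```

With four 16-bit blocks and k = 3, at most three blocks can contain a differing bit, so at least one rotation brings an identical block to the front and the candidate is found. Because the same permutation is applied to both values, the bit-difference count is unchanged by it.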

22. A system comprising:
- a first table to store a plurality of fingerprints associated with a plurality of documents;
  
  a coordinator to:
  
  receive a plurality of new documents, determine fingerprints for the new documents, and broadcast the fingerprints; and
  
  a plurality of devices to receive the broadcast fingerprints and maintain the first table, where each of the devices is configured to:
  
  create a second table based on the broadcast fingerprints, determine whether a fingerprint in a set of the fingerprints from the first table is substantially similar to one of the broadcast fingerprints in the second table by:
  
  identifying ones of the broadcast fingerprints, in the second table, that include a sequence of bits that match a corresponding sequence of bits of the fingerprint, and determining whether bits, associated with one of the identified broadcast fingerprints in the second table, differ from bits associated with the fingerprint in at most k bit positions, where k is an integer greater than zero, and identify the one of the broadcast fingerprints as being associated with a near-duplicate document when a fingerprint in the set of the fingerprints from the first table is substantially similar to the one of the broadcast fingerprints in the second table.
  - 23. The system of claim 22, wherein when receiving a plurality of new documents, the coordinator is configured to fetch the plurality of new documents during a crawl operation.
  - 24. The system of claim 22, wherein the second table includes a plurality of second tables; and
    - wherein when creating a second table, each of the devices is configured to:
      
      apply a plurality of permutations to each of the broadcast fingerprints, and store, in each of the tables, the broadcast fingerprints to which a corresponding one of the permutations has been applied.
  - 25. The system of claim 22, wherein the coordinator is further configured to:
    - receive information from the devices regarding at least one of the broadcast fingerprints that is associated with a near-duplicate document, and discard the near-duplicate document without extracting outgoing links in the near-duplicate document.
  - 26. The system of claim 25, wherein the coordinator is further configured to:
    - receive information from the devices regarding at least one of the broadcast fingerprints that is associated with a near-duplicate document, and add at least one other one of the broadcast fingerprints to the first table.
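
The coordinator and device arrangement of claims 22 through 26 can be sketched as a single batch round. This is a simplified, single-process illustration under assumptions of my own: each device holds a shard of the first table as a plain list, the second table is keyed by the leading 16 bits of a 64-bit fingerprint, k = 3, and no permutations are applied (claim 24 adds them). All function names are hypothetical.

```python
def hamming_distance(a, b):
    return bin(a ^ b).count("1")


def build_second_table(broadcast, prefix_bits=16, width=64):
    """Index the broadcast fingerprints by their leading bits."""
    shift = width - prefix_bits
    table = {}
    for fp in broadcast:
        table.setdefault(fp >> shift, []).append(fp)
    return table


def device_scan(first_table_shard, broadcast, k=3, prefix_bits=16, width=64):
    """Run on each device: flag broadcast fingerprints near one in its shard."""
    shift = width - prefix_bits
    second_table = build_second_table(broadcast, prefix_bits, width)
    flagged = set()
    for stored in first_table_shard:
        for candidate in second_table.get(stored >> shift, ()):
            if hamming_distance(stored, candidate) <= k:
                flagged.add(candidate)
    return flagged


def coordinator_round(shards, new_fingerprints, k=3):
    """Broadcast to every shard, then split results into near-duplicates and fresh ones."""
    flagged = set().union(*(device_scan(shard, new_fingerprints, k) for shard in shards))
    fresh = [fp for fp in new_fingerprints if fp not in flagged]
    return flagged, fresh


shards = [[0xAAAA000000000000], [0xBBBB000000000000]]
new_fingerprints = [0xAAAA000000000001, 0xCCCC000000000000]
flagged, fresh = coordinator_round(shards, new_fingerprints)
```

In this sketch the documents behind `flagged` would be discarded without link extraction (claim 25), and the fingerprints in `fresh` are the ones a coordinator could add to the first table (claim 26).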

27. A computer-readable memory device that includes computer-executable instructions, comprising:
- instructions for fetching a document;
  
  instructions for generating a fingerprint for the fetched document;
  
  instructions for applying a plurality of permutations to the fingerprint to generate a plurality of permuted fingerprints;
  
  instructions for comparing each of the permuted fingerprints to a set of stored fingerprints, corresponding to a plurality of other documents, to identify one or more of the stored fingerprints that include bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the permuted fingerprint, where the sequence of bit positions is less than all of the bit positions in the identified fingerprints;
  
  instructions for determining whether one of the identified fingerprints differs from one of the permuted fingerprints in less than a threshold number of bits; and
  
  instructions for identifying the fetched document as a near-duplicate of one of the other documents when the one of the identified fingerprints differs from the one of the permuted fingerprints in less than the threshold number of bits.

Current Assignee
Google LLC (Alphabet Inc.)
Original Assignee
Google Inc. (Alphabet Inc.)
Inventors
Jain, Arvind; Manku, Gurmeet Singh
Primary Examiner(s)
Rones, Charles
Assistant Examiner(s)
Choi, Yuk Ting

Application Number

US11/094,791
Time in Patent Office

2,546 Days
Field of Search

None
US Class Current

707/706
CPC Class Codes

G06F 16/9014 hash tables

G06F 16/951 Indexing; Web crawling tech...
