Near-duplicate document detection for web crawling

  • US 8,140,505 B1
  • Filed: 03/31/2005
  • Issued: 03/20/2012
  • Est. Priority Date: 03/31/2005
  • Status: Active Grant
First Claim

1. A method performed by one or more devices, the method comprising:

    crawling, by at least one of the one or more devices, a document;

    determining, by at least one of the one or more devices, a fingerprint for the crawled document;

    identifying, by at least one of the one or more devices, fingerprints of a set of stored fingerprints with bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the fingerprint, where the sequence of bit positions is less than all of the bit positions in the identified fingerprints;

    determining, by at least one of the one or more devices, whether one of the identified fingerprints is substantially similar to the fingerprint based on whether the one of the identified fingerprints differs from the fingerprint in at most k bits, where k is an integer greater than zero; and

    identifying, by at least one of the one or more devices, the crawled document as a near-duplicate of another document when the one of the identified fingerprints is substantially similar to the fingerprint.
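
The lookup steps recited in the claim can be illustrated with a minimal sketch. It is not the patented implementation. It assumes 64-bit integer fingerprints that have already been computed (the claim does not say how), and it assumes the matched "sequence of bit positions" is the leading 16 bits. The names FingerprintIndex, prefix_key, and hamming_distance are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

FINGERPRINT_BITS = 64  # assumed width; the claim does not fix one


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def prefix_key(fp: int, prefix_bits: int) -> int:
    """Return the leading prefix_bits of a fingerprint (a proper subset of its bits)."""
    return fp >> (FINGERPRINT_BITS - prefix_bits)


class FingerprintIndex:
    """Hypothetical store that buckets fingerprints by their leading bits."""

    def __init__(self, prefix_bits: int = 16):
        self.prefix_bits = prefix_bits
        self.buckets = defaultdict(list)

    def add(self, fp: int) -> None:
        self.buckets[prefix_key(fp, self.prefix_bits)].append(fp)

    def find_near_duplicate(self, fp: int, k: int):
        # Step 1: identify stored fingerprints whose leading bits match exactly.
        candidates = self.buckets.get(prefix_key(fp, self.prefix_bits), [])
        # Step 2: keep a candidate only if it differs in at most k bits overall.
        for stored in candidates:
            if hamming_distance(stored, fp) <= k:
                return stored
        return None


index = FingerprintIndex(prefix_bits=16)
stored = 0xABCD0000000000FF
index.add(stored)

query = stored ^ 0b101  # same leading bits, differs in two low-order bits
match = index.find_near_duplicate(query, k=3)
print(match == stored)
```

A single bucket table like this one misses stored fingerprints whose differing bits fall inside the matched prefix. The sketch only shows the two-stage structure of an exact match on a subset of bit positions followed by a check on the number of differing bits.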
