
Near-duplicate document detection for web crawling

  • US 8,548,972 B1
  • Filed: 03/16/2012
  • Issued: 10/01/2013
  • Est. Priority Date: 03/31/2005
  • Status: Expired due to Fees
First Claim

1. A method performed by one or more devices, the method comprising:

  • determining, by at least one of the one or more devices, a first fingerprint of a first document;

    identifying, by at least one of the one or more devices, a second fingerprint of a second document, the second fingerprint being different than the first fingerprint;

    determining, by at least one of the one or more devices, whether the second fingerprint includes bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the first fingerprint, the sequence of bit positions being fewer than all of the bit positions in the second fingerprint;

    identifying, by at least one of the one or more devices, the first document as a near-duplicate of the second document based on:

      determining that the second fingerprint includes bits in a sequence of bit positions that match bits in a corresponding sequence of bit positions of the first fingerprint, and

      determining that the second fingerprint differs from the first fingerprint in at most k bits, k being an integer greater than zero; and

    processing, by at least one of the one or more devices, the first document or the second document based on identifying the first document as a near-duplicate of the second document.
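
The two conditions recited in the claim (a matching run of bit positions, plus an overall difference of at most k bits) can be illustrated with a minimal Python sketch. This is not the patented implementation: the 64-bit fingerprint width, the choice of which bit positions to compare, and the names `hamming_distance`, `bits_match`, and `is_near_duplicate` are all assumptions made for illustration, and the claim does not say how the fingerprints themselves are computed.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def bits_match(a: int, b: int, start: int, length: int, width: int = 64) -> bool:
    """True if a and b agree on `length` consecutive bit positions.

    Positions are counted from the most significant bit (position 0).
    The width of 64 bits is an assumption, not taken from the claim.
    """
    shift = width - start - length
    mask = (1 << length) - 1
    return ((a >> shift) & mask) == ((b >> shift) & mask)


def is_near_duplicate(fp1: int, fp2: int, k: int,
                      start: int, length: int, width: int = 64) -> bool:
    """Hypothetical check mirroring the two conditions of the claim."""
    if k <= 0:
        raise ValueError("k must be an integer greater than zero")
    # The compared sequence must cover fewer than all bit positions.
    if not (0 < length < width) or start < 0 or start + length > width:
        raise ValueError("bit sequence must be a proper sub-range of the fingerprint")
    if fp1 == fp2:
        # The claim recites fingerprints that differ from one another.
        return False
    return (bits_match(fp1, fp2, start, length, width)
            and hamming_distance(fp1, fp2) <= k)


fp_a = 0xFFFF0000FFFF0000
fp_b = fp_a ^ 0b101        # differs from fp_a in two low-order bits
fp_c = fp_a ^ (1 << 63)    # differs from fp_a in the top bit only

print(is_near_duplicate(fp_a, fp_b, k=3, start=0, length=16))
print(is_near_duplicate(fp_a, fp_c, k=3, start=0, length=16))
```

In this sketch the pair `fp_a`, `fp_b` passes both conditions, while `fp_a`, `fp_c` fails the sequence match even though the two fingerprints differ in only one bit. A plausible reason for recites both conditions is that the cheap sequence comparison can narrow the set of candidates before the full bit-difference count is taken, although the claim text itself does not state that motivation.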
