
DETECTING DUPLICATE AND NEAR-DUPLICATE FILES

  • US 20080162478A1
  • Filed: 03/15/2008
  • Published: 07/03/2008
  • Est. Priority Date: 01/24/2001
  • Status: Active Grant
First Claim

1. A computer-implemented method for determining whether two documents are near-duplicates, the computer-implemented method comprising:

    a) for each of the two documents, generating at least two different fingerprints; and

    b) determining whether or not the two documents are near-duplicate documents by

        1) determining whether or not any one of the at least two fingerprints of a first of the two documents matches any one of the at least two fingerprints of a second of the two documents, and

        2) if it is determined that any one fingerprint of the at least two fingerprints of the first of the two documents does match any one fingerprint of the at least two fingerprints of the second of the two documents, then concluding that the two documents are near-duplicates; and

    c) using the determination of whether or not the two documents are near-duplicates in at least one of (A) an act of serving search results corresponding to documents, (B) an act of crawling documents, (C) an act of indexing documents, and (D) an act of fixing a broken link to at least one of the two documents.
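
The steps of claim 1 can be illustrated with a minimal Python sketch. The claim as quoted here does not say how the fingerprints are generated, so the scheme below is an assumption made only for illustration: each word is assigned to one of `NUM_LISTS` lists by hashing the word, and each non-empty list is hashed with SHA-1 to give one fingerprint. The names `fingerprints`, `near_duplicates`, `drop_near_duplicates`, and `NUM_LISTS` are hypothetical and do not come from the patent text.

```python
import hashlib
import re

NUM_LISTS = 4  # assumption: number of fingerprints attempted per document


def _sha1_hex(data: str) -> str:
    """Return the SHA-1 digest of a string as hex."""
    return hashlib.sha1(data.encode("utf-8")).hexdigest()


def fingerprints(text: str, num_lists: int = NUM_LISTS) -> set:
    """Step (a): generate several different fingerprints for one document.

    Assumed scheme: every word goes to the list selected by a hash of the
    word itself, so the same word always lands in the same list. Empty
    lists are skipped so that they cannot cause spurious matches, which
    means a very short document may yield fewer than two fingerprints.
    """
    lists = [[] for _ in range(num_lists)]
    for token in re.findall(r"\w+", text.lower()):
        bucket = int(_sha1_hex(token), 16) % num_lists
        lists[bucket].append(token)
    return {_sha1_hex(" ".join(tokens)) for tokens in lists if tokens}


def near_duplicates(doc_a: str, doc_b: str) -> bool:
    """Step (b): near-duplicates if ANY fingerprint of one document
    matches ANY fingerprint of the other."""
    return not fingerprints(doc_a).isdisjoint(fingerprints(doc_b))


def drop_near_duplicates(results: list) -> list:
    """Step (c), option (A): when serving results, keep only the first
    document of each group of near-duplicates."""
    kept, seen = [], set()
    for doc in results:
        fps = fingerprints(doc)
        if fps.isdisjoint(seen):
            kept.append(doc)
            seen |= fps
    return kept
```

Under this assumed scheme, changing a single word alters at most two of the lists (the one that loses the old word and the one that gains the new word), so the fingerprints of the remaining lists still match and the two versions are reported as near-duplicates. Two documents that share no words produce no matching fingerprints.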
