Detecting duplicate and near-duplicate files

  • US 8,015,162 B2
  • Filed: 08/04/2006
  • Issued: 09/06/2011
  • Est. Priority Date: 08/04/2006
  • Status: Active Grant
First Claim

1. A computer-implemented method comprising:

    crawling documents accessible on a network to identify a set of documents, each document in the set of documents comprising a set of token sequence bit strings;

    processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique, wherein the first document similarity technique is token order dependent and token frequency independent;

    processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique, wherein the second document similarity technique is token order independent and token frequency dependent, and the second set of near-duplicate documents are determined based on a first threshold value;

    processing the set of documents to identify a third set of near-duplicate documents using the second document similarity technique, and wherein the third set of near-duplicate documents are identified based on a second threshold value greater than the first threshold value; and

    removing a final set of near-duplicate documents from the set of documents, and then indexing any remaining documents in the set of documents, wherein the final set of near-duplicate documents is a union of the second set of near-duplicate documents and the third set of near-duplicate documents.
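The filtering steps of the claim can be illustrated with a minimal sketch. The claim does not name the two similarity techniques, so everything concrete here is an assumption chosen only because it fits the stated properties: Jaccard similarity over sets of k-token shingles stands in for the first technique (token order dependent, token frequency independent), and cosine similarity over token counts stands in for the second (token order independent, token frequency dependent). Documents are plain lists of string tokens rather than token sequence bit strings, crawling and indexing are omitted, and the function names, the threshold values, and the rule of keeping the earlier document of each pair are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def shingles(tokens, k=3):
    """Set of k-token windows: depends on token order, ignores frequency."""
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}


def shingle_jaccard(tokens_a, tokens_b):
    """Assumed first technique: Jaccard similarity of shingle sets."""
    a, b = shingles(tokens_a), shingles(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cosine(tokens_a, tokens_b):
    """Assumed second technique: cosine of token-count vectors."""
    fa, fb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(fa[t] * fb[t] for t in fa.keys() & fb.keys())
    norm = sqrt(sum(v * v for v in fa.values())) * sqrt(sum(v * v for v in fb.values()))
    return dot / norm if norm else 0.0


def matching_pairs(docs, pairs, score, threshold):
    """Keep the index pairs whose similarity score reaches the threshold."""
    return {(a, b) for a, b in pairs if score(docs[a], docs[b]) >= threshold}


def dedupe(docs, shingle_threshold=0.5, t1=0.8, t2=0.95):
    """Return indices of the documents that would remain to be indexed."""
    assert t2 > t1, "the second threshold must exceed the first"
    all_pairs = list(combinations(range(len(docs)), 2))
    # First set: candidates from the order-dependent technique.
    first = matching_pairs(docs, all_pairs, shingle_jaccard, shingle_threshold)
    # Second set: candidates confirmed by the frequency-dependent technique at t1.
    second = matching_pairs(docs, first, cosine, t1)
    # Third set: stricter threshold t2 applied to the whole document set.
    third = matching_pairs(docs, all_pairs, cosine, t2)
    # Final set: union of both; assumption: drop the later document of each pair.
    removed = {b for _, b in second | third}
    return [i for i in range(len(docs)) if i not in removed]


example_docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the quick brown fox jumps over the lazy cat".split(),
    "dog lazy the over jumps fox brown quick the".split(),
    "completely unrelated text about patents and claims".split(),
]
print(dedupe(example_docs))  # [0, 3]
```

In this example the second document differs from the first by one token, so it passes the shingle check and is confirmed at the lower threshold. The third document is the first with its tokens reversed, so it shares no shingles with it but is caught by the stricter pass over the whole set.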
