
SYSTEM AND METHOD FOR NEAR AND EXACT DE-DUPLICATION OF DOCUMENTS

  • US 20090276467A1
  • Filed: 04/30/2008
  • Published: 11/05/2009
  • Est. Priority Date: 04/30/2008
  • Status: Active Grant
First Claim

1. A method for identifying near and exact-duplicate documents in a document collection, the method comprising:

  • for each document in the collection:

    reading textual content from the document;

    filtering the textual content based on user settings;

    determining N most frequent words from the filtered textual content of the document;

    performing a quorum search of the N most frequent words in the document with a threshold M; and

    sorting results from the quorum search based on relevancy, whereby, based on the values of N and M, near and exact-duplicate documents are identified in the document collection.
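
The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation, only one possible reading of the claim under stated assumptions: the "user settings" are modeled as a hypothetical stop-word list and minimum word length, the quorum search is approximated by counting how many of the N query words occur in each document's vocabulary, and "relevancy" is taken to be that match count. All function names and sample documents are invented for illustration.

```python
import re
from collections import Counter

# Assumption: "user settings" are modeled as a stop-word list and a minimum word length.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is"})
MIN_LEN = 2


def filter_text(text, stop_words=STOP_WORDS, min_len=MIN_LEN):
    """Lowercase, tokenize, and drop words excluded by the (hypothetical) settings."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if len(w) >= min_len and w not in stop_words]


def top_n_words(words, n):
    """Return the n most frequent words; ties are broken alphabetically."""
    counts = Counter(words)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def quorum_search(query_words, index, m):
    """Return (doc_id, matches) for documents containing at least m query words.

    Results are sorted by match count, highest first (the assumed relevancy).
    """
    hits = []
    for doc_id, vocabulary in index.items():
        matches = sum(1 for w in query_words if w in vocabulary)
        if matches >= m:
            hits.append((doc_id, matches))
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits


def find_duplicates(docs, n, m):
    """For each document, list the other documents that satisfy the quorum."""
    filtered = {doc_id: filter_text(text) for doc_id, text in docs.items()}
    index = {doc_id: set(words) for doc_id, words in filtered.items()}
    results = {}
    for doc_id, words in filtered.items():
        query = top_n_words(words, n)
        hits = quorum_search(query, index, m)
        results[doc_id] = [(other, score) for other, score in hits if other != doc_id]
    return results


# Invented sample collection: "a" and "b" differ by a single word.
docs = {
    "a": "The quick brown fox jumps over the lazy dog",
    "b": "The quick brown fox jumped over the lazy dog",
    "c": "Quarterly revenue grew in the third quarter",
}

near = find_duplicates(docs, n=5, m=4)    # tolerate one mismatching word
strict = find_duplicates(docs, n=5, m=5)  # require all N words to match
```

In this sketch, setting M equal to N demands that every one of the N most frequent words be present, which approximates exact-duplicate detection, while lowering M admits near duplicates. A document with fewer than N distinct words after filtering yields a shorter query, so a large M may be unreachable for very short documents.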
