Method and apparatus for efficient identification of duplicate and near-duplicate documents and text spans using high-discriminability text fragments

  • US 6,978,419 B1
  • Filed: 11/15/2000
  • Issued: 12/20/2005
  • Est. Priority Date: 11/15/2000
  • Status: Expired due to Fees
First Claim

1. A computer-assisted method for identifying duplicate and near-duplicate documents in a large collection of documents, comprising the steps of:

  • initially, selecting distinctive features contained in the collection of documents,
  • then, for each document, identifying the distinctive features contained in the document, and
  • then, for each pair of documents having at least one distinctive feature in common, comparing the distinctive features of the documents to determine whether the documents are duplicate or near-duplicate documents,
  • wherein the distinctive features are text fragments, which are sequences of at least two words that appear in a limited number of documents in the document collection,
  • wherein the text fragments are determined to be distinctive features based upon a function of the frequency of a text fragment within a document in the large collection of documents,
  • wherein for each sequence of at least two words, a distinctiveness score is calculated, and the highest scoring sequences that are found in at least two documents in the document collection are considered distinctive text fragments,
  • wherein the distinctiveness score is the reciprocal of the number of documents containing the text fragment multiplied by a monotonic function of the number of words in the text fragment.
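
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Several details are assumptions made only for illustration: the function names (`word_ngrams`, `distinctive_fragments`, `near_duplicate_pairs`), whitespace tokenization, the fragment length range of two to three words, `math.log2` as the monotonic function of fragment length, the `top_k` cutoff standing in for "highest scoring", and a Jaccard overlap with a `threshold` as the final comparison. The claim itself leaves the monotonic function and the comparison unspecified.

```python
from collections import defaultdict
from itertools import combinations
import math


def word_ngrams(text, min_n=2, max_n=3):
    """Yield every sequence of min_n..max_n consecutive words as a tuple."""
    words = text.lower().split()  # assumption: simple whitespace tokenization
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])


def distinctive_fragments(docs, top_k=50, length_weight=math.log2):
    """Return the top_k fragments mapped to the ids of documents containing them.

    Score = (1 / number of documents containing the fragment)
            * length_weight(number of words in the fragment).
    length_weight is an assumed monotonic function; the claim does not name one.
    """
    containing = defaultdict(set)  # fragment -> ids of documents containing it
    for doc_id, text in docs.items():
        for frag in word_ngrams(text):
            containing[frag].add(doc_id)

    scores = {
        frag: length_weight(len(frag)) / len(ids)
        for frag, ids in containing.items()
        if len(ids) >= 2  # a fragment must occur in at least two documents
    }
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return {frag: containing[frag] for frag in best}


def near_duplicate_pairs(docs, threshold=0.5, top_k=50):
    """Compare only document pairs that share at least one distinctive fragment."""
    frags = distinctive_fragments(docs, top_k=top_k)

    # Identify the distinctive fragments contained in each document.
    features = defaultdict(set)
    for frag, ids in frags.items():
        for doc_id in ids:
            features[doc_id].add(frag)

    # Candidate pairs come from fragment postings, not from all-pairs comparison.
    candidates = set()
    for ids in frags.values():
        candidates.update(combinations(sorted(ids), 2))

    result = []
    for a, b in sorted(candidates):
        # Assumed comparison: Jaccard overlap of the two feature sets.
        overlap = len(features[a] & features[b]) / len(features[a] | features[b])
        if overlap >= threshold:
            result.append((a, b, overlap))
    return result


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps",
        "b": "the quick brown fox leaps",
        "c": "an entirely unrelated sentence here",
    }
    print(near_duplicate_pairs(docs))
```

Because candidate pairs are generated from the documents listed under each distinctive fragment, documents with no fragment in common are never compared, which is where the efficiency in the patent's title would come from in a sketch like this. The sketch does not enforce an upper bound on how many documents a fragment may appear in; the reciprocal document count merely ranks widely shared fragments lower.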
