System for similar document detection

  • US 7,660,819 B1
  • Filed: 07/31/2000
  • Issued: 02/09/2010
  • Est. Priority Date: 07/31/2000
  • Status: Expired due to Fees
First Claim

1. A method for detecting similar documents using a computer, comprising:

  • obtaining, using the computer, a document;

    parsing, using the computer, the document to remove formatting and to obtain a token stream, the token stream comprising a plurality of tokens;

    retaining, using the computer, only retained tokens in the token stream by using at least one token threshold;

    reordering, using the computer, the retained tokens to obtain an arranged token stream;

    processing, using the computer, in turn each retained token in the arranged token stream using a hash algorithm to obtain a single hash value for the document;

    generating, using the computer, a document identifier for the document;

    forming, using the computer, a single tuple for the document, the tuple comprising the document identifier for the document and the hash value for the document;

    inserting, using the computer, the tuple for the document into a document storage tree, the document storage tree comprising a plurality of tuples, each tuple located at a bucket of the document storage tree, each tuple in the plurality of tuples representing one of a plurality of documents, each tuple in the plurality of tuples comprising a document identifier and a hash value; and

    determining, using the computer, if the tuple for the document is co-located with another tuple at a same bucket in the document storage tree, and detecting if the document is similar to another document represented by the another tuple in a computer readable recording medium storing the document storage tree.
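The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions the claim leaves open: the token threshold is taken to be a minimum token length (`MIN_TOKEN_LENGTH`, hypothetical), "reordering" is taken to mean sorting the distinct retained tokens, SHA-1 stands in for the unspecified hash algorithm, and a plain dictionary keyed by hash value stands in for the document storage tree and its buckets. All function names are illustrative.

```python
import hashlib
import re
from collections import defaultdict
from itertools import count

MIN_TOKEN_LENGTH = 4   # hypothetical token threshold; the claim does not fix one
_next_id = count(1)    # simple source of document identifiers (assumption)


def parse(document):
    """Remove markup-style formatting and return a lowercase token stream."""
    text = re.sub(r"<[^>]*>", " ", document)
    return re.findall(r"[a-z0-9]+", text.lower())


def retain(tokens, min_length=MIN_TOKEN_LENGTH):
    """Keep only tokens that meet the (assumed) length threshold."""
    return [t for t in tokens if len(t) >= min_length]


def arrange(tokens):
    """Reorder the retained tokens; here, distinct tokens in sorted order."""
    return sorted(set(tokens))


def hash_tokens(tokens):
    """Feed each token in turn to one hash object; return a single hash value."""
    digest = hashlib.sha1()
    for token in tokens:
        digest.update(token.encode("utf-8"))
        digest.update(b"\0")  # separator so token boundaries affect the hash
    return digest.hexdigest()


def insert_document(document, storage):
    """Insert a (document id, hash) tuple and report co-located documents.

    `storage` maps a hash value to a bucket (list) of tuples. It is a
    dictionary stand-in for the claimed document storage tree.
    """
    doc_id = next(_next_id)
    value = hash_tokens(arrange(retain(parse(document))))
    bucket = storage[value]
    similar = [other_id for other_id, _ in bucket]  # tuples already in this bucket
    bucket.append((doc_id, value))
    return doc_id, similar
```

Under these assumptions, two documents that differ only in formatting, word order, or short words land in the same bucket:

```python
store = defaultdict(list)
id_a, sim_a = insert_document("<p>The quick brown fox jumps over the lazy dog</p>", store)
id_b, sim_b = insert_document("Lazy dog, the quick brown FOX: jumps over!", store)
# sim_b contains id_a because both documents reduce to the same arranged token stream
```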
