System for similar document detection
Abstract
A document is compared to the documents in a document collection using a hash algorithm and collection statistics to detect if the document is similar to any of the documents in the document collection.
58 Citations
9 Claims
1. A method for detecting similar documents using a computer, comprising:
- obtaining, using the computer, a document;
- parsing, using the computer, the document to remove formatting and to obtain a token stream, the token stream comprising a plurality of tokens;
- retaining, using the computer, only retained tokens in the token stream by using at least one token threshold;
- reordering, using the computer, the retained tokens to obtain an arranged token stream;
- processing, using the computer, in turn each retained token in the arranged token stream using a hash algorithm to obtain a single hash value for the document;
- generating, using the computer, a document identifier for the document;
- forming, using the computer, a single tuple for the document, the tuple comprising the document identifier for the document and the hash value for the document;
- inserting, using the computer, the tuple for the document into a document storage tree, the document storage tree comprising a plurality of tuples, each tuple located at a bucket of the document storage tree, each tuple in the plurality of tuples representing one of a plurality of documents, each tuple in the plurality of tuples comprising a document identifier and a hash value; and
- determining, using the computer, if the tuple for the document is co-located with another tuple at a same bucket in the document storage tree, and detecting if the document is similar to another document represented by the another tuple in a computer readable recording medium storing the document storage tree.
Dependent claims: 2, 3, 4, 5, 6, 7
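The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not name the thresholds, the ordering, the hash algorithm, or the tree layout, so every such choice below is an assumption made for illustration. Specifically, `doc_freq` is a hypothetical table of collection frequencies, `min_df` and `max_df` are hypothetical token thresholds, the reordering is a plain sort of unique tokens, the hash is SHA-1, and a dictionary keyed by hash value stands in for the document storage tree and its buckets.

```python
import hashlib
import re
from collections import defaultdict


def parse(text):
    """Remove tag-like formatting and return a lowercase token stream."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z0-9]+", no_tags.lower())


def retain(tokens, doc_freq, min_df=2, max_df=100):
    """Keep tokens whose collection frequency lies within the thresholds.

    doc_freq, min_df and max_df are illustrative assumptions; tokens
    missing from doc_freq count as frequency 0 and are dropped.
    """
    return [t for t in tokens if min_df <= doc_freq.get(t, 0) <= max_df]


def document_hash(tokens):
    """Reorder the retained tokens and hash each in turn into one value."""
    h = hashlib.sha1()  # assumed algorithm; the claim only says "a hash algorithm"
    for tok in sorted(set(tokens)):  # assumed ordering: unique tokens, sorted
        h.update(tok.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()


class DocumentStore:
    """Dictionary of buckets standing in for the document storage tree."""

    def __init__(self):
        self.buckets = defaultdict(list)  # hash value -> list of tuples
        self.next_id = 0

    def insert(self, text, doc_freq):
        """Insert a document; return its tuple and any co-located tuples."""
        doc_id = self.next_id  # generated document identifier
        self.next_id += 1
        value = document_hash(retain(parse(text), doc_freq))
        doc_tuple = (doc_id, value)
        co_located = list(self.buckets[value])  # tuples already in this bucket
        self.buckets[value].append(doc_tuple)
        return doc_tuple, co_located
```

Under these assumptions, two documents that keep the same set of retained tokens produce the same hash value, land in the same bucket, and are therefore reported as similar, regardless of markup, case, word order, or tokens that fall outside the thresholds.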
8. A method for detecting similar documents using a computer, comprising:
- obtaining, using the computer, a document;
- filtering, using the computer, the document to eliminate tokens based on parts of speech and obtain a filtered document;
- generating, using the computer, a single tuple for the filtered document;
- comparing, using the computer, the tuple for the filtered document with a document storage structure comprising a plurality of tuples, each tuple in the plurality of tuples representing one of a plurality of documents; and
- determining, using the computer, if the tuple for the filtered document is clustered with another tuple in the document storage structure, and detecting if the document is similar to another document represented by the another tuple in a computer readable recording medium storing the document storage structure.
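The part-of-speech filtering recited in claim 8 can be sketched in the same way. This is an illustration under stated assumptions, not the claimed implementation: `POS_LEXICON` is a hypothetical toy lookup table standing in for a real part-of-speech tagger, keeping only nouns is an arbitrary choice (the claim does not say which parts of speech are eliminated), and "clustered" is approximated here by equality of SHA-1 hash values over the sorted surviving tokens.

```python
import hashlib

# Hypothetical toy lexicon; a real system would use a part-of-speech tagger.
POS_LEXICON = {
    "the": "DET", "a": "DET",
    "quick": "ADJ", "lazy": "ADJ",
    "fox": "NOUN", "dog": "NOUN",
    "jumps": "VERB", "jumped": "VERB",
    "over": "ADP",
}


def filter_by_pos(tokens, keep=frozenset({"NOUN"})):
    """Eliminate tokens whose tag is not in `keep`; unknown words are dropped."""
    return [t for t in tokens if POS_LEXICON.get(t) in keep]


def make_tuple(doc_id, text):
    """Build a single (document identifier, hash value) tuple for a document."""
    tokens = text.lower().split()
    kept = sorted(set(filter_by_pos(tokens)))
    digest = hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()
    return (doc_id, digest)


def find_clustered(new_tuple, stored_tuples):
    """Return identifiers of stored documents whose hash matches the new tuple."""
    return [doc_id for doc_id, digest in stored_tuples if digest == new_tuple[1]]
```

With this toy lexicon, "the quick fox jumps over the lazy dog" and "a fox jumped over a dog" both reduce to the nouns "dog" and "fox", so their tuples carry the same hash value and the two documents are flagged as similar.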
9. A computer-readable storage medium having a program stored therein for causing a computer to execute operations including detecting similar documents comprising:
- obtaining a document;
- filtering the document to eliminate tokens based on parts of speech and obtain a filtered document;
- generating a single tuple for the filtered document;
- comparing the tuple for the filtered document with a document storage structure comprising a plurality of tuples, each tuple in the plurality of tuples representing one of a plurality of documents; and
- determining if the tuple for the filtered document is clustered with another tuple in the document storage structure, based on the comparison, and detecting if the document is similar to another document represented by the another tuple in a computer readable recording medium storing the document storage structure.
Specification