Method for determining the resemblance of documents
First Claim
1. A method of determining the resemblance of a plurality of documents comprising the steps of:
- reducing a first of the plurality of documents into a first sequence of tokens;
reducing a second of the plurality of documents into a second sequence of tokens;
converting the first sequence of tokens to a first set of shingles;
converting the second sequence of tokens to a second set of shingles;
determining a first sketch of the first document from the first set of shingles;
determining a second sketch of the second document from the second set of shingles; and
comparing the first sketch and the second sketch so as to determine the resemblance between the first document and the second document.
12 Assignments
0 Petitions
Accused Products
Abstract
A method for facilitating the comparison of two computerized documents. The method includes loading a first document into a random access memory (RAM), loading a second document into the RAM, reducing the first document into a first sequence of tokens, reducing the second document into a second sequence of tokens, converting the first set of tokens to a first (multi)set of shingles, converting the second set of tokens to a second (multi)set of shingles, determining a first sketch of the first (multi)set of shingles, determining a second sketch of the second (multi)set of shingles, and comparing the first sketch and the second sketch. The sketches have a fixed size, independent of the size of the documents. The resemblance of two documents is provided using a sketch of each document. The sketches may be computed fairly fast and given two sketches the resemblance of the corresponding documents can be computed in linear time in the size of the sketches.
182 Citations
30 Claims
-
1. A method of determining the resemblance of a plurality of documents comprising the steps of:
-
reducing a first of the plurality of documents into a first sequence of tokens; reducing a second of the plurality of documents into a second sequence of tokens; converting the first sequence of tokens to a first set of shingles; converting the second sequence of tokens to a second set of shingles; determining a first sketch of the first document from the first set of shingles; determining a second sketch of the second document from the second set of shingles; and comparing the first sketch and the second sketch so as to determine the resemblance between the first document and the second document. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A method for determining the resemblance of a plurality of documents each including a respective plurality of elements, the method comprising the steps of:
-
generating a first representation of a first of the plurality of documents, the first representation including at least one proper subset of the elements of the first document; generating a second representation of a second of the plurality of documents, the second representation including at least one proper subset of the elements of the second document; and comparing the first representation and the second representation so as to determine the resemblance between the first document and the second document. - View Dependent Claims (8, 9, 10, 11, 12, 13, 14)
-
-
15. A system for determining the resemblance of a plurality of documents each including a respective plurality of elements, the system comprising:
-
at least one memory for storing the plurality of documents; and at least one processor coupled to the at least one memory configured to; generate a first representation of a first of the plurality of documents, the first representation including at least one proper subset of the elements of the first document; generate a second representation of a second of the plurality of documents, the second representation including at least one proper subset of the elements of the second document; and compare the first representation and the second representation so as to determine the resemblance between the first document and the second document. - View Dependent Claims (16, 17, 18, 19, 20, 21, 22)
-
-
23. An article of manufacture for determining the resemblance of a plurality of documents each including a respective plurality of elements, the article of manufacture comprising:
-
a computer readable storage medium; and computer programming stored on the storage medium;
wherein the stored computer programming is configured to be readable from the computer readable storage medium by at least one computer and thereby cause the at least one computer to operate so as to;generate a first representation of a first of the plurality of documents, the first representation including at least one proper subset of the elements of the first document; generate a second representation of a second of the plurality of documents, the second representation including at least one proper subset of the elements of the second document; and compare the first representation and the second representation so as to determine the resemblance between the first document and the second document. - View Dependent Claims (24, 25, 26, 27, 28, 29, 30)
-
Specification