METHODS AND SYSTEMS TO EFFICIENTLY FIND SIMILAR AND NEAR-DUPLICATE EMAILS AND FILES
Abstract
A set of trigrams can be generated for each document in a plurality of documents processed by an e-discovery system. Each trigram in the set of trigrams for a given document is a sequence of three terms in the given document. A set of trigrams for each similar document is then determined based on the set of trigrams for the original document. To facilitate identification of the similar documents, a full text index is then generated for the plurality of documents, and the set of trigrams for each document is indexed into the full text index as individual terms. Queries into the full text index can be generated from the trigrams of a document to find other similar or near-duplicate documents. After a set of potentially similar documents is identified, a separate distance criterion can be applied to efficiently evaluate the level of similarity between each candidate and the original document.
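The two-stage lookup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `TrigramIndex` class, the underscore-joined term format, the in-memory inverted index standing in for the full text index, and the use of Jaccard distance as the distance criterion are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set


def trigram_terms(text: str) -> Set[str]:
    """Join each run of three consecutive words into a single index term."""
    words = text.lower().split()
    return {"_".join(words[i:i + 3]) for i in range(len(words) - 2)}


class TrigramIndex:
    """Toy inverted index standing in for the full text index (hypothetical)."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)
        self.signatures: Dict[str, Set[str]] = {}

    def add(self, doc_id: str, text: str) -> None:
        """Index every trigram of the document as an individual term."""
        terms = trigram_terms(text)
        self.signatures[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def candidates(self, text: str, min_shared: int = 1) -> List[str]:
        """Stage 1: OR-style query over the trigrams, ranked by shared count."""
        hits: Counter = Counter()
        for term in trigram_terms(text):
            for doc_id in self.postings.get(term, ()):
                hits[doc_id] += 1
        return [doc_id for doc_id, n in hits.most_common() if n >= min_shared]

    def near_duplicates(self, text: str, max_distance: float = 0.5) -> List[str]:
        """Stage 2: keep candidates within an assumed Jaccard distance bound."""
        query = trigram_terms(text)
        result = []
        for doc_id in self.candidates(text):
            signature = self.signatures[doc_id]
            distance = 1 - len(query & signature) / len(query | signature)
            if distance <= max_distance:
                result.append(doc_id)
        return result
```

The index query only narrows the corpus to documents sharing at least one trigram; the more expensive distance computation then runs on that short candidate list rather than on every document.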
20 Claims
1. A method for generating document signatures, the method comprising:
receiving, at one or more computer systems, a plurality of documents, each document in the plurality of documents having a plurality of terms;
generating, with one or more processors associated with the one or more computer systems, a first set of trigrams for each document in the plurality of documents, each trigram in the first set of trigrams for a given document in the plurality of documents being a sequence in the given document of three terms in the plurality of terms of the given document;
determining, with one or more processors associated with the one or more computer systems, a second set of trigrams for each document in the plurality of documents based on the first set of trigrams for the document and first filter criteria, the second set of trigrams for a given document in the plurality of documents being a subset of the first set of trigrams for the given document and having one or more trigrams that satisfy the first filter criteria; and
storing, in a storage device associated with the one or more computer systems, the second set of trigrams for each document in the plurality of documents as offsets into the document for each term of each trigram in the second set of trigrams.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
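The generate, filter, and store steps recited in claim 1 might look like the sketch below. The claim does not say what the first filter criteria are, so the stop-word rule here is purely an assumption, as are the function names, the regular-expression tokenizer, and the choice of character offsets (rather than, say, term positions) as the stored form.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]             # (start, end) character offsets of one term
Trigram = Tuple[Span, Span, Span]  # three consecutive terms, stored as offsets

# Hypothetical stop-word list; the claim leaves the filter criteria open.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def term_spans(text: str) -> List[Span]:
    """Offsets of each term, where a term is assumed to be a run of word characters."""
    return [m.span() for m in re.finditer(r"\w+", text)]


def first_set(text: str) -> List[Trigram]:
    """Every sequence of three consecutive terms in the document."""
    spans = term_spans(text)
    return [(spans[i], spans[i + 1], spans[i + 2]) for i in range(len(spans) - 2)]


def passes_filter(text: str, trigram: Trigram) -> bool:
    """Assumed criteria: reject trigrams that begin or end with a stop word."""
    first, _, last = trigram
    return (text[first[0]:first[1]].lower() not in STOP_WORDS
            and text[last[0]:last[1]].lower() not in STOP_WORDS)


def signature(text: str) -> List[Trigram]:
    """Second set of trigrams: the filtered subset, kept as offsets only."""
    return [t for t in first_set(text) if passes_filter(text, t)]


def resolve(text: str, trigram: Trigram) -> Tuple[str, ...]:
    """Recover the three terms of a stored trigram from the document text."""
    return tuple(text[start:end] for start, end in trigram)
```

Because the signature holds only offsets, the trigram text is never copied; calling `resolve(text, trigram)` reads the terms back out of the original document when they are needed.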
10. A non-transitory computer-readable medium storing computer-executable code for generating document signatures, the computer-readable medium comprising:
code for receiving a plurality of documents, each document in the plurality of documents having a plurality of terms;
code for generating a first set of trigrams for each document in the plurality of documents, each trigram in the first set of trigrams for a given document in the plurality of documents being a sequence in the given document of three terms in the plurality of terms of the given document;
code for determining a second set of trigrams for each document in the plurality of documents based on the first set of trigrams for the document and first filter criteria, the second set of trigrams for a given document in the plurality of documents being a subset of the first set of trigrams for the given document and having one or more trigrams that satisfy the first filter criteria; and
code for storing the second set of trigrams for each document in the plurality of documents as offsets into the document for each term of each trigram in the second set of trigrams.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. An e-discovery system comprising:
a processor; and
a memory in communication with the processor and configured to store processor-executable instructions which configure the processor to:
receive a plurality of documents, each document in the plurality of documents having a plurality of terms;
generate a first set of trigrams for each document in the plurality of documents, each trigram in the first set of trigrams for a given document in the plurality of documents being a sequence in the given document of three terms in the plurality of terms of the given document;
determine a second set of trigrams for each document in the plurality of documents based on the first set of trigrams for the document and first filter criteria, the second set of trigrams for a given document in the plurality of documents being a subset of the first set of trigrams for the given document and having one or more trigrams that satisfy the first filter criteria;
store the second set of trigrams for each document in the plurality of documents as offsets into the document for each term of each trigram in the second set of trigrams;
generate, with the one or more processors associated with the one or more computer systems, a full text index for the plurality of documents;
index the second set of trigrams for each document in the plurality of documents into the full text index;
receive a first document;
determine a set of trigrams associated with the first document;
generate a query into the full text index for the plurality of documents based on the set of trigrams associated with the first document;
determine a first set of documents in the plurality of documents in response to executing the query on the full text index; and
generate one or more user interfaces configured for displaying information identifying selected ones of the first set of documents as substantially similar to the first document.
Dependent claims: 20
Specification