Methods and systems to efficiently find similar and near-duplicate emails and files
Abstract
A set of trigrams can be generated for each document in a plurality of documents processed by an e-discovery system. Each trigram in the set of trigrams for a given document is a sequence of three terms in the given document. A set of trigrams for each similar document is then determined based on the set of trigrams for the original document. To facilitate identification of the similar documents, a full text index is then generated for the plurality of documents, and the set of trigrams for each document is indexed into the full text index as individual terms. Queries can be generated into the full text index based on trigrams of a document to determine other similar or near-duplicate documents. After a set of potentially similar documents is identified, a separate distance criterion can be applied to evaluate the level of similarity between the two documents in an efficient way.
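The trigram approach in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the patented implementation: terms are lowercased whitespace tokens, the "full text index" is a plain in-memory dictionary, the distance criterion is Jaccard similarity over trigram sets, and the function names `trigrams`, `build_index`, and `near_duplicates` are hypothetical.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of three-term sequences in `text` (assumed: lowercased whitespace tokens)."""
    terms = text.lower().split()
    return {" ".join(terms[i:i + 3]) for i in range(len(terms) - 2)}


def build_index(docs):
    """Map each trigram, treated as one index term, to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def near_duplicates(doc_id, docs, index, threshold=0.5):
    """Query the index with one document's trigrams, then score candidates.

    Jaccard similarity stands in for the separate distance criterion (an assumption).
    """
    query = trigrams(docs[doc_id])
    candidates = set()
    for gram in query:
        candidates |= index.get(gram, set())
    candidates.discard(doc_id)
    results = []
    for other in candidates:
        other_grams = trigrams(docs[other])
        score = len(query & other_grams) / len(query | other_grams)
        if score >= threshold:
            results.append((other, score))
    # Highest score first; ties broken by document id for stable output.
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))


docs = {
    "a": "please review the attached contract today",
    "b": "please review the attached contract tomorrow",
    "c": "lunch is at noon on friday",
}
index = build_index(docs)
print(near_duplicates("a", docs, index))
```

The index lookup narrows the comparison to documents sharing at least one trigram, so the more expensive set comparison runs only on a small candidate list rather than on every document pair.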
20 Claims
1. A method for generating and using a semantic space in a computer system, comprising:
receiving, via at least one computer processor, a plurality of documents and a plurality of random document vectors, wherein each random document vector in the plurality of random document vectors is generated based on random indexing and is associated with a corresponding document in the plurality of documents;
selecting, via the at least one computer processor, a term for which to generate a term vector;
for a first document in the plurality of documents, determining, via the at least one computer processor, if the term appears in the document;
if the term appears in the document;
determining, via the at least one computer processor, a frequency of the term in the document; and
adding, via the at least one computer processor, an associated random document vector of the document to the term vector, wherein the associated random document vector is scaled by the term frequency;
determining, via the at least one computer processor, if the term appears in any remaining documents in the plurality of documents;
if the term does not appear in any remaining documents in the plurality of documents, generating, via the at least one computer processor, a normalized version of the term vector with the added associated random document vector;
outputting, via the at least one computer processor, the normalized term vector;
receiving a query; and
generating, based at least in part on the semantic space including the normalized version of the term vector, a user interface displaying similar and near-duplicate documents.
(Dependent claims: 2, 3, 4, 5, 6, 7)
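The term-vector steps recited in claim 1 can be sketched as follows. This is a rough illustration under stated assumptions, not the claimed implementation: the random document vectors are sparse ternary vectors (a common choice in random indexing, not specified by the claim), term frequency is a raw count of lowercased whitespace tokens, normalization is to unit Euclidean length, and the names `random_document_vector` and `term_vector` are hypothetical.

```python
import math
import random


def random_document_vector(dim, nonzeros, rng):
    """Sparse ternary vector: a few +1/-1 entries, zeros elsewhere (assumed encoding)."""
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def term_vector(term, docs, doc_vectors):
    """Add each containing document's random vector, scaled by term frequency, then normalize."""
    dim = len(next(iter(doc_vectors.values())))
    acc = [0.0] * dim
    for doc_id, text in docs.items():
        freq = text.lower().split().count(term)
        if freq == 0:
            continue  # the term does not appear in this document
        for i, value in enumerate(doc_vectors[doc_id]):
            acc[i] += freq * value
    # Once no documents remain, normalize to unit length (assumed: Euclidean norm).
    norm = math.sqrt(sum(x * x for x in acc))
    return [x / norm for x in acc] if norm else acc


rng = random.Random(0)  # fixed seed only so the example is repeatable
docs = {"d1": "contract review contract", "d2": "lunch menu", "d3": "contract signed"}
doc_vectors = {doc_id: random_document_vector(16, 4, rng) for doc_id in docs}
vec = term_vector("contract", docs, doc_vectors)
print(len(vec))
```

Terms that occur in the same documents accumulate the same random vectors, so their normalized term vectors end up close together. That closeness is what lets the resulting semantic space support similarity queries.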
8. A system for generating and using a semantic space in a computer system, comprising one or more computer processors configured to:
receive a plurality of documents and a plurality of random document vectors, wherein each random document vector in the plurality of random document vectors is generated based on random indexing and is associated with a corresponding document in the plurality of documents;
select a term for which to generate a term vector;
for a first document in the plurality of documents, determine if the term appears in the document; and
if the term appears in the document;
determine a frequency of the term in the document;
add an associated random document vector of the document to the term vector, wherein the associated random document vector is scaled by the term frequency;
determine if the term appears in any remaining documents in the plurality of documents;
if the term does not appear in any remaining documents in the plurality of documents, generate a normalized version of the term vector with the added associated random document vector;
output the normalized term vector;
receiving a query; and
generating, based at least in part on the semantic space including the normalized version of the term vector, a user interface displaying similar and near-duplicate documents.
(Dependent claims: 9, 10, 11, 12, 13, 14)
15. An article of manufacture for generating and using a semantic space in a computer system, the article of manufacture comprising:
at least one processor readable storage medium; and
instructions stored on the at least one medium;
wherein the instructions are configured to be readable from the at least one medium by at least one processor and thereby cause the at least one processor to operate so as to;
receive a plurality of documents and a plurality of random document vectors, wherein each random document vector in the plurality of random document vectors is generated based on random indexing and is associated with a corresponding document in the plurality of documents;
select a term for which to generate a term vector;
for a first document in the plurality of documents, determine if the term appears in the document; and
if the term appears in the document;
determine a frequency of the term in the document;
add an associated random document vector of the document to the term vector, wherein the associated random document vector is scaled by the term frequency;
determine if the term appears in any remaining documents in the plurality of documents;
if the term does not appear in any remaining documents in the plurality of documents, generate a normalized version of the term vector with the added associated random document vector; and
output the normalized term vector;
receive a query; and
generate, based at least in part on the semantic space including the normalized version of the term vector, a user interface displaying similar and near-duplicate documents.
(Dependent claims: 16, 17, 18, 19, 20)
Specification