Methods and systems to efficiently find similar and near-duplicate emails and files

US 10,083,176 B1
Filed: 02/29/2016
Issued: 09/25/2018
Est. Priority Date: 01/23/2006
Status: Active Grant

Abstract

A set of trigrams can be generated for each document in a plurality of documents processed by an e-discovery system. Each trigram in the set of trigrams for a given document is a sequence of three terms in the given document. A set of trigrams for each similar document is then determined based on the set of trigrams for the original document. To facilitate identification of the similar documents, a full text index is then generated for the plurality of documents, and the set of trigrams for each document is indexed into the full text index as individual terms. Queries can be generated into the full text index based on trigrams of a document to determine other similar or near-duplicate documents. After a set of potentially similar documents is identified, a separate distance criterion can be applied to evaluate the level of similarity between two documents in an efficient way.
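
The trigram approach in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`trigrams`, `build_index`, `find_similar`), the whitespace tokenizer, the in-memory dictionary standing in for a full text index, and the use of Jaccard similarity as the "separate distance criterion" are all assumptions made for illustration.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of three-term sequences in text.

    Assumption: terms are lowercased whitespace-separated tokens.
    """
    terms = text.lower().split()
    return {" ".join(terms[i:i + 3]) for i in range(len(terms) - 2)}


def build_index(docs):
    """Index every trigram as an individual term: trigram -> set of doc ids.

    Assumption: a plain dict stands in for the full text index.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def jaccard(a, b):
    """Jaccard similarity of two sets (hypothetical choice of distance criterion)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_similar(doc_id, docs, index, threshold=0.5):
    """Query the index with a document's trigrams, then score the candidates."""
    query = trigrams(docs[doc_id])
    candidates = set()
    for gram in query:
        candidates |= index.get(gram, set())
    candidates.discard(doc_id)
    scored = [(other, jaccard(query, trigrams(docs[other]))) for other in candidates]
    kept = [(other, score) for other, score in scored if score >= threshold]
    # Highest similarity first; ties broken by document id for stable output.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))
```

The index lookup only narrows the search to documents sharing at least one trigram; the similarity score is computed afterwards on that smaller candidate set, which mirrors the two-stage flow the abstract describes.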

20 Claims

1. A method for generating and using a semantic space in a computer system, comprising:
- receiving, via at least one computer processor, a plurality of documents and a plurality of random document vectors, wherein each random document vector in the plurality of random document vectors is generated based on random indexing and is associated with a corresponding document in the plurality of documents;
  
  selecting, via the at least one computer processor, a term for which to generate a term vector;
  
  for a first document in the plurality of documents, determining, via the at least one computer processor, if the term appears in the document;
  
  if the term appears in the document:
  
  determining, via the at least one computer processor, a frequency of the term in the document; and
  
  adding, via the at least one computer processor, an associated random document vector of the document to the term vector, wherein the associated random document vector is scaled by the term frequency;
  
  determining, via the at least one computer processor, if the term appears in any remaining documents in the plurality of documents;
  
  if the term does not appear in any remaining documents in the plurality of documents, generating, via the at least one computer processor, a normalized version of the term vector with the added associated random document vector;
  
  outputting, via the at least one computer processor, the normalized term vector;
  
  receiving a query; and
  
  generating, based at least in part on the semantic space including the normalized version of the term vector, a user interface displaying similar and near-duplicate documents.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1, wherein the term is selected from a plurality of terms from a corpus of the plurality of documents.
  - 3. The method of claim 2, wherein the plurality of terms do not include terms with low Inverse Document Frequency and do not include terms with low global term frequency associated with the plurality of documents.
  - 4. The method of claim 2, wherein the plurality of terms do not include language specific characters.
  - 5. The method of claim 2, wherein the plurality of terms do not include terms with low global term frequency associated with the plurality of documents.
  - 6. The method of claim 1, wherein the plurality of documents are partitioned and each partition is processed independently.
  - 7. The method of claim 1, wherein at least one of the plurality of random document vectors include a plurality of floating point values.
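
The term-vector steps recited in claim 1 can be sketched in Python as follows. This is an illustrative reading of the claim language, not the patented implementation: the names `random_document_vector` and `term_vector`, the dense uniform floating point vectors, the whitespace tokenizer, and the choice of Euclidean (L2) normalization are assumptions.

```python
import math
import random


def random_document_vector(dim, seed):
    """Generate a random vector of floating point values for one document.

    Assumption: dense values drawn uniformly from [-1, 1]; random indexing
    is often done with sparse ternary vectors instead.
    """
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def term_vector(term, docs, doc_vectors, dim):
    """Build a normalized term vector from the documents containing the term."""
    vec = [0.0] * dim
    for doc_id, text in docs.items():
        # Frequency of the term in this document (zero if it does not appear).
        freq = text.lower().split().count(term)
        if freq == 0:
            continue
        # Add the document's random vector, scaled by the term frequency.
        rv = doc_vectors[doc_id]
        for i in range(dim):
            vec[i] += freq * rv[i]
    # Once no documents remain, normalize (assumption: L2 norm).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec
```

For example, with hand-picked document vectors `[1.0, 0.0]` and `[0.0, 1.0]` and term frequencies 3 and 4, the accumulated vector is `[3.0, 4.0]` and the normalized term vector is `[0.6, 0.8]`.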

8. A system for generating and using a semantic space in a computer system, comprising one or more computer processors configured to:
- receive a plurality of documents and a plurality of random document vectors, wherein each random document vector in the plurality of random document vectors is generated based on random indexing and is associated with a corresponding document in the plurality of documents;
  
  select a term for which to generate a term vector;
  
  for a first document in the plurality of documents, determine if the term appears in the document; and
  
  if the term appears in the document:
  
  determine a frequency of the term in the document;
  
  add an associated random document vector of the document to the term vector, wherein the associated random document vector is scaled by the term frequency;
  
  determine if the term appears in any remaining documents in the plurality of documents;
  
  if the term does not appear in any remaining documents in the plurality of documents, generate a normalized version of the term vector with the added associated random document vector;
  
  output the normalized term vector;
  
  receiving a query; and
  
  generating, based at least in part on the semantic space including the normalized version of the term vector, a user interface displaying similar and near-duplicate documents.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The system of claim 8, wherein the term is selected from a plurality of terms from a corpus of the plurality of documents.
  - 10. The system of claim 9, wherein the plurality of terms do not include terms with low Inverse Document Frequency.
  - 11. The system of claim 9, wherein the plurality of terms do not include language specific characters.
  - 12. The system of claim 9, wherein the plurality of terms do not include terms with low global term frequency associated with the plurality of documents.
  - 13. The system of claim 8, wherein the plurality of documents are partitioned and each partition is processed independently.
  - 14. The system of claim 8, wherein at least one of the plurality of random document vectors include a plurality of floating point values.
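
The term selection described in the dependent claims (excluding terms with low Inverse Document Frequency or low global term frequency) might look like the sketch below. The function name `select_terms`, the thresholds `min_idf` and `min_global_tf`, the natural-log IDF formula, and the whitespace tokenizer are hypothetical; the claims do not specify any of them.

```python
import math
from collections import Counter


def select_terms(docs, min_idf=0.3, min_global_tf=2):
    """Return the sorted terms that pass both filters.

    Assumptions: IDF is computed as ln(N / document frequency), and both
    thresholds are illustrative values, not taken from the patent.
    """
    n_docs = len(docs)
    global_tf = Counter()  # total occurrences of each term across the corpus
    doc_freq = Counter()   # number of documents containing each term
    for text in docs.values():
        terms = text.lower().split()
        global_tf.update(terms)
        doc_freq.update(set(terms))

    selected = []
    for term, tf in global_tf.items():
        idf = math.log(n_docs / doc_freq[term])
        # Drop terms that appear almost everywhere (low IDF) or almost never
        # (low global term frequency).
        if idf >= min_idf and tf >= min_global_tf:
            selected.append(term)
    return sorted(selected)
```

A term that occurs in every document has an IDF of zero and is dropped, as is a term that occurs only once in the whole corpus; only the remaining terms would then have term vectors generated for them.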

15. An article of manufacture for generating and using a semantic space in a computer system, the article of manufacture comprising:
- at least one processor readable storage medium; and
  
  instructions stored on the at least one medium;
  
  wherein the instructions are configured to be readable from the at least one medium by at least one processor and thereby cause the at least one processor to operate so as to:
  
  receive a plurality of documents and a plurality of random document vectors, wherein each random document vector in the plurality of random document vectors is generated based on random indexing and is associated with a corresponding document in the plurality of documents;
  
  select a term for which to generate a term vector;
  
  for a first document in the plurality of documents, determine if the term appears in the document; and
  
  if the term appears in the document:
  
  determine a frequency of the term in the document;
  
  add an associated random document vector of the document to the term vector, wherein the associated random document vector is scaled by the term frequency;
  
  determine if the term appears in any remaining documents in the plurality of documents;
  
  if the term does not appear in any remaining documents in the plurality of documents, generate a normalized version of the term vector with the added associated random document vector; and
  
  output the normalized term vector;
  
  receive a query; and
  
  generate, based at least in part on the semantic space including the normalized version of the term vector, a user interface displaying similar and near-duplicate documents.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The article of manufacture of claim 15, wherein the term is selected from a plurality of terms from a corpus of the plurality of documents.
  - 17. The article of manufacture of claim 16, wherein the plurality of terms do not include terms with low Inverse Document Frequency and do not include terms with low global term frequency associated with the plurality of documents.
  - 18. The article of manufacture of claim 16, wherein the plurality of terms do not include language specific characters.
  - 19. The article of manufacture of claim 16, wherein the plurality of terms do not include terms with low global term frequency associated with the plurality of documents.
  - 20. The article of manufacture of claim 15, wherein the plurality of documents are partitioned and each partition is processed independently.

Current Assignee: Veritas Technologies, LLC (Whitehouse Group Ltd.)
Original Assignee: Veritas Technologies, LLC (Whitehouse Group Ltd.)
Inventors: Desai, Malay; Shewale, Medha; Rangan, Venkat
Primary Examiner(s): Hasan, Syed Haroon
Application Number: US15/056,616
Time in Patent Office: 939 Days
CPC Class Codes

G06F 16/31   Indexing; Data structures t...

G06F 16/316   Indexing structures

G06F 16/334   Query execution G06F16/335 ...

G06F 16/3344   using natural language anal...

G06F 16/3347   using vector based model

G06F 16/93   Document management systems
