METHODS AND SYSTEMS TO EFFICIENTLY FIND SIMILAR AND NEAR-DUPLICATE EMAILS AND FILES

US 20120209853A1
Filed: 02/16/2011
Published: 08/16/2012
Est. Priority Date: 01/23/2006
Status: Active Grant

Abstract

A set of trigrams can be generated for each document in a plurality of documents processed by an e-discovery system. Each trigram in the set of trigrams for a given document is a sequence of three terms in the given document. A set of trigrams for each similar document is then determined based on the set of trigrams for the original document. To facilitate identification of the similar documents, a full text index is then generated for the plurality of documents, and the set of trigrams for each document is indexed into the full text index as individual terms. Queries can be generated into the full text index based on trigrams of a document to determine other similar or near-duplicate documents. After a set of potentially similar documents is identified, a separate distance criterion can be applied to evaluate the level of similarity between two documents in an efficient way.
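
As an illustration of the flow the abstract describes, the following is a minimal Python sketch, not the patented implementation. It assumes whitespace tokenization and a plain in-memory inverted index standing in for the full text index; the function names `trigrams`, `build_index`, and `candidates` are hypothetical.

```python
from collections import defaultdict

def trigrams(text):
    """Return every sequence of three consecutive terms in the text."""
    terms = text.lower().split()
    return [tuple(terms[i:i + 3]) for i in range(len(terms) - 2)]

def build_index(docs):
    """Index each trigram as a single term mapped to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tri in trigrams(text):
            index[tri].add(doc_id)
    return index

def candidates(index, text):
    """Query the index with the trigrams of `text` and count shared trigrams per document."""
    hits = defaultdict(int)
    for tri in set(trigrams(text)):
        for doc_id in index.get(tri, ()):
            hits[doc_id] += 1
    # Most shared trigrams first; ties broken by document id.
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))

docs = {
    "a": "please review the attached contract today",
    "b": "please review the attached invoice today",
    "c": "lunch is at noon",
}
index = build_index(docs)
result = candidates(index, "please review the attached contract")
```

Here `result` ranks document `a` ahead of `b` and omits `c`, which shares no trigram with the query. A separate similarity measure would then be applied to the returned candidates only.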

20 Claims

1. A method for generating document signatures, the method comprising:
- receiving, at one or more computer systems, a plurality of documents, each document in the plurality of documents having a plurality of terms;
  
  generating, with one or more processors associated with the one or more computer systems, a first set of trigrams for each document in the plurality of documents, each trigram in the first set of trigrams for a given document in the plurality of documents being a sequence in the given document of three terms in the plurality of terms of the given document;
  
  determining, with one or more processors associated with the one or more computer systems, a second set of trigrams for each document in the plurality of documents based on the first set of trigrams for the document and first filter criteria, the second set of trigrams for a given document in the plurality of documents being a subset of the first set of trigrams for the given document and having one or more trigrams that satisfy the first filter criteria; and
  
  storing, in a storage device associated with the one or more computer systems, the second set of trigrams for each document in the plurality of documents as offsets into the document for each term of each trigram in the second set of trigrams.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
- - 2. The method of claim 1, further comprising:
    - generating, with the one or more processors associated with the one or more computer systems, a full text index for the plurality of documents; and
      
      indexing, with the one or more processors associated with the one or more computer systems, the second set of trigrams for each document in the plurality of documents into the full text index.
  - 3. The method of claim 2, further comprising:
    - generating, with the one or more processors associated with the one or more computer systems, an artificial distance between at least some of the trigrams in the second set of trigrams for each document in the plurality of documents prior to indexing into the full text index.
  - 4. The method of claim 1, further comprising:
    - receiving, at the one or more computer systems, a first document;
      
      determining, with the one or more processors associated with the one or more computer systems, a set of trigrams associated with the first document;
      
      generating, with the one or more processors associated with the one or more computer systems, a query into a full text index for the plurality of documents indexing the second set of trigrams for each document in the plurality of documents based on the set of trigrams associated with the first document; and
      
      determining, with the one or more processors associated with the one or more computer systems, a first set of documents in the plurality of documents in response to executing the query on the full text index indexing the second set of trigrams for each document in the plurality of documents.
  - 5. The method of claim 4 wherein determining, with the one or more processors associated with the one or more computer systems, the first set of documents in the plurality of documents in response to executing the query on the full text index indexing the second set of trigrams for each document in the plurality of documents comprises determining the first set of documents based on second filter criteria.
  - 6. The method of claim 4 wherein determining, with the one or more processors associated with the one or more computer systems, the first set of documents in the plurality of documents in response to executing the query on the full text index indexing the second set of trigrams for each document in the plurality of documents comprises determining the first set of documents as a subset of a second set of documents returned in response to executing the query based on a cosine similarity between the first document and each document in the second set of documents.
  - 7. The method of claim 1, further comprising:
    - generating, with the one or more processors associated with the one or more computer systems, one or more user interfaces configured for displaying information identifying selected ones of the plurality of documents as substantially similar to a first document selected via the one or more user interfaces in response to a query into a full text index for the plurality of documents indexing the second set of trigrams for each document in the plurality of documents based on a set of trigrams associated with the first document.
  - 8. The method of claim 1, further comprising:
    - generating, with the one or more processors associated with the one or more computer systems, an attachment match bitmap identifying which parts of selected ones of the plurality of documents correspond to an actual match with a first document, the selected ones of the plurality of documents identified as substantially similar to the first document based on a set of trigrams associated with the first document and a query into a full text index for the plurality of documents indexing the second set of trigrams for each document in the plurality of documents; and
      
      generating, with the one or more processors associated with the one or more computer systems, one or more user interfaces configured for displaying information identifying the selected ones of the plurality of documents as substantially similar to the first document and displaying which parts of the selected ones of the plurality of documents correspond to an actual match with the first document.
  - 9. The method of claim 1 wherein determining, with the one or more processors associated with the one or more computer systems, the second set of trigrams for each document in the plurality of documents based on the first set of trigrams for the document and first filter criteria comprises filtering the first set of trigrams for each document to determine the second set of trigrams for the document as a predetermined number of most frequently occurring trigrams.
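
The signature steps of claims 1 and 9 (generate trigrams, keep a predetermined number of the most frequent ones, store them as per-term offsets) can be sketched as follows. This is a hedged illustration only: whitespace tokenization, character offsets, the default `top_n=2`, and the names `tokenize_with_offsets`, `signature`, and `resolve` are all assumptions not taken from the claims.

```python
from collections import Counter

def tokenize_with_offsets(text):
    """Split on whitespace and record the character offset where each term starts."""
    terms, pos = [], 0
    for term in text.split():
        start = text.index(term, pos)
        terms.append((term.lower(), start))
        pos = start + len(term)
    return terms

def signature(text, top_n=2):
    """Keep the top_n most frequent trigrams and store each as three term offsets."""
    terms = tokenize_with_offsets(text)
    # First set: every run of three consecutive (term, offset) pairs.
    grams = [tuple(terms[i:i + 3]) for i in range(len(terms) - 2)]
    counts = Counter(tuple(t for t, _ in g) for g in grams)
    # Filter criterion (assumed): frequency; equal counts keep first-seen order.
    keep = {tri for tri, _ in counts.most_common(top_n)}
    # Second set, stored as offsets into the document rather than as text.
    return [tuple(off for _, off in g) for g in grams
            if tuple(t for t, _ in g) in keep]

def resolve(text, offsets):
    """Recover the terms of a stored trigram from its offsets."""
    return tuple(text[o:].split(None, 1)[0].lower() for o in offsets)

text = "to be or not to be or not"
sig = signature(text)
```

Storing offsets instead of the trigram text keeps the signature compact, and `resolve(text, sig[1])` recovers the terms `('be', 'or', 'not')` when they are needed.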

10. A non-transitory computer-readable medium storing computer-executable code for generating document signatures, the computer-readable medium comprising:
- code for receiving a plurality of documents, each document in the plurality of documents having a plurality of terms;
  
  code for generating a first set of trigrams for each document in the plurality of documents, each trigram in the first set of trigrams for a given document in the plurality of documents being a sequence in the given document of three terms in the plurality of terms of the given document;
  
  code for determining a second set of trigrams for each document in the plurality of documents based on the first set of trigrams for the document and first filter criteria, the second set of trigrams for a given document in the plurality of documents being a subset of the first set of trigrams for the given document and having one or more trigrams that satisfy the first filter criteria; and
  
  code for storing the second set of trigrams for each document in the plurality of documents as offsets into the document for each term of each trigram in the second set of trigrams.
- View Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18)
- - 11. The computer-readable medium of claim 10, further comprising:
    - code for generating a full text index for the plurality of documents; and
      
      code for indexing the second set of trigrams for each document in the plurality of documents into the full text index.
  - 12. The computer-readable medium of claim 11, further comprising:
    - code for generating an artificial distance between at least some of the trigrams in the second set of trigrams for each document in the plurality of documents prior to indexing into the full text index.
  - 13. The computer-readable medium of claim 10, further comprising:
    - code for receiving a first document;
      
      code for determining a set of trigrams associated with the first document;
      
      code for generating a query into a full text index for the plurality of documents indexing the second set of trigrams for each document in the plurality of documents based on the set of trigrams associated with the first document; and
      
      code for determining a first set of documents in the plurality of documents in response to executing the query on the full text index indexing the second set of trigrams for each document in the plurality of documents.
  - 14. The computer-readable medium of claim 13 wherein the code for determining the first set of documents in the plurality of documents in response to executing the query on the full text index indexing the second set of trigrams for each document in the plurality of documents comprises code for determining the first set of documents based on second filter criteria.
  - 15. The computer-readable medium of claim 13 wherein the code for determining the first set of documents in the plurality of documents in response to executing the query on the full text index indexing the second set of trigrams for each document in the plurality of documents comprises code for determining the first set of documents as a subset of a second set of documents returned in response to executing the query based on a cosine similarity between the first document and each document in the second set of documents.
  - 16. The computer-readable medium of claim 10, further comprising:
    - code for generating one or more user interfaces configured for displaying information identifying selected ones of the plurality of documents as substantially similar to a first document selected via the one or more user interfaces in response to a query into a full text index for the plurality of documents indexing the second set of trigrams for each document in the plurality of documents based on a set of trigrams associated with the first document.
  - 17. The computer-readable medium of claim 10, further comprising:
    - code for generating an attachment match bitmap identifying which parts of selected ones of the plurality of documents correspond to an actual match with a first document, the selected ones of the plurality of documents identified as substantially similar to the first document based on a set of trigrams associated with the first document and a query into a full text index for the plurality of documents indexing the second set of trigrams for each document in the plurality of documents; and
      
      code for generating one or more user interfaces configured for displaying information identifying the selected ones of the plurality of documents as substantially similar to the first document and displaying which parts of the selected ones of the plurality of documents correspond to an actual match with the first document.
  - 18. The computer-readable medium of claim 10 wherein the code for determining the second set of trigrams for each document in the plurality of documents based on the first set of trigrams for the document and first filter criteria comprises code for filtering the first set of trigrams for each document to determine the second set of trigrams for the document as a predetermined number of most frequently occurring trigrams.
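
Claims 6 and 15 narrow the documents returned by the query to a subset based on cosine similarity with the first document. The sketch below shows one way this could look, assuming raw term-frequency vectors and a hypothetical cutoff of 0.8; neither the weighting scheme nor the threshold is stated in the claims, and `cosine` and `refine` are illustrative names.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def refine(first_doc, returned, threshold=0.8):
    """Keep only the returned documents whose cosine similarity meets the threshold."""
    return [doc_id for doc_id, text in returned.items()
            if cosine(first_doc, text) >= threshold]

first = "please review the attached contract today"
returned = {
    "b": "please review the attached invoice today",
    "c": "please call me today",
}
similar = refine(first, returned)
```

Because the comparison runs only over documents the index query already returned, the more expensive vector computation is never applied to the whole collection.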

19. An e-discovery system comprising:
- a processor; and
  
  a memory in communication with the processor and configured to store processor-executable instructions which configure the processor to:
  
  receive a plurality of documents, each document in the plurality of documents having a plurality of terms;
  
  generate a first set of trigrams for each document in the plurality of documents, each trigram in the first set of trigrams for a given document in the plurality of documents being a sequence in the given document of three terms in the plurality of terms of the given document;
  
  determine a second set of trigrams for each document in the plurality of documents based on the first set of trigrams for the document and first filter criteria, the second set of trigrams for a given document in the plurality of documents being a subset of the first set of trigrams for the given document and having one or more trigrams that satisfy the first filter criteria;
  
  store the second set of trigrams for each document in the plurality of documents as offsets into the document for each term of each trigram in the second set of trigrams;
  
  generate, with the one or more processors associated with the one or more computer systems, a full text index for the plurality of documents;
  
  index the second set of trigrams for each document in the plurality of documents into the full text index;
  
  receive a first document;
  
  determine a set of trigrams associated with the first document;
  
  generate a query into the full text index for the plurality of documents based on the set of trigrams associated with the first document;
  
  determine a first set of documents in the plurality of documents in response to executing the query on the full text index; and
  
  generate one or more user interfaces configured for displaying information identifying selected ones of the first set of documents as substantially similar to the first document.
- View Dependent Claims (20)
- - 20. The e-discovery system of claim 19 wherein the processor is further configured to:
    - filter the first set of trigrams for each document to determine the second set of trigrams for the document as a predetermined number of most frequently occurring trigrams;
      
      generate an artificial distance between at least some of the trigrams in the second set of trigrams for each document in the plurality of documents prior to indexing into the full text index;
      
      determine the selected ones of the first set of documents based on a cosine similarity between the first document and each document in the first set of documents;
      
      generate an attachment match bitmap identifying which parts of the selected ones of the first set of documents correspond to an actual match with the first document; and
      
      generate the one or more user interfaces configured for displaying which parts of the selected ones of the first set of documents correspond to an actual match with the first document.
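
The match bitmap recited in claims 8, 17, and 20 identifies which parts of a similar document actually match the first document. The claims do not fix the granularity of the bitmap, so the sketch below assumes one bit per term, set when the term belongs to a trigram that also occurs in the first document. The name `match_bitmap` and the whitespace tokenization are hypothetical.

```python
def match_bitmap(first_doc, other_doc):
    """Return one bit per term of other_doc: 1 if the term lies in a shared trigram."""
    first = first_doc.lower().split()
    shared = {tuple(first[i:i + 3]) for i in range(len(first) - 2)}
    terms = other_doc.lower().split()
    bits = [0] * len(terms)
    for i in range(len(terms) - 2):
        if tuple(terms[i:i + 3]) in shared:
            # Mark all three terms covered by the matching trigram.
            bits[i] = bits[i + 1] = bits[i + 2] = 1
    return bits

first = "please review the attached contract today"
other = "hello please review the attached invoice today"
bits = match_bitmap(first, other)
```

A user interface could use such a bitmap to highlight the matching run ("please review the attached") and leave the remaining terms unmarked.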

Current Assignee
Veritas Technologies, LLC (Whitehouse Group Ltd.)
Original Assignee
Clearwell Systems Inc (Gen Digital Inc.)
Inventors
Rangan, Venkat; Desai, Malay; Shewale, Medha

Granted Patent

US 9,275,129 B2
US Class Current

707/741
CPC Class Codes

G06F 16/31   Indexing; Data structures t...

G06F 16/316   Indexing structures

G06F 16/334   Query execution G06F16/335 ...

G06F 16/3344   using natural language anal...

G06F 16/3347   using vector based model

G06F 16/93   Document management systems
