METHODS AND SYSTEMS FOR AUTOMATIC EVALUATION OF ELECTRONIC DISCOVERY REVIEW AND PRODUCTIONS

US 20120296891A1
Filed: 05/18/2011
Published: 11/22/2012
Est. Priority Date: 01/23/2006
Status: Active Grant

Abstract

Techniques are provided for automatic sampling evaluation. An automatic sampling evaluation system enables users to evaluate convergence of one or more search processes. For example, given a set of searches that were validated by human review, a system can implement a retrieval process that samples one or more non-retrieved collections. Each individual document's similarity in the one or more non-retrieved collections is automatically evaluated to other documents in any retrieved sets. Given a goal of achieving a high recall, documents with high similarity can then be analyzed for additional noun phrases that may be used for a next iteration of a search. Convergence can be expected if the information gain in the new feedback loop is less than previous iterations, and if the additional documents identified are below a certain threshold document count.
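
The convergence test in the last sentence of the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Iteration` record, the `has_converged` function, and the `max_new_documents` parameter are hypothetical names, and how information gain is measured is left open because the abstract does not define it.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Iteration:
    """One pass of the search feedback loop (hypothetical record)."""
    information_gain: float  # assumed to be supplied by the caller
    new_documents: int       # additional documents identified in this pass


def has_converged(history: Sequence[Iteration], max_new_documents: int) -> bool:
    """Apply the two conditions named in the abstract.

    Convergence is expected when the latest information gain is lower than
    in every earlier iteration AND the latest count of additional documents
    is below the threshold document count.
    """
    if len(history) < 2:
        return False  # nothing earlier to compare against
    *earlier, latest = history
    gain_is_lower = all(
        latest.information_gain < it.information_gain for it in earlier
    )
    return gain_is_lower and latest.new_documents < max_new_documents


history = [Iteration(0.40, 120), Iteration(0.15, 45), Iteration(0.02, 6)]
print(has_converged(history, max_new_documents=10))  # True
```

Both conditions must hold, so a pass that still surfaces many documents does not count as converged even when its information gain has dropped.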

344 Citations

20 Claims

1. A computer-implemented method for evaluating a search process, the method comprising:
- receiving, at the one or more computer systems, information identifying in a collection of documents a first set of documents that satisfy search criteria associated with a first search;
  
  determining, with one or more processor associated with the one or more computer systems, a document feature vector for each document in the first set of documents;
  
  receiving, with the one or more processors associated with the one or more computer systems, information identifying in the documents in the collection of documents that do not satisfy the search criteria associated with the first search a second set of documents that satisfy first sampling criteria;
  
  determining, with the one or more processor associated with the one or more computer systems, a document feature vector for each document in the second set of documents;
  
  determining, with the one or more processor associated with the one or more computer systems, whether a second search of the collection results in new document gain based on the document feature vector for each document in the first set of documents and the document feature vector for at least one document in the second set of documents; and
  
  generating, with the one or more processor associated with the one or more computer systems, information indicative of whether the second search of the collection results in new document gain.
  - 2. The method of claim 1 wherein determining, with the one or more processor associated with the one or more computer systems, whether the second search of the collection results in new document gain comprises determining that the document feature vector of the at least one document in the second set of documents satisfies similarity criteria associated with a document feature vector generated to represent all documents in the first set of documents.
  - 3. The method of claim 2 wherein determining that the document feature vector of the at least one document in the second set of documents satisfies the similarity criteria associated with the document feature vector generated to represent all documents in the first set of documents further comprises determining that the similarity criteria is satisfied by a predetermined threshold likely to increase the number of documents produced in the second search.
  - 4. The method of claim 1 wherein determining, with the one or more processor associated with the one or more computer systems, whether the second search of the collection results in new document gain comprises determining that the document feature vector of the at least one document in the second set of documents satisfies similarity criteria associated with the document feature vector for at least one document in the first set of documents.
  - 5. The method of claim 1 further comprising:
    - determining, with the one or more processor associated with the one or more computer systems, a set of noun phrases associated with the second search based on the at least one document in the second set of documents; and
      
      generating, with the one or more processor associated with the one or more computer systems, search criteria associated with the second search based on the search criteria associated with the first search and the determined set of noun phrases.
  - 6. The method of claim 5 further comprising:
    - determining, with the one or more processor associated with the one or more computer systems, whether a third search of the collection results in new document gain based on a document feature vector generated for each document in a third set of documents that satisfy the search criteria associated with the second search and a document feature vector generated for at least one document in a fourth set of documents identified in the documents in the collection of documents that do not satisfy the search criteria associated with the second search that satisfy second sampling criteria; and
      
      generating, with the one or more processor associated with the one or more computer systems, information indicative of whether the third search of the collection results in new document gain.
  - 7. The method of claim 1 wherein determining, with the one or more processor associated with the one or more computer systems, the document feature vector for each document in the first set of documents comprises:
    - determining a plurality of term feature vectors for the document; and
      
      generating the document feature vector for the document based on each term vector in the plurality of term vectors.
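
Claims 1, 2, 4, and 7 can be read together as a small pipeline: build a document feature vector from term vectors, then compare sampled non-retrieved documents against the retrieved set. The sketch below is one possible reading under stated assumptions. The claims do not fix a term weighting, a similarity measure, or a threshold, so raw term counts, cosine similarity, and the `threshold=0.5` default are all assumptions, and every function name is hypothetical.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]


def term_vector(term: str) -> Vector:
    """Assumed term feature vector: a one-hot weight for the lowercased term."""
    return {term.lower(): 1.0}


def document_vector(text: str) -> Vector:
    """Combine the term vectors of a document into one vector (claim 7)."""
    vec: Vector = {}
    for term in text.split():
        for key, weight in term_vector(term).items():
            vec[key] = vec.get(key, 0.0) + weight
    return vec


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    """One vector representing all retrieved documents (claim 2)."""
    total: Vector = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    n = len(vectors)
    return {t: w / n for t, w in total.items()} if n else {}


def indicates_new_document_gain(
    retrieved: List[Vector],
    sampled: List[Vector],
    threshold: float = 0.5,   # assumed value, not taken from the claims
    use_centroid: bool = True,
) -> bool:
    """True if any sampled non-retrieved document is similar enough to the
    retrieved set, compared against its centroid (claim 2) or against
    individual retrieved documents (claim 4)."""
    targets = [centroid(retrieved)] if use_centroid else retrieved
    return any(cosine(s, t) >= threshold for s in sampled for t in targets)


retrieved = [document_vector("merger agreement draft"),
             document_vector("merger term sheet")]
sampled = [document_vector("revised merger agreement"),
           document_vector("lunch menu friday")]
print(indicates_new_document_gain(retrieved, sampled))  # True
```

Comparing against a centroid costs one similarity computation per sampled document, while the per-document mode of claim 4 can catch a sampled document that resembles a single outlier in the retrieved set.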

8. A non-transitory computer-readable medium storing computer-executable code for evaluating a search process, the non-transitory computer-readable medium comprising:
- code for receiving information identifying in a collection of documents a first set of documents that satisfy search criteria associated with a first search;
  
  code for determining a document feature vector for each document in the first set of documents;
  
  code for receiving information identifying in the documents in the collection of documents that do not satisfy the search criteria associated with the first search a second set of documents that satisfy first sampling criteria;
  
  code for determining a document feature vector for each document in the second set of documents;
  
  code for determining whether a second search of the collection results in new document gain based on the document feature vector for each document in the first set of documents and the document feature vector for at least one document in the second set of documents; and
  
  code for generating information indicative of whether the second search of the collection results in new document gain.
  - 9. The non-transitory computer-readable medium of claim 8 wherein the code for determining whether the second search of the collection results in new document gain comprises code for determining that the document feature vector of the at least one document in the second set of documents satisfies similarity criteria associated with a document feature vector generated to represent all documents in the first set of documents.
  - 10. The non-transitory computer-readable medium of claim 9 wherein the code for determining that the document feature vector of the at least one document in the second set of documents satisfies the similarity criteria associated with the document feature vector generated to represent all documents in the first set of documents further comprises code for determining that the similarity criteria is satisfied by a predetermined threshold likely to increase the number of documents produced in the second search.
  - 11. The non-transitory computer-readable medium of claim 8 wherein the code for determining whether the second search of the collection results in new document gain comprises code for determining that the document feature vector of the at least one document in the second set of documents satisfies similarity criteria associated with the document feature vector for at least one document in the first set of documents.
  - 12. The non-transitory computer-readable medium of claim 8 further comprising:
    - code for determining a set of noun phrases associated with the second search based on the at least one document in the second set of documents; and
      
      code for generating search criteria associated with the second search based on the search criteria associated with the first search and the determined set of noun phrases.
  - 13. The non-transitory computer-readable medium of claim 12 further comprising:
    - code for determining whether a third search of the collection results in new document gain based on a document feature vector generated for each document in a third set of documents that satisfy the search criteria associated with the second search and a document feature vector generated for at least one document in a fourth set of documents identified in the documents in the collection of documents that do not satisfy the search criteria associated with the second search that satisfy second sampling criteria; and
      
      code for generating information indicative of whether the third search of the collection results in new document gain.
  - 14. The non-transitory computer-readable medium of claim 8 wherein the code for determining the document feature vector for each document in the first set of documents comprises:
    - code for determining a plurality of term feature vectors for the document; and
      
      code for generating the document feature vector for the document based on each term vector in the plurality of term vectors.
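
The first and third elements of claim 8 split the collection into documents that satisfy the first search and a sampled subset of those that do not. A minimal sketch follows, assuming simple random sampling because the claims do not say what the "first sampling criteria" are. The `split_and_sample` function, the `matches` predicate, and the sample collection are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple


def split_and_sample(
    collection: Dict[str, str],
    matches: Callable[[str], bool],
    sample_size: int,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Return (first_set, second_set) as lists of document ids.

    first_set:  documents that satisfy the search criteria.
    second_set: a random sample of the documents that do not (assumed
                sampling criteria; the claims leave this open).
    """
    first_set = sorted(d for d, text in collection.items() if matches(text))
    non_retrieved = sorted(
        d for d, text in collection.items() if not matches(text)
    )
    rng = random.Random(seed)  # seeded so a review can be reproduced
    k = min(sample_size, len(non_retrieved))
    second_set = rng.sample(non_retrieved, k)
    return first_set, second_set


collection = {
    "d1": "merger agreement draft",
    "d2": "lunch menu",
    "d3": "revised deal terms",
    "d4": "merger term sheet",
    "d5": "parking notice",
}
first_set, second_set = split_and_sample(
    collection, lambda text: "merger" in text.lower(), sample_size=2
)
print(first_set)   # ['d1', 'd4']
print(second_set)  # two ids drawn from d2, d3, d5
```

Seeding the generator is a design choice for auditability: the same sample can be regenerated if the evaluation is later questioned.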

15. A system for evaluating search process of electronic discovery investigations, the system comprising:
- a processor; and
  
  a memory configured to store a set of instructions which when executed by the processor configure the processor to:
  
  receive information identifying in a collection of documents a first set of documents that satisfy search criteria associated with a first search;
  
  determine a document feature vector for each document in the first set of documents;
  
  receive information identifying in the documents in the collection of documents that do not satisfy the search criteria associated with the first search a second set of documents that satisfy first sampling criteria;
  
  determine a document feature vector for each document in the second set of documents;
  
  determine whether a second search of the collection results in new document gain based on the document feature vector for each document in the first set of documents and the document feature vector for at least one document in the second set of documents; and
  
  generate information indicative of whether the second search of the collection results in new document gain.
  - 16. The system of claim 15 wherein to determine whether the second search of the collection results in new document gain the processor is configured to determine that the document feature vector of the at least one document in the second set of documents satisfies similarity criteria associated with a document feature vector generated to represent all documents in the first set of documents.
  - 17. The system of claim 16 wherein to determine that the document feature vector of the at least one document in the second set of documents satisfies the similarity criteria associated with the document feature vector generated to represent all documents in the first set of documents the processor is further configured to determine that the similarity criteria is satisfied by a predetermined threshold likely to increase the number of documents produced in the second search.
  - 18. The system of claim 15 wherein to determine whether the second search of the collection results in new document gain the processor is configured to determine that the document feature vector of the at least one document in the second set of documents satisfies similarity criteria associated with the document feature vector for at least one document in the first set of documents.
  - 19. The system of claim 15 wherein the processor is further configured to:
    - determine a set of noun phrases associated with the second search based on the at least one document in the second set of documents; and
      
      generate search criteria associated with the second search based on the search criteria associated with the first search and the determined set of noun phrases.
  - 20. The system of claim 19 wherein the processor is further configured to:
    - determine whether a third search of the collection results in new document gain based on a document feature vector generated for each document in a third set of documents that satisfy the search criteria associated with the second search and a document feature vector generated for at least one document in a fourth set of documents identified in the documents in the collection of documents that do not satisfy the search criteria associated with the second search that satisfy second sampling criteria; and
      
      generate information indicative of whether the third search of the collection results in new document gain.
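
Claims 5, 12, and 19 derive the second search from the first search criteria plus noun phrases found in similar non-retrieved documents. The sketch below illustrates only the data flow. Real noun-phrase extraction would need a part-of-speech tagger or chunker, which the claims do not specify, so `candidate_phrases` is a crude stand-in that keeps runs of two or more non-stopword tokens. The stopword list, both function names, and the representation of search criteria as a list of terms are assumptions.

```python
import re
from typing import List, Sequence

# Tiny illustrative stopword list (assumption, not from the patent).
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "to", "for", "and", "in", "on", "is", "was", "with"}
)


def candidate_phrases(text: str) -> List[str]:
    """Stand-in for noun-phrase extraction: runs of 2+ non-stopword tokens."""
    phrases: List[str] = []
    current: List[str] = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOPWORDS:
            if len(current) >= 2:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(token)
    if len(current) >= 2:
        phrases.append(" ".join(current))
    return phrases


def expand_criteria(first_criteria: Sequence[str],
                    phrases: Sequence[str]) -> List[str]:
    """Second search criteria = first criteria plus new phrases, no repeats."""
    expanded = list(first_criteria)
    for phrase in phrases:
        if phrase not in expanded:
            expanded.append(phrase)
    return expanded


similar_doc = "The revised merger agreement was sent to the escrow agent"
phrases = candidate_phrases(similar_doc)
print(phrases)  # ['revised merger agreement', 'escrow agent']
print(expand_criteria(["merger"], phrases))
# ['merger', 'revised merger agreement', 'escrow agent']
```

Repeating this step with the expanded criteria as the new starting point corresponds to the third-search evaluation of claims 6, 13, and 20.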

Current Assignee: Veritas Technologies, LLC (Whitehouse Group Ltd.)
Original Assignee: Clearwell Systems (NortonLifeLock Inc.)
Inventors: Rangan, Venkat
Granted Patent: US 9,600,568 B2
US Class Current: 707/722
CPC Class Codes: G06F 16/3347 using vector based model
