System and method for training data generation in predictive coding

US 9,607,272 B1
Filed: 03/15/2013
Issued: 03/28/2017
Est. Priority Date: 10/05/2012
Status: Active Grant

Abstract

A predictive coding system updates a plurality of training documents for an untrained classification model based on a plurality of additional documents. The plurality of additional documents are selected from a plurality of unlabeled documents based on a decision hyperplane associated with a first trained classification model. The predictive coding system provides the updated plurality of training documents to the untrained classification model to cause the untrained classification model to be retrained and to cause a second trained classification model to be generated.
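
The selection step relies on where each unlabeled document sits relative to the decision hyperplane. The following is a minimal sketch of that geometry, assuming a linear model with weight vector `w` and bias `b` and the usual SVM convention that the margin is the region where `|w.x + b| < 1`. The function names, the hyperplane, and the sample vectors are hypothetical and are not taken from the patent.

```python
import math

def decision_value(w, b, x):
    """Raw decision value w.x + b for a document vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def signed_distance(w, b, x):
    """Signed Euclidean distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return decision_value(w, b, x) / norm

def within_margin(w, b, docs):
    """Indices of documents that fall strictly inside the margin."""
    return [i for i, x in enumerate(docs) if abs(decision_value(w, b, x)) < 1.0]

# Hypothetical hyperplane 2*x0 - 1 = 0 and a small pool of unlabeled vectors.
w, b = [2.0, 0.0], -1.0
pool = [[0.5, 3.0], [0.9, -1.0], [2.0, 0.0], [-1.0, 1.0]]
candidates = within_margin(w, b, pool)
```

Documents inside the margin are the ones the current model is least certain about, which is why they are natural candidates for labeling.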

12 Claims

1. A method comprising:
- determining to improve an effectiveness measure of a first trained classification model, wherein the first trained model is trained using a set of training documents;
  
  selecting a plurality of unlabeled documents, wherein the plurality of unlabeled documents are not part of the set of training documents used to train the first trained classification model;
  
  generating a support vector based on a determination that one or more of the plurality of unlabeled documents are within a margin of a decision hyperplane associated with the first trained classification model;
  
  calculating, by a processor in a predictive coding system, an overall score for each unlabeled document of the plurality of unlabeled documents based on a distance of a respective unlabeled document to the decision hyperplane and an angle diversity of the respective unlabeled document;
  
  comparing, by the processor in the predictive coding system, the overall scores of the unlabeled documents to each other to select a pre-determined number of unlabeled documents having lowest scores in the plurality of unlabeled documents;
  
  updating, by the processor in the predictive coding system, the set of training documents used to train the first trained classification model by adding the pre-determined number of unlabeled documents having the lowest scores in the plurality of unlabeled documents to the set of training documents;
  
  updating the decision hyperplane based on the support vector;
  
  providing, by the predictive coding system, the updated set of training documents to the first trained classification model to improve the effectiveness measure of the first trained classification model by generating a second trained classification model from the updated set of training documents;
  
  identifying an effectiveness measure of the second trained classification model; and
  
  generating a third trained classification model based on a determination that the effectiveness measure of the second trained classification model has improved from the effectiveness measure of the first trained classification model.
- Dependent claims (2, 3, 4):
  - 2. The method of claim 1, wherein generating the second trained classification model further comprises:
    - upon determining to generate the second trained classification model, repeating, by the predictive coding system, the updating of the set of the training documents.
  - 3. The method of claim 1, further comprising calculating the effectiveness measure of the second trained classification model on a set of validation documents.
  - 4. The method of claim 1, wherein calculating the overall score for each unlabeled document comprises:
    - calculating the distance from the respective unlabeled document to the decision hyperplane;
      
      calculating the angle diversity value for the respective unlabeled document;
      
      applying a parameter value to the distance and the angle diversity value of the respective unlabeled document; and
      
      calculating the overall score for the respective unlabeled document based on a sum of the parameter value being applied to the distance and the parameter value being applied to the angle diversity value.
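
The scoring and selection steps of claims 1 and 4 can be sketched as follows. The claims do not give a formula, so this sketch makes several assumptions: the parameter value is a single weight `lam` applied as `lam` to the distance and `1 - lam` to the angle diversity value, and angle diversity is taken as the largest absolute cosine between a document and a set of reference vectors (here, documents already in the training set). All names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def angle_diversity(x, references):
    """Assumed measure: highest |cosine| to any reference vector.
    A low value means x points in a direction not yet covered."""
    return max((abs(cosine(x, r)) for r in references), default=0.0)

def overall_score(distance, diversity, lam):
    """Weighted sum of distance to the hyperplane and angle diversity."""
    return lam * abs(distance) + (1.0 - lam) * diversity

def lowest_scoring(docs, distances, references, k, lam=0.5):
    """Score every document once, then return the k lowest-scoring indices."""
    scores = [overall_score(d, angle_diversity(x, references), lam)
              for x, d in zip(docs, distances)]
    order = sorted(range(len(docs)), key=lambda i: scores[i])
    return order[:k], scores

# Hypothetical pool: vectors, their distances to the hyperplane, one reference.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
distances = [0.2, 0.2, 0.1]
picked, scores = lowest_scoring(docs, distances, [[1.0, 0.0]], k=2)
```

Under these assumptions a low score favors documents that are both close to the hyperplane and dissimilar in direction to what the model has already seen, which is consistent with the claims selecting the lowest scores.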

5. A non-transitory computer-readable storage medium having instructions that, when executed by a processing device, cause the processing device to perform operations comprising:
- determining to improve an effectiveness measure of a first trained classification model, wherein the first trained model is trained using a set of training documents;
  
  selecting a plurality of unlabeled documents, wherein the plurality of unlabeled documents are not part of the set of training documents used to train the first trained classification model;
  
  generating a support vector based on a determination that one or more of the plurality of unlabeled documents are within a margin of a decision hyperplane associated with the first trained classification model;
  
  calculating, by the processing device, an overall score for each unlabeled document based on a distance of a respective unlabeled document to the decision hyperplane and an angle diversity of the respective unlabeled document;
  
  comparing, by the processing device, the overall scores of the unlabeled documents to each other to select a pre-determined number of unlabeled documents having lowest scores in the plurality of unlabeled documents;
  
  updating the set of training documents used to train the first trained classification model by adding the pre-determined number of unlabeled documents having the lowest scores in the plurality of unlabeled documents to the set of training documents;
  
  updating the decision hyperplane based on the support vector;
  
  providing the updated set of training documents to the first trained classification model to improve the effectiveness measure of the first trained classification model by generating a second trained classification model from the updated set of training documents;
  
  identifying an effectiveness measure of the second trained classification model; and
  
  generating a third trained classification model based on a determination that the effectiveness measure of the second trained classification model has improved from the effectiveness measure of the first trained classification model.
- Dependent claims (6, 7, 8):
  - 6. The non-transitory computer-readable storage medium of claim 5, wherein generating the second trained classification model further comprises:
    - upon determining to generate the second trained classification model, repeating the updating of the set of the training documents.
  - 7. The non-transitory computer-readable storage medium of claim 5, further comprising:
    - calculating the effectiveness measure of the second trained classification model on a set of validation documents.
  - 8. The non-transitory computer-readable storage medium of claim 5, wherein calculating the overall score for each unlabeled document comprises:
    - calculating the distance from the respective unlabeled document to the decision hyperplane;
      
      calculating the angle diversity value for the respective unlabeled document;
      
      applying a parameter value to the distance and the angle diversity value of the respective unlabeled document; and
      
      calculating the overall score for the respective unlabeled document based on a sum of the parameter value being applied to the distance and the parameter value being applied to the angle diversity value.
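
Taken together, the operations in claims 5 through 8 describe an iterative loop: select documents, add them to the training set, retrain, measure effectiveness, and continue while the measure improves. The sketch below illustrates that control flow only. The callables `train`, `evaluate`, `select`, and `label` are hypothetical placeholders, the toy model is a one-dimensional threshold rather than an SVM, and stopping as soon as the measure fails to improve is an assumption, since the claims do not say what happens in that case.

```python
def refine(train, evaluate, select, label, training, pool, max_rounds=10):
    """Grow the training set in rounds while the effectiveness measure improves."""
    training, pool = list(training), list(pool)
    model = train(training)
    best = evaluate(model)
    for _ in range(max_rounds):
        picked = select(model, pool)          # indices into the unlabeled pool
        if not picked:
            break
        training += [(pool[i], label(pool[i])) for i in picked]
        pool = [d for i, d in enumerate(pool) if i not in picked]
        candidate = train(training)           # retrain on the updated set
        measure = evaluate(candidate)
        if measure <= best:                   # assumed stopping rule
            break
        model, best = candidate, measure
    return model, best

# Toy stand-ins: a threshold classifier on one-dimensional "documents".
def train(examples):
    lo = max(x for x, y in examples if not y)
    hi = min(x for x, y in examples if y)
    return (lo + hi) / 2.0

validation = [(0.1, False), (0.25, False), (0.35, True), (0.45, True), (0.8, True)]

def evaluate(threshold):
    hits = sum((x >= threshold) == y for x, y in validation)
    return hits / len(validation)

def select(threshold, pool, k=1):
    order = sorted(range(len(pool)), key=lambda i: abs(pool[i] - threshold))
    return order[:k]

def label(x):
    return x >= 0.3   # stands in for a human reviewer's decision

model, measure = refine(train, evaluate, select, label,
                        [(0.0, False), (1.0, True)],
                        [0.4, 0.32, 0.2, 0.9, 0.28])
```

In this toy run the threshold moves from 0.5 toward the reviewer's boundary at 0.3 as near-boundary documents are labeled, and the loop stops once a further round no longer raises accuracy on the validation set.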

9. A system comprising:
- a memory; and
  
  a processing device coupled to the memory, wherein the processing device is to:
  
  determine to improve an effectiveness measure of a first trained classification model, wherein the first trained model is trained using a set of training documents;
  
  select a plurality of unlabeled documents, wherein the plurality of unlabeled documents are not part of the set of training documents used to train the first trained classification model;
  
  generate a support vector based on a determination that one or more of the plurality of unlabeled documents are within a margin of a decision hyperplane associated with the first trained classification model;
  
  calculate an overall score for each unlabeled document based on a distance of a respective unlabeled document to the decision hyperplane and an angle diversity of the respective unlabeled document;
  
  compare the overall scores of the unlabeled documents to each other to select a pre-determined number of unlabeled documents having lowest scores in the plurality of unlabeled documents;
  
  update the set of training documents used to train the first trained classification model by adding the pre-determined number of unlabeled documents having the lowest scores in the plurality of unlabeled documents to the set of training documents;
  
  update the decision hyperplane based on the support vector;
  
  provide the updated set of training documents to the first trained classification model to improve the effectiveness measure of the first trained classification model by generating a second trained classification model from the updated set of training documents;
  
  identify an effectiveness measure of the second trained classification model; and
  
  generate a third trained classification model based on a determination that the effectiveness measure of the second trained classification model has improved from the effectiveness measure of the first trained classification model.
- Dependent claims (10, 11, 12):
  - 10. The system of claim 9, wherein the processing device is further to, upon determining to generate the second trained classification model, repeat the updating of the set of training documents.
  - 11. The system of claim 9, wherein the processing device further calculates the effectiveness measure of the second trained classification model on a set of validation documents.
  - 12. The system of claim 9, wherein the processing device is to calculate the overall score for each unlabeled document by:
    - calculating the distance from the respective unlabeled document to the decision hyperplane;
      
      calculating the angle diversity value for the respective unlabeled document;
      
      applying a parameter value to the distance and the angle diversity value; and
      
      calculating the overall score for the respective unlabeled document based on a sum of the parameter value being applied to the distance and the parameter value being applied to the angle diversity value.
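
Claims 3, 7, and 11 recite calculating the effectiveness measure on a set of validation documents without naming the measure. As one possible illustration, the sketch below assumes the measure is the F1 score (the harmonic mean of precision and recall) computed from predicted and reviewer-assigned labels. The function name and the sample labels are hypothetical.

```python
def f1_effectiveness(predicted, actual):
    """F1 score over paired boolean labels; returns 0.0 when undefined."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical model output versus reviewer labels on five validation documents.
predicted = [True, True, False, True, False]
actual = [True, False, False, True, True]
score = f1_effectiveness(predicted, actual)
```

Comparing this value between the first and second trained models is one way to decide whether retraining improved effectiveness.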

Current Assignee: Veritas Technologies, LLC (Whitehouse Group Ltd.)
Original Assignee: Veritas Technologies, LLC (Whitehouse Group Ltd.)
Inventors: Yu, Shengke; Rangan, Venkat
Primary Examiner(s): Misir, Dave
Application Number: US13/843,501
Time in Patent Office: 1,474 Days
Field of Search: 706/12
US Class Current: 1/1
CPC Class Codes:
G06N 20/00 Machine learning
G06N 20/10 using kernel methods, e.g. ...
