System and method for training data generation in predictive coding
Abstract
A predictive coding system updates a plurality of training documents for an untrained classification model based on a plurality of additional documents. The plurality of additional documents are selected from a plurality of unlabeled documents based on a decision hyperplane associated with a first trained classification model. The predictive coding system provides the updated plurality of training documents to the untrained classification model to cause the untrained classification model to be retrained and to cause a second trained classification model to be generated.
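The update-and-retrain flow in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `augment_and_retrain`, `ask_reviewer`, and `train` are hypothetical names, and the assumption that a reviewer supplies a label for each selected document before retraining is ours.

```python
def augment_and_retrain(training, pool, selected_ids, ask_reviewer, train):
    """Move selected documents into the training set, then train again.

    training: dict doc_id -> (features, label)
    pool: dict doc_id -> features (unlabeled documents)
    ask_reviewer: hypothetical callback returning a label for a doc_id
    train: hypothetical callback that builds a model from a training dict
    """
    updated = dict(training)   # copy so the caller's training set is untouched
    remaining = dict(pool)     # copy of the unlabeled pool
    for doc_id in selected_ids:
        features = remaining.pop(doc_id)
        updated[doc_id] = (features, ask_reviewer(doc_id))
    return updated, remaining, train(updated)


# Usage with stand-in callbacks: the "model" here is just the training-set size.
training = {"d1": ([1.0, 0.0], 1), "d2": ([0.0, 1.0], 0)}
pool = {"d3": [0.4, 0.5], "d4": [0.9, 0.1]}
updated, remaining, model = augment_and_retrain(
    training, pool, ["d3"], ask_reviewer=lambda doc_id: 1, train=len
)
```

Passing the trainer in as a callback keeps the sketch independent of any particular classifier library.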
12 Claims
1. A method comprising:
determining to improve an effectiveness measure of a first trained classification model, wherein the first trained model is trained using a set of training documents;
selecting a plurality of unlabeled documents, wherein the plurality of unlabeled documents are not part of the set of training documents used to train the first trained classification model;
generating a support vector based on a determination that one or more of the plurality of unlabeled documents are within a margin of a decision hyperplane associated with the first trained classification model;
calculating, by a processor in a predictive coding system, an overall score for each unlabeled document of the plurality of unlabeled documents based on a distance of a respective unlabeled document to the decision hyperplane and an angle diversity of the respective unlabeled document;
comparing, by the processor in the predictive coding system, the overall scores of the unlabeled documents to each other to select a pre-determined number of unlabeled documents having lowest scores in the plurality of unlabeled documents;
updating, by the processor in the predictive coding system, the set of training documents used to train the first trained classification model by adding the pre-determined number of unlabeled documents having the lowest scores in the plurality of unlabeled documents to the set of training documents;
updating the decision hyperplane based on the support vector;
providing, by the predictive coding system, the updated set of training documents to the first trained classification model to improve the effectiveness measure of the first trained classification model by generating a second trained classification model from the updated set of training documents;
identifying an effectiveness measure of the second trained classification model; and
generating a third trained classification model based on a determination that the effectiveness measure of the second trained classification model has improved from the effectiveness measure of the first trained classification model.
Dependent claims: 2, 3, 4
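The scoring and selection steps of claim 1 can be illustrated with a small sketch. The claim does not give a formula for the overall score, so the weighted sum below (distance to the hyperplane plus the largest absolute cosine against documents already picked) is an assumption borrowed from common angle-diversity active learning, and `trade_off`, `select_lowest_scores`, and the greedy selection order are hypothetical choices.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def distance_to_hyperplane(x, w, b):
    """Unsigned distance from x to the hyperplane w.x + b = 0."""
    return abs(dot(w, x) + b) / math.sqrt(dot(w, w))


def abs_cosine(a, b):
    """|cos| of the angle between a and b; 0 means orthogonal (most diverse)."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return abs(dot(a, b)) / norm if norm else 0.0


def select_lowest_scores(pool, w, b, k, trade_off=0.5):
    """Greedily pick up to k documents with the lowest overall score.

    Assumed score: trade_off * distance
    + (1 - trade_off) * max |cos| against documents already picked.
    """
    chosen = []
    remaining = dict(pool)  # doc_id -> feature vector
    while remaining and len(chosen) < k:
        def overall(doc_id):
            x = remaining[doc_id]
            redundancy = max((abs_cosine(x, pool[c]) for c in chosen), default=0.0)
            return (trade_off * distance_to_hyperplane(x, w, b)
                    + (1 - trade_off) * redundancy)
        best = min(remaining, key=overall)
        chosen.append(best)
        del remaining[best]
    return chosen


# Usage: "a" and "b" are near-duplicates close to the hyperplane x0 = 0.
pool = {"a": [0.1, 1.0], "b": [0.12, 1.0], "c": [0.3, 0.0]}
picked = select_lowest_scores(pool, w=[1.0, 0.0], b=0.0, k=2)
```

With `trade_off=1.0` the score reduces to distance alone, so the two near-duplicate documents would both be picked; the diversity term is what steers the second pick toward a document pointing in a different direction.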
5. A non-transitory computer-readable storage medium having instructions that, when executed by a processing device, cause the processing device to perform operations comprising:
determining to improve an effectiveness measure of a first trained classification model, wherein the first trained model is trained using a set of training documents;
selecting a plurality of unlabeled documents, wherein the plurality of unlabeled documents are not part of the set of training documents used to train the first trained classification model;
generating a support vector based on a determination that one or more of the plurality of unlabeled documents are within a margin of a decision hyperplane associated with the first trained classification model;
calculating, by the processing device, an overall score for each unlabeled document based on a distance of a respective unlabeled document to the decision hyperplane and an angle diversity of the respective unlabeled document;
comparing, by the processing device, the overall scores of the unlabeled documents to each other to select a pre-determined number of unlabeled documents having lowest scores in the plurality of unlabeled documents;
updating the set of training documents used to train the first trained classification model by adding the pre-determined number of unlabeled documents having the lowest scores in the plurality of unlabeled documents to the set of training documents;
updating the decision hyperplane based on the support vector;
providing the updated set of training documents to the first trained classification model to improve the effectiveness measure of the first trained classification model by generating a second trained classification model from the updated set of training documents;
identifying an effectiveness measure of the second trained classification model; and
generating a third trained classification model based on a determination that the effectiveness measure of the second trained classification model has improved from the effectiveness measure of the first trained classification model.
Dependent claims: 6, 7, 8
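The "within a margin of a decision hyperplane" determination can be sketched as below. It assumes the standard linear support vector machine convention in which the margin boundaries are w.x + b = +1 and w.x + b = -1; the claim itself does not fix that convention, and `within_margin` is a hypothetical helper name.

```python
def decision_value(x, w, b):
    """Signed decision value w.x + b for feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def within_margin(pool, w, b):
    """Return ids of documents whose decision value lies strictly inside (-1, 1)."""
    return [doc_id for doc_id, x in pool.items()
            if abs(decision_value(x, w, b)) < 1.0]


# Usage with a hypothetical hyperplane 2*x0 - 1 = 0.
pool = {"p": [0.6, 0.0], "q": [1.5, 3.0], "r": [0.1, 1.0], "s": [1.0, 0.0]}
candidates = within_margin(pool, w=[2.0, 0.0], b=-1.0)
```

Documents inside the margin are the ones that could become support vectors once labeled, which is why they are natural candidates for review.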
9. A system comprising:
a memory; and
a processing device coupled to the memory, wherein the processing device is to:
determine to improve an effectiveness measure of a first trained classification model, wherein the first trained model is trained using a set of training documents;
select a plurality of unlabeled documents, wherein the plurality of unlabeled documents are not part of the set of training documents used to train the first trained classification model;
generate a support vector based on a determination that one or more of the plurality of unlabeled documents are within a margin of a decision hyperplane associated with the first trained classification model;
calculate an overall score for each unlabeled document based on a distance of a respective unlabeled document to the decision hyperplane and an angle diversity of the respective unlabeled document;
compare the overall scores of the unlabeled documents to each other to select a pre-determined number of unlabeled documents having lowest scores in the plurality of unlabeled documents;
update the set of training documents used to train the first trained classification model by adding the pre-determined number of unlabeled documents having the lowest scores in the plurality of unlabeled documents to the set of training documents;
update the decision hyperplane based on the support vector;
provide the updated set of training documents to the first trained classification model to improve the effectiveness measure of the first trained classification model by generating a second trained classification model from the updated set of training documents;
identify an effectiveness measure of the second trained classification model; and
generate a third trained classification model based on a determination that the effectiveness measure of the second trained classification model has improved from the effectiveness measure of the first trained classification model.
Dependent claims: 10, 11, 12
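The final two steps, identifying an effectiveness measure and acting only if it has improved, might look like the sketch below. The claims do not say which effectiveness measure is used; the F-measure (harmonic mean of precision and recall) is assumed here purely for illustration, and `f_measure` and `has_improved` are hypothetical names.

```python
def f_measure(y_true, y_pred):
    """F1 score for binary labels (1 = responsive, 0 = not responsive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def has_improved(y_true, first_pred, second_pred):
    """True when the second model scores strictly higher than the first."""
    return f_measure(y_true, second_pred) > f_measure(y_true, first_pred)


# Usage on a small held-out validation set (made-up labels for illustration).
y_true = [1, 1, 1, 0, 0]
first_pred = [1, 0, 0, 1, 0]    # predictions of the first trained model
second_pred = [1, 1, 0, 0, 0]   # predictions of the second trained model
keep_training = has_improved(y_true, first_pred, second_pred)
```

In this reading, a third model would be generated only when `keep_training` is true, that is, when the added documents measurably helped.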
Specification