DATA CLASSIFICATION USING MACHINE LEARNING TECHNIQUES
Abstract
Systems, methods and computer program products for classifying documents are presented. Systems, methods and computer program products for analyzing documents, e.g., associated with legal discovery are also presented. Systems, methods and computer program products for cleaning up data are also presented. Systems, methods and computer program products for verifying an association of an invoice with an entity are also presented. Systems, methods and computer program products for managing medical records are presented. Systems, methods and computer program products for face recognition are presented.
39 Claims
1. A system for classifying documents, comprising:
a memory; and
a processor in communication with the memory, the processor being configured to process at least some instructions stored in the memory, wherein the memory stores computer executable program code comprising instructions for:
receiving at least one labeled seed document having a known confidence level of label assignment;
receiving unlabeled documents;
receiving at least one predetermined cost factor;
training a transductive classifier through iterative calculation using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents, wherein for each iteration of the calculations the cost factor is adjusted as a function of an expected label value;
after at least some of the iterations, storing confidence scores for the unlabeled documents; and
outputting identifiers of the unlabeled documents having the highest confidence scores to at least one of a user, another system, and another process.
Dependent claims: 2, 3, 4

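The steps recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation: the weighted logistic model, the rule that scales each unlabeled document's cost by the magnitude of its expected label, and all names (`train_transductive`, `top_confident`, the toy feature vectors) are assumptions chosen only to make the data flow concrete.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_weighted_logistic(examples, dim, epochs=200, lr=0.5):
    # examples: list of (features, target in {0, 1}, per-example cost).
    w, b = [0.0] * dim, 0.0
    n = max(len(examples), 1)
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y, cost in examples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = cost * (p - y)
            for j in range(dim):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def train_transductive(seeds, unlabeled, cost_factor, iterations=5):
    # seeds: list of (features, label in {0, 1}, confidence in [0, 1]).
    # unlabeled: dict mapping document id -> features.
    dim = len(seeds[0][0])
    seed_examples = [(x, y, cost_factor * conf) for x, y, conf in seeds]
    expected = {doc_id: 0.0 for doc_id in unlabeled}  # expected label in [-1, 1]
    scores = {}
    for _ in range(iterations):
        examples = list(seed_examples)
        for doc_id, x in unlabeled.items():
            e = expected[doc_id]
            if e != 0.0:
                # Assumed adjustment rule: cost grows with |expected label|.
                examples.append((x, 1 if e > 0 else 0, cost_factor * abs(e)))
        w, b = fit_weighted_logistic(examples, dim)
        for doc_id, x in unlabeled.items():
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            expected[doc_id] = 2.0 * p - 1.0
            scores[doc_id] = max(p, 1.0 - p)  # stored confidence score
    return scores, expected

def top_confident(scores, k):
    # Identifiers of the k unlabeled documents with the highest confidence.
    return sorted(scores, key=scores.get, reverse=True)[:k]

seeds = [([2.0], 1, 1.0), ([-2.0], 0, 1.0)]
unlabeled = {"a": [3.0], "b": [-3.0], "c": [0.1]}
scores, expected = train_transductive(seeds, unlabeled, cost_factor=1.0)
best = top_confident(scores, 2)
```

In this toy run the two documents far from the decision boundary end up with the highest confidence, while the ambiguous document near the boundary contributes little to training because its expected label stays close to zero.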
5. A system for analyzing documents, comprising:
a memory; and
a processor in communication with the memory, the processor being configured to process at least some instructions stored in the memory, wherein the memory stores computer executable program code comprising instructions for:
receiving documents;
performing a document classification technique on the documents using a transductive classifier trained through iterative calculation using at least one predetermined cost factor and at least one seed document; and
outputting identifiers of at least some of the documents based on the classification thereof.
Dependent claims: 6, 7, 8, 9, 10, 11

12. A system for cleaning up data, comprising:
a memory; and
a processor in communication with the memory, the processor being configured to process at least some instructions stored in the memory, wherein the memory stores computer executable program code comprising instructions for:
receiving a plurality of labeled data items;
selecting subsets of the data items for each of a plurality of categories;
setting an uncertainty for the data items in each subset to about zero;
setting an uncertainty for the data items not in the subsets to a predefined value that is not about zero;
training a transductive classifier through iterative calculation using the uncertainties, the data items in the subsets, and the data items not in the subsets as training examples;
applying the trained classifier to each of the labeled data items to classify each of the data items; and
outputting a classification of the input data items, or derivative thereof, to at least one of a user, another system, and another process.
Dependent claims: 13, 14, 15, 16

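The data cleanup steps of claim 12 can be sketched in the same spirit. The example below is an assumption-laden illustration, not the claimed classifier: it swaps in a weighted nearest-centroid model, treats `1 - uncertainty` as each item's training weight, and uses hypothetical names (`clean_labels`, `trusted_ids`) and toy data.

```python
def clean_labels(items, trusted_ids, uncertainty=0.5, iterations=5):
    # items: dict mapping id -> (features, given label).
    # trusted_ids: ids in the selected subsets (uncertainty about zero).
    # Items outside the subsets get the predefined nonzero uncertainty.
    weights = {i: 1.0 if i in trusted_ids else 1.0 - uncertainty for i in items}
    current = {i: label for i, (_, label) in items.items()}

    def centroids():
        # Weighted mean feature vector per current label.
        sums, totals = {}, {}
        for i, (x, _) in items.items():
            acc = sums.setdefault(current[i], [0.0] * len(x))
            for j, v in enumerate(x):
                acc[j] += weights[i] * v
            totals[current[i]] = totals.get(current[i], 0.0) + weights[i]
        return {lab: [s / totals[lab] for s in acc]
                for lab, acc in sums.items() if totals[lab] > 0}

    def nearest(x, cents):
        return min(cents, key=lambda lab: sum((a - b) ** 2
                                              for a, b in zip(x, cents[lab])))

    # Iterative training: uncertain items are re-labeled each round,
    # trusted items keep their given labels.
    for _ in range(iterations):
        cents = centroids()
        for i, (x, _) in items.items():
            if i not in trusted_ids:
                current[i] = nearest(x, cents)

    # Apply the trained classifier to every labeled item.
    cents = centroids()
    return {i: nearest(x, cents) for i, (x, _) in items.items()}

items = {
    "t1": ([0.0, 0.0], "A"),
    "t2": ([10.0, 10.0], "B"),
    "n1": ([0.5, 0.2], "B"),   # likely mislabeled in this toy data
    "n2": ([9.5, 9.8], "B"),
}
cleaned = clean_labels(items, trusted_ids={"t1", "t2"})
suspect = [i for i in items if cleaned[i] != items[i][1]]
```

Here `suspect` plays the role of the "derivative" of the classification: the list of items whose given label disagrees with the classifier's output.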
17. A system for verifying an association of an invoice with an entity, comprising:
a memory; and
a processor in communication with the memory, the processor being configured to process at least some instructions stored in the memory, wherein the memory stores computer executable program code comprising instructions for:
training a classifier based on an invoice format associated with a first entity;
accessing a plurality of invoices labeled as being associated with at least one of the first entity and other entities;
performing a document classification technique on the invoices using the classifier; and
outputting an identifier of at least one of the invoices having a high probability of not being associated with the first entity.
Dependent claims: 18, 19, 20, 21, 22

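A rough sketch of the invoice check in claim 17 follows. It is only a heuristic stand-in for the claimed classifier: the smoothed feature-frequency model, the mismatch score (which is not a calibrated probability), the threshold, and the feature strings are all hypothetical.

```python
def train_format_model(reference_invoices):
    # reference_invoices: list of sets of format features observed on
    # invoices known to come from the first entity.
    n = len(reference_invoices)
    counts = {}
    for feats in reference_invoices:
        for f in feats:
            counts[f] = counts.get(f, 0) + 1
    # Laplace-smoothed frequency of each feature, plus a value for unseen ones.
    freqs = {f: (c + 1) / (n + 2) for f, c in counts.items()}
    unseen = 1 / (n + 2)
    return freqs, unseen

def mismatch_score(model, invoice_feats):
    # Heuristic score in [0, 1]; higher means the format fits the entity less.
    freqs, unseen = model
    agreement = []
    for f in set(freqs) | set(invoice_feats):
        if f in invoice_feats:
            agreement.append(freqs.get(f, unseen))
        else:
            agreement.append(1.0 - freqs[f])
    return 1.0 - sum(agreement) / len(agreement)

def flag_invoices(model, invoices, threshold=0.5):
    # invoices: dict mapping invoice id -> set of format features.
    return [inv_id for inv_id, feats in invoices.items()
            if mismatch_score(model, feats) >= threshold]

model = train_format_model([
    {"logo:top-left", "total:bottom-right", "vat-line"},
    {"logo:top-left", "total:bottom-right"},
])
invoices = {
    "X": {"logo:top-left", "total:bottom-right", "vat-line"},
    "Y": {"logo:center", "no-total"},
}
flagged = flag_invoices(model, invoices)
```

The same pattern, with a classifier trained on a diagnosis instead of an invoice format, would apply to flagging medical records that fit a diagnosis poorly.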
23. A system for managing medical records, comprising:
a memory; and
a processor in communication with the memory, the processor being configured to process at least some instructions stored in the memory, wherein the memory stores computer executable program code comprising instructions for:
training a classifier based on a medical diagnosis;
accessing a plurality of medical records;
performing a document classification technique on the medical records using the classifier; and
outputting an identifier of at least one of the medical records having a low probability of being associated with the medical diagnosis.
Dependent claims: 24, 25, 26, 27, 28

29. A system for face recognition, comprising:
a memory; and
a processor in communication with the memory, the processor being configured to process at least some instructions stored in the memory, wherein the memory stores computer executable program code comprising instructions for:
receiving at least one labeled seed image of a face, the seed image having a known confidence level;
receiving unlabeled images;
receiving at least one predetermined cost factor;
training a transductive classifier through iterative calculation using the at least one predetermined cost factor, the at least one seed image, and the unlabeled images, wherein for each iteration of the calculations the cost factor is adjusted as a function of an expected label value;
after at least some of the iterations, storing confidence scores for the unlabeled seed images; and
outputting identifiers of the unlabeled images having the highest confidence scores to at least one of a user, another system, and another process.
Dependent claims: 30, 31, 32, 33

34. A product for classifying documents, comprising:
a program storage medium readable by a computer, where the medium tangibly embodies one or more programs of instructions executable by the computer to perform a method, comprising:
receiving at least one labeled seed document having a known confidence level of label assignment;
receiving unlabeled documents;
receiving at least one predetermined cost factor;
training a transductive classifier through iterative calculation using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents, wherein for each iteration of the calculations the cost factor is adjusted as a function of an expected label value;
after at least some of the iterations, storing confidence scores for the unlabeled documents; and
outputting identifiers of the unlabeled documents having the highest confidence scores to at least one of a user, another system, and another process.

35. A product for analyzing documents, comprising:
a program storage medium readable by a computer, where the medium tangibly embodies one or more programs of instructions executable by the computer to perform a method, comprising:
receiving documents;
performing a document classification technique on the documents using a transductive classifier trained through iterative calculation using at least one predetermined cost factor and at least one seed document; and
outputting identifiers of at least some of the documents based on the classification thereof.

36. A product for cleaning up data, comprising:
a program storage medium readable by a computer, where the medium tangibly embodies one or more programs of instructions executable by the computer to perform a method, comprising:
receiving a plurality of labeled data items;
selecting subsets of the data items for each of a plurality of categories;
setting an uncertainty for the data items in each subset to about zero;
setting an uncertainty for the data items not in the subsets to a predefined value that is not about zero;
training a transductive classifier through iterative calculation using the uncertainties, the data items in the subsets, and the data items not in the subsets as training examples;
applying the trained classifier to each of the labeled data items to classify each of the data items; and
outputting a classification of the input data items, or derivative thereof, to at least one of a user, another system, and another process.

37. A product for verifying an association of an invoice with an entity, comprising:
a program storage medium readable by a computer, where the medium tangibly embodies one or more programs of instructions executable by the computer to perform a method, comprising:
training a classifier based on an invoice format associated with a first entity;
accessing a plurality of invoices labeled as being associated with at least one of the first entity and other entities;
performing a document classification technique on the invoices using the classifier; and
outputting an identifier of at least one of the invoices having a high probability of not being associated with the first entity.

38. A product for managing medical records, comprising:
a program storage medium readable by a computer, where the medium tangibly embodies one or more programs of instructions executable by the computer to perform a method, comprising:
training a classifier based on a medical diagnosis;
accessing a plurality of medical records;
performing a document classification technique on the medical records using the classifier; and
outputting an identifier of at least one of the medical records having a low probability of being associated with the medical diagnosis.

39. A product for face recognition, comprising:
a program storage medium readable by a computer, where the medium tangibly embodies one or more programs of instructions executable by the computer to perform a method, comprising:
receiving at least one labeled seed image of a face, the seed image having a known confidence level;
receiving unlabeled images;
receiving at least one predetermined cost factor;
training a transductive classifier through iterative calculation using the at least one predetermined cost factor, the at least one seed image, and the unlabeled images, wherein for each iteration of the calculations the cost factor is adjusted as a function of an expected label value;
after at least some of the iterations, storing confidence scores for the unlabeled seed images; and
outputting identifiers of the unlabeled images having the highest confidence scores to at least one of a user, another system, and another process.
Specification