Data classification using machine learning techniques

US 8,719,197 B2
Filed: 04/19/2011
Issued: 05/06/2014
Est. Priority Date: 07/12/2006
Status: Active Grant

Abstract

Systems, methods and computer program products for classifying documents are presented. Systems, methods and computer program products for analyzing documents, e.g., associated with legal discovery are also presented. Systems, methods and computer program products for cleaning up data are also presented. Systems, methods and computer program products for verifying an association of an invoice with an entity are also presented. Systems, methods and computer program products for managing medical records are presented. Systems, methods and computer program products for face recognition are presented.

34 Claims

1. A system for classifying documents, comprising:
- a memory; and
  
  a processor in communication with the memory, the processor being configured to process at least some instructions stored in the memory, wherein the memory stores computer executable program code comprising instructions for:
  
  receiving at least one labeled seed document having a known confidence level of label assignment;
  
  receiving unlabeled documents;
  
  receiving at least one predetermined cost factor;
  
  training a transductive classifier through iterative calculation using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents, wherein for each iteration of the calculations the cost factor is adjusted as a function of an expected label value;
  
  after at least some of the iterations, storing confidence scores for the unlabeled documents; and
  
  outputting identifiers of the unlabeled documents having the highest confidence scores to at least one of a user, another system, and another process.
- Dependent claims (2, 3, 4):
  - 2. The system of claim 1, wherein the at least one seed document has a list of keywords.
  - 3. The system of claim 1, wherein confidence scores are stored after each of the iterations, wherein an identifier of the unlabeled document having the highest confidence score after each iteration is output.
  - 4. The system of claim 1, wherein the computer executable program code further comprises instructions for receiving a data point label prior probability for the labeled and unlabeled documents, wherein for each iteration of the calculations the data point label prior probability is adjusted according to an estimate of a data point class membership probability.
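
The training loop recited in claim 1 can be illustrated with a small sketch. This is not the patented implementation: it assumes a cost-weighted logistic loss as the underlying learner, maps scores to an expected label in [-1, 1] with `tanh`, and scales each unlabeled document's cost by the absolute value of that expected label. The names `fit_weighted_logistic`, `train_transductive`, and `top_confident` are hypothetical.

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def fit_weighted_logistic(points, targets, costs, epochs=200, lr=0.5):
    """Minimise a cost-weighted logistic loss by batch gradient descent."""
    dim = len(points[0])
    w = [0.0] * (dim + 1)                      # last entry is the bias
    total = sum(costs) or 1.0
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, y, c in zip(points, targets, costs):
            xb = list(x) + [1.0]
            margin = min(y * dot(w, xb), 50.0)  # clamp to avoid overflow in exp
            g = -c * y / (1.0 + math.exp(margin))
            for j, v in enumerate(xb):
                grad[j] += g * v
        w = [wj - lr * gj / total for wj, gj in zip(w, grad)]
    return w

def train_transductive(seeds, unlabeled, base_cost=1.0, iterations=5):
    """seeds: list of (features, label in {-1, +1}, confidence in [0, 1]).
    unlabeled: dict mapping document id -> features.
    Returns (expected labels, per-iteration confidence scores)."""
    ids = list(unlabeled)
    expected = {i: 0.0 for i in ids}            # expected label value in [-1, 1]
    history = []
    for _ in range(iterations):
        points = [x for x, _, _ in seeds]
        targets = [y for _, y, _ in seeds]
        costs = [base_cost * conf for _, _, conf in seeds]
        for i in ids:
            e = expected[i]
            points.append(unlabeled[i])
            targets.append(1.0 if e >= 0 else -1.0)
            # Assumption: the cost factor is scaled by |expected label|.
            costs.append(base_cost * abs(e))
        w = fit_weighted_logistic(points, targets, costs)
        for i in ids:
            score = dot(w, list(unlabeled[i]) + [1.0])
            expected[i] = math.tanh(score / 2.0)  # equals 2*sigmoid(score) - 1
        # Store confidence scores after this iteration.
        history.append({i: abs(expected[i]) for i in ids})
    return expected, history

def top_confident(confidences, k):
    """Identifiers of the k documents with the highest confidence scores."""
    return sorted(confidences, key=confidences.get, reverse=True)[:k]
```

In the first pass every unlabeled document has an expected label of zero, so only the seeds influence the fit; later passes let confidently scored documents contribute more. Calling `top_confident(history[-1], k)` yields the identifiers that the final step of the claim would output.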

5. A system for analyzing documents, comprising:
- a memory; and
  
  a processor in communication with the memory, the processor being configured to process at least some instructions stored in the memory, wherein the memory stores computer executable program code comprising instructions for:
  
  training a transductive classifier;
  
  receiving documents;
  
  performing a document classification technique on the documents using the transductive classifier trained through iterative calculation using at least one predetermined cost factor and at least one seed document, wherein for each iteration of calculations during the training, the cost factor is adjusted as a function of an expected label value; and
  
  outputting identifiers of at least some of the documents based on the classification thereof.
- Dependent claims (6, 7, 8, 9, 10, 11):
  - 6. The system of claim 5, wherein the documents are associated with a legal matter.
  - 7. The system of claim 5, wherein the computer executable program code further comprises instructions for training the transductive classifier, wherein for each iteration of the calculations during the training, the cost factor is adjusted as a function of an expected label value.
  - 8. The system of claim 5, wherein the computer executable program code further comprises instructions for receiving a data point label prior probability for labeled and unlabeled documents, wherein for each iteration of the calculations the data point label prior probability is adjusted according to an estimate of a data point class membership probability.
  - 9. The system of claim 5, wherein the document classification technique includes a support vector machine process.
  - 10. The system of claim 5, wherein the document classification technique includes a maximum entropy discrimination process.
  - 11. The system of claim 5, wherein the computer executable program code further comprises instructions for outputting a representation of links between the documents.

12. A system for cleaning up data, comprising:
- a memory; and
  
  a processor in communication with the memory, the processor being configured to process at least some instructions stored in the memory, wherein the memory stores computer executable program code comprising instructions for:
  
  receiving a plurality of labeled data items;
  
  selecting subsets of the data items for each of a plurality of categories;
  
  setting an uncertainty for the data items in each subset to about zero;
  
  setting an uncertainty for the data items not in the subsets to a predefined value that is not about zero;
  
  training a transductive classifier through iterative calculation using the uncertainties, the data items in the subsets, and the data items not in the subsets as training examples;
  
  applying the trained classifier to each of the labeled data items to classify each of the data items; and
  
  outputting a classification of the input data items, or derivative thereof, to at least one of a user, another system, and another process.
- Dependent claims (13, 14, 15, 16):
  - 13. The system of claim 12, wherein the subsets are selected at random.
  - 14. The system of claim 12, wherein the subsets are selected and verified by a user.
  - 15. The system of claim 12, wherein the computer executable program code further comprises instructions for changing the label of at least some of the data items based on the classification.
  - 16. The system of claim 12, wherein identifiers of data items having a confidence level below a predefined threshold after classification thereof are output to a user.
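
The data clean-up flow of claims 12 to 16 can be sketched as follows. This is an illustration under stated assumptions, not the claimed method: a weighted nearest-centroid classifier stands in for the transductive classifier, each item's training weight is taken to be one minus its uncertainty, and the default uncertainty of 0.5 and review threshold of 0.6 are arbitrary. All function names are hypothetical.

```python
import math

def weighted_centroids(items, uncertainty):
    """items: dict id -> (features, label). Each item is weighted by 1 - uncertainty."""
    sums, totals = {}, {}
    for item_id, (x, label) in items.items():
        weight = 1.0 - uncertainty[item_id]
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += weight * v
        totals[label] = totals.get(label, 0.0) + weight
    return {lab: [v / totals[lab] for v in acc]
            for lab, acc in sums.items() if totals[lab] > 0}

def classify(x, centroids):
    """Return (label, confidence) from a softmax over negative distances."""
    weights = {lab: math.exp(-math.dist(x, c)) for lab, c in centroids.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())

def clean_labels(items, trusted_ids, default_uncertainty=0.5, threshold=0.6):
    """Trusted subset gets uncertainty 0; every other item gets a nonzero default.
    Returns (new label per item, ids whose confidence falls below threshold)."""
    uncertainty = {i: (0.0 if i in trusted_ids else default_uncertainty)
                   for i in items}
    centroids = weighted_centroids(items, uncertainty)
    relabeled, review = {}, []
    for item_id, (x, _) in items.items():
        label, confidence = classify(x, centroids)
        relabeled[item_id] = label          # label may change (compare claim 15)
        if confidence < threshold:
            review.append(item_id)          # flag for a user (compare claim 16)
    return relabeled, review
```

Items outside the trusted subsets still take part in training, only with less influence, so a mislabeled item near the other category's trusted examples tends to be relabeled rather than dragging the boundary toward itself.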

17. A system for verifying an association of an invoice with an entity, comprising:
- a memory; and
  
  a processor in communication with the memory, the processor being configured to process at least some instructions stored in the memory, wherein the memory stores computer executable program code comprising instructions for:
  
  training a classifier based on an invoice format associated with a first entity;
  
  accessing a plurality of invoices labeled as being associated with at least one of the first entity and other entities;
  
  performing a document classification technique on the invoices using the classifier; and
  
  outputting an identifier of at least one of the invoices having a high probability of not being associated with the first entity,
  
  wherein the classifier is a transductive classifier, and further comprising training the transductive classifier through iterative calculation using at least one predetermined cost factor, at least one seed document, and the invoices, wherein for each iteration of the calculations the cost factor is adjusted as a function of an expected label value.
- Dependent claims (18, 19, 20, 21):
  - 18. The system of claim 17, wherein the document classification technique includes a transductive process, wherein the invoice format includes a physical layout of markings on the invoice.
  - 19. The system of claim 17, wherein the computer executable program code further comprises instructions for receiving a data point label prior probability for the seed document and invoices, wherein for each iteration of the calculations the data point label prior probability is adjusted according to an estimate of a data point class membership probability.
  - 20. The system of claim 17, wherein the document classification technique includes a support vector machine process.
  - 21. The system of claim 17, wherein the document classification technique includes a maximum entropy discrimination process.

22. A system for managing medical records, comprising:
- a memory; and
  
  a processor in communication with the memory, the processor being configured to process at least some instructions stored in the memory, wherein the memory stores computer executable program code comprising instructions for:
  
  accessing a plurality of medical records;
  
  training a transductive classifier based on a medical diagnosis through iterative calculation using:
  
  at least one predetermined cost factor,
  
  at least one seed document, and
  
  the medical records,
  
  performing a document classification technique on the medical records using the classifier; and
  
  outputting an identifier of at least one of the medical records having a low probability of being associated with the medical diagnosis,
  
  wherein the document classification technique includes a transductive process, and
  
  wherein for each iteration of the calculations the cost factor is adjusted as a function of an expected label value.
- Dependent claims (23, 24):
  - 23. The system of claim 22, wherein the computer executable program code further comprises instructions for receiving a data point label prior probability for the seed document and medical records, wherein for each iteration of the calculations the data point label prior probability is adjusted according to an estimate of a data point class membership probability.
  - 24. The system of claim 22, wherein the document classification technique includes a support vector machine process.

25. A system for managing medical records, comprising:
- a memory; and
  
  a processor in communication with the memory, the processor being configured to process at least some instructions stored in the memory, wherein the memory stores computer executable program code comprising instructions for:
  
  accessing a plurality of medical records;
  
  training a transductive classifier based on a medical diagnosis through iterative calculation using:
  
  at least one predetermined cost factor,
  
  at least one seed document, and
  
  the medical records,
  
  performing a document classification technique on the medical records using the classifier, and
  
  outputting an identifier of at least one of the medical records having a low probability of being associated with the medical diagnosis,
  
  wherein the document classification technique includes a maximum entropy discrimination process.

26. A system for face recognition, comprising:
- a memory; and
  
  a processor in communication with the memory, the processor being configured to process at least some instructions stored in the memory, wherein the memory stores computer executable program code comprising instructions for:
  
  receiving at least one labeled seed image of a face, the seed image having a known confidence level;
  
  receiving unlabeled images;
  
  receiving at least one predetermined cost factor;
  
  training a transductive classifier through iterative calculation using the at least one predetermined cost factor, the at least one seed image, and the unlabeled images, wherein for each iteration of the calculations the cost factor is adjusted as a function of an expected label value;
  
  after at least some of the iterations, storing confidence scores for the unlabeled seed images; and
  
  outputting identifiers of the unlabeled images having the highest confidence scores to at least one of a user, another system, and another process.
- Dependent claims (27, 28, 29, 30):
  - 27. The system of claim 26, wherein the at least one seed image has a label indicative of whether the image is included in a designated category.
  - 28. The system of claim 26, wherein confidence scores are stored after each of the iterations, wherein an identifier of the unlabeled images having the highest confidence score after each iteration is output.
  - 29. The system of claim 26, wherein the computer executable program code further comprises instructions for receiving a data point label prior probability for the labeled and unlabeled image, wherein for each iteration of the calculations the data point label prior probability is adjusted according to an estimate of a data point class membership probability.
  - 30. The system of claim 26, wherein the computer executable program code further comprises instructions for receiving a third unlabeled image of a face, comparing the third unlabeled image to at least some of the images having the highest confidence scores, and outputting an identifier of the third unlabeled image if a confidence that the face in the third unlabeled image is the same as the face in the seed image.
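
The comparison step in claim 30 can be pictured with a short sketch. The claim does not say how images are represented or compared, so everything here is an assumption: faces are taken to be fixed-length feature vectors, similarity is cosine similarity, and the 0.8 threshold is arbitrary. The names `cosine` and `match_new_face` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def match_new_face(new_id, new_vec, reference_vecs, threshold=0.8):
    """Compare a new image with the highest-confidence images.

    Returns new_id when the best similarity reaches the (assumed) threshold,
    otherwise None.
    """
    if not reference_vecs:
        return None
    best = max(cosine(new_vec, ref) for ref in reference_vecs)
    return new_id if best >= threshold else None
```

Here `reference_vecs` would hold the feature vectors of the images that received the highest confidence scores during training, so the new image is checked against several likely views of the seed face rather than the seed alone.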

31. A product for classifying documents, comprising:
- a non-transitory storage medium readable by a computer, where the medium tangibly embodies one or more programs of instructions executable by the computer to perform a method, comprising:
  
  receiving at least one labeled seed document having a known confidence level of label assignment;
  
  receiving unlabeled documents;
  
  receiving at least one predetermined cost factor;
  
  training a transductive classifier through iterative calculation using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents, wherein for each iteration of the calculations the cost factor is adjusted as a function of an expected label value;
  
  after at least some of the iterations, storing confidence scores for the unlabeled documents; and
  
  outputting identifiers of the unlabeled documents having the highest confidence scores to at least one of a user, another system, and another process.

32. A product for analyzing documents, comprising:
- a non-transitory storage medium readable by a computer, where the medium tangibly embodies one or more programs of instructions executable by the computer to perform a method, comprising:
  
  training a transductive classifier;
  
  receiving documents;
  
  performing a document classification technique on the documents using the transductive classifier trained through iterative calculation using at least one predetermined cost factor and at least one seed document, wherein for each iteration of the calculations during the training the cost factor is adjusted as a function of an expected label value; and
  
  outputting identifiers of at least some of the documents based on the classification thereof.

33. A product for cleaning up data, comprising:
- a non-transitory storage medium readable by a computer, where the medium tangibly embodies one or more programs of instructions executable by the computer to perform a method, comprising:
  
  receiving a plurality of labeled data items;
  
  selecting subsets of the data items for each of a plurality of categories;
  
  setting an uncertainty for the data items in each subset to about zero;
  
  setting an uncertainty for the data items not in the subsets to a predefined value that is not about zero;
  
  training a transductive classifier through iterative calculation using the uncertainties, the data items in the subsets, and the data items not in the subsets as training examples;
  
  applying the trained classifier to each of the labeled data items to classify each of the data items; and
  
  outputting a classification of the input data items, or derivative thereof, to at least one of a user, another system, and another process.

34. A product for face recognition, comprising:
- a non-transitory storage medium readable by a computer, where the medium tangibly embodies one or more programs of instructions executable by the computer to perform a method, comprising:
  
  receiving at least one labeled seed image of a face, the seed image having a known confidence level;
  
  receiving unlabeled images;
  
  receiving at least one predetermined cost factor;
  
  training a transductive classifier through iterative calculation using the at least one predetermined cost factor, the at least one seed image, and the unlabeled images, wherein for each iteration of the calculations the cost factor is adjusted as a function of an expected label value;
  
  after at least some of the iterations, storing confidence scores for the unlabeled seed images; and
  
  outputting identifiers of the unlabeled images having the highest confidence scores to at least one of a user, another system, and another process.

Current Assignee: Tungsten Automation Corp.
Original Assignee: Kofax Incorporated
Inventors: Schmidtler, Mauritius A. R.; Borrey, Roland; Sarah, Anthony
Primary Examiner(s): Chaki, Kakali
Assistant Examiner(s): Pellett, Daniel T
Application Number: US13/090,216
Publication Number: US 20110196870A1
Time in Patent Office: 1,113 Days
Field of Search: None
US Class Current: 706/20

CPC Class Codes

G06F 16/93   Document management systems
G06N 20/00   Machine learning
G06N 20/10   using kernel methods, e.g. ...
G06Q 10/10   Office automation; Time man...
G06Q 50/18   Legal services
G16H 10/60   for patient-specific data, ...
G16H 50/20   for computer-aided diagnosi...
Y02A 90/10   Information and communicati...
