Data classification methods using machine learning techniques

US 7,937,345 B2
Filed: 05/23/2007
Issued: 05/03/2011
Est. Priority Date: 07/12/2006
Status: Active Grant

First Claim

1. A method for adapting to a shift in document content, comprising:

receiving at least one labeled seed document;

receiving unlabeled documents;

receiving at least one predetermined cost factor;

training a transductive classifier using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents;

classifying the unlabeled documents having a confidence level above a predefined threshold into a plurality of categories using the classifier;

reclassifying at least some documents previously categorized by a different classifier into the categories using the classifier; and

outputting identifiers of the categorized documents to at least one of a user, another system, and another process.

Abstract

A method for adapting to a shift in document content according to one embodiment of the present invention includes receiving at least one labeled seed document; receiving unlabeled documents; receiving at least one predetermined cost factor; training a transductive classifier using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents; classifying the unlabeled documents having a confidence level above a predefined threshold into a plurality of categories using the classifier; reclassifying at least some of the categorized documents into the categories using the classifier; and outputting identifiers of the categorized documents to at least one of a user, another system, and another process. Methods for separating documents are also presented. Methods for document searching are also presented.

16 Claims

1. A method for adapting to a shift in document content, comprising:
- receiving at least one labeled seed document;
  
  receiving unlabeled documents;
  
  receiving at least one predetermined cost factor;
  
  training a transductive classifier using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents;
  
  classifying the unlabeled documents having a confidence level above a predefined threshold into a plurality of categories using the classifier;
  
  reclassifying at least some documents previously categorized by a different classifier into the categories using the classifier; and
  
  outputting identifiers of the categorized documents to at least one of a user, another system, and another process.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The method of claim 1, further comprising moving an unlabeled document having a confidence level below the predefined threshold into one or more new categories.
  - 3. The method of claim 1, and further comprising training the transductive classifier through iterative calculation using at least one predetermined cost factor, the at least one seed document, and the unlabeled documents, wherein for each iteration of the calculations the cost factor is adjusted as a function of an expected label value, and using the trained classifier to classify the unlabeled documents.
  - 4. The method of claim 3, further comprising receiving a data point label prior probability for the seed document and unlabeled documents, wherein for each iteration of the calculations the data point label prior probability is adjusted according to an estimate of a data point class membership probability.
  - 5. The method of claim 1, wherein the unlabeled documents are customer complaints, and further comprising linking product changes with customer complaints.
  - 6. The method of claim 1, wherein the unlabeled documents are invoices.
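
The flow of claims 1 and 2 can be illustrated with a small sketch. It is not the patented implementation: a soft nearest-centroid model over bag-of-words vectors stands in for the transductive classifier, and the function names (`train_transductive`, `classify`), the similarity-share confidence measure, and the sample documents are all assumptions made for illustration. The cost factor here simply scales how much each unlabeled document contributes to a category centroid.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(v * b[t] for t, v in a.items() if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def memberships(vec, centroids):
    """Share of total similarity per category; all zeros if nothing matches."""
    sims = {label: cosine(vec, c) for label, c in centroids.items()}
    total = sum(sims.values())
    return {label: (s / total if total else 0.0) for label, s in sims.items()}

def train_transductive(seeds, unlabeled, cost_factor, rounds=5):
    """seeds: {doc_id: (text, label)}; unlabeled: {doc_id: text}.
    Unlabeled documents join each centroid weighted by cost_factor * membership."""
    seed_vecs = [(vectorize(text), label) for text, label in seeds.values()]
    unl_vecs = [vectorize(text) for text in unlabeled.values()]
    soft = [{} for _ in unl_vecs]  # no soft labels before the first pass
    centroids = {}
    for _ in range(rounds + 1):
        centroids = {}
        for vec, label in seed_vecs:
            c = centroids.setdefault(label, Counter())
            for t, v in vec.items():
                c[t] += v
        for vec, probs in zip(unl_vecs, soft):
            for label, p in probs.items():
                for t, v in vec.items():
                    centroids[label][t] += cost_factor * p * v
        soft = [memberships(vec, centroids) for vec in unl_vecs]
    return centroids

def classify(centroids, docs, threshold):
    """Return ({doc_id: label} above the threshold, [doc_ids] at or below it)."""
    categorized, uncertain = {}, []
    for doc_id, text in docs.items():
        probs = memberships(vectorize(text), centroids)
        label = max(probs, key=probs.get)
        if probs[label] > threshold:
            categorized[doc_id] = label
        else:
            uncertain.append(doc_id)
    return categorized, uncertain

# Hypothetical sample data.
seeds = {
    "s1": ("invoice total amount due payment", "invoice"),
    "s2": ("complaint broken product refund angry", "complaint"),
}
unlabeled = {
    "u1": "payment due on invoice amount",
    "u2": "product arrived broken want refund",
    "u3": "meeting agenda for tuesday",
}
centroids = train_transductive(seeds, unlabeled, cost_factor=0.5)
categorized, uncertain = classify(centroids, unlabeled, threshold=0.8)

# A document that a different (older) classifier had filed as "invoice".
previous = {"p1": "refund for broken product"}
reclassified, _ = classify(centroids, previous, threshold=0.8)

print(categorized, uncertain, reclassified)
```

In this sketch the documents left in `uncertain` are the candidates that claim 2 would move into one or more new categories, and the keys of `categorized` and `reclassified` play the role of the output identifiers.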

7. A method for separating documents, comprising:
- receiving labeled data;
  
  receiving a sequence of unlabeled documents;
  
  adapting probabilistic classification rules using transduction based on the labeled data and the unlabeled documents;
  
  updating weights used for document separation according to the probabilistic classification rules;
  
  determining locations of separations between the documents in the sequence of documents according to said probabilistic classification rules;
  
  outputting indicators of the determined locations of the separations in the sequence to at least one of a user, another system, and another process; and
  
  flagging the documents with codes, the codes correlating to the indicators.
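
As a rough illustration of the steps in claim 7, the sketch below treats each page in the sequence as a token list and estimates the probability that a page starts a new document. The per-token log-odds model, the soft-label re-estimation used as the transduction step, the 0.5 decision threshold, and the `NEW`/`CONT` codes are all assumptions chosen for brevity; the claim itself does not prescribe them.

```python
import math
from collections import defaultdict

def fit_weights(pages, probs, smoothing=1.0):
    """Per-token log-odds that a page starts a new document, from soft labels.
    pages: list of token lists; probs: P(first page) for each page."""
    first, cont = defaultdict(float), defaultdict(float)
    n_first = n_cont = 0.0
    for tokens, p in zip(pages, probs):
        n_first += p
        n_cont += 1.0 - p
        for t in set(tokens):
            first[t] += p
            cont[t] += 1.0 - p
    weights = {
        t: math.log((first[t] + smoothing) / (n_first + 2 * smoothing))
        - math.log((cont[t] + smoothing) / (n_cont + 2 * smoothing))
        for t in set(first) | set(cont)
    }
    bias = math.log((n_first + smoothing) / (n_cont + smoothing))
    return weights, bias

def p_first(tokens, weights, bias):
    score = bias + sum(weights.get(t, 0.0) for t in set(tokens))
    return 1.0 / (1.0 + math.exp(-score))

def adapt(labeled, sequence, rounds=3):
    """Transduction step: re-estimate the weights using soft labels
    predicted for the unlabeled sequence alongside the labeled pages."""
    pages = [tokens for tokens, _ in labeled]
    hard = [1.0 if is_first else 0.0 for _, is_first in labeled]
    weights, bias = fit_weights(pages, hard)
    for _ in range(rounds):
        soft = [p_first(tokens, weights, bias) for tokens in sequence]
        weights, bias = fit_weights(pages + sequence, hard + soft)
    return weights, bias

# Hypothetical labeled pages: (tokens, starts_new_document).
labeled = [
    (["invoice", "number", "date"], True),
    (["continued", "item", "total"], False),
    (["invoice", "number", "customer"], True),
    (["continued", "item", "subtotal"], False),
]
# Hypothetical unlabeled page sequence from a scanned batch.
sequence = [
    ["invoice", "number", "acme"],
    ["continued", "item", "widget"],
    ["continued", "item", "total"],
    ["invoice", "number", "globex"],
    ["continued", "item", "gadget"],
]

weights, bias = adapt(labeled, sequence)
separations = [i for i, page in enumerate(sequence)
               if p_first(page, weights, bias) > 0.5]
codes = ["NEW" if i in separations else "CONT" for i in range(len(sequence))]
print(separations, codes)
```

Here `separations` corresponds to the output indicators of separation locations, and `codes` to the flags that correlate with those indicators.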

8. A method for document searching, comprising:
- receiving a search query;
  
  retrieving documents based on the search query;
  
  outputting the documents;
  
  receiving user-entered labels for at least some of the documents, the labels being indicative of a relevance of the document to the search query;
  
  training a classifier based on the search query and the user-entered labels;
  
  performing a document classification technique on the documents using the classifier for reclassifying the documents; and
  
  outputting identifiers of at least some of the documents based on the classification thereof.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The method of claim 8, wherein the document classification technique includes a transductive process.
  - 10. The method of claim 9, wherein the classifier is a transductive classifier, and further comprising training the transductive classifier through iterative calculation using at least one predetermined cost factor, the search query, and the documents, wherein for each iteration of the calculations the cost factor is adjusted as a function of an expected label value, and using the trained classifier to classify the documents.
  - 11. The method of claim 10, further comprising receiving a data point label prior probability for the search query and documents, wherein for each iteration of the calculations the data point label prior probability is adjusted according to an estimate of a data point class membership probability.
  - 12. The method of claim 8, wherein the document classification technique includes a support vector machine process.
  - 13. The method of claim 8, wherein the document classification technique includes a maximum entropy discrimination process.
  - 14. The method of claim 8, wherein the reclassified documents are output, those documents having a highest confidence being output first.
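
The relevance-feedback loop of claims 8 and 14 might look like the following sketch. A simple additive linear scorer (query terms plus terms from relevant documents, minus terms from irrelevant ones) is swapped in for the classifier; claims 12 and 13 name support vector machine and maximum entropy discrimination processes instead, which are not implemented here. The corpus, the helper names, and the use of the raw score as a confidence value are assumptions.

```python
from collections import Counter

def tokens(text):
    return text.lower().split()

def retrieve(query, corpus):
    """Return ids of documents sharing at least one term with the query."""
    q = set(tokens(query))
    return [doc_id for doc_id, text in corpus.items() if q & set(tokens(text))]

def train(query, labels, corpus, query_weight=1.0):
    """Linear term weights built from the query and the user-entered labels."""
    w = Counter()
    for t in tokens(query):
        w[t] += query_weight
    for doc_id, relevant in labels.items():
        sign = 1.0 if relevant else -1.0
        for t in set(tokens(corpus[doc_id])):
            w[t] += sign
    return w

def rerank(w, doc_ids, corpus):
    """Reclassify retrieved documents; keep positives, highest score first."""
    scored = [(sum(w[t] for t in set(tokens(corpus[d]))), d) for d in doc_ids]
    scored.sort(key=lambda pair: -pair[0])
    return [d for score, d in scored if score > 0]

# Hypothetical corpus and feedback.
corpus = {
    "d1": "jaguar car engine review",
    "d2": "jaguar habitat rainforest cat",
    "d3": "jaguar car dealership prices",
    "d4": "jaguar cat diet prey",
    "d5": "python programming tutorial",
}
query = "jaguar"
retrieved = retrieve(query, corpus)        # shown to the user
labels = {"d1": True, "d2": False}         # user marks relevance
classifier = train(query, labels, corpus)
ranked = rerank(classifier, retrieved, corpus)
print(ranked)
```

The returned list contains only document identifiers, ordered so that the documents the scorer is most confident about come first.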

15. A method for document searching, comprising:
- receiving a search query;
  
  retrieving documents based on the search query;
  
  outputting the documents;
  
  receiving user-entered labels for at least some of the documents, the labels being indicative of a relevance of the document to the search query;
  
  training a transductive classifier based on the search query and the user-entered labels, wherein the transductive classifier is trained through iterative calculation using at least one predetermined cost factor, the search query, and the documents, wherein for each iteration of the calculations the cost factor is adjusted as a function of an expected label value, and using the trained classifier to classify the documents;
  
  performing a document classification technique on at least some of the documents using the classifier for classifying the at least some of the documents; and
  
  outputting identifiers of the at least some of the documents based on the classification thereof.
- Dependent claims (16):
  - 16. The method of claim 15, further comprising receiving a data point label prior probability for the search query and documents, wherein for each iteration of the calculations the data point label prior probability is adjusted according to an estimate of a data point class membership probability.
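
Claims 15 and 16 (like claims 3 and 4 and claims 10 and 11) describe an iteration in which the cost factor follows the expected label value and the label prior follows the estimated class membership probability. The sketch below is one possible reading, not the method of the specification: it assumes the cost of each unlabeled point is the base cost factor scaled by the absolute expected label, assumes the prior is reset to the mean membership probability, and uses a cost-weighted logistic regression on one-dimensional inputs as the underlying learner.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, targets, costs, steps=200, lr=0.5):
    """Cost-weighted logistic regression on 1-D inputs.
    targets lie in [-1, 1] and are mapped to [0, 1] for the loss."""
    w = b = 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y, c in zip(xs, targets, costs):
            err = sigmoid(w * x + b) - (y + 1.0) / 2.0
            gw += c * err * x
            gb += c * err
        w -= lr * gw / len(xs)
        b -= lr * gb / len(xs)
    return w, b

def train_transductive(labeled, unlabeled, cost_factor, prior=0.5, rounds=10):
    """labeled: [(x, y)] with y in {-1, +1}; unlabeled: [x]."""
    n_lab = len(labeled)
    xs = [x for x, _ in labeled] + list(unlabeled)
    expected = [float(y) for _, y in labeled] + [0.0] * len(unlabeled)
    for _ in range(rounds):
        # Assumed rule: cost shrinks for points whose expected label is near 0.
        costs = [cost_factor] * n_lab + [cost_factor * abs(e)
                                         for e in expected[n_lab:]]
        w, b = fit(xs, expected, costs)
        probs = []
        for x in unlabeled:
            like = sigmoid(w * x + b)
            probs.append(like * prior /
                         (like * prior + (1.0 - like) * (1.0 - prior)))
        expected[n_lab:] = [2.0 * p - 1.0 for p in probs]
        # Assumed rule: prior tracks the mean membership estimate (clamped).
        prior = min(max(sum(probs) / len(probs), 0.01), 0.99)
    return w, b, prior, costs

# Hypothetical 1-D data: two labeled points and five unlabeled ones.
labeled = [(-2.0, -1), (2.0, 1)]
unlabeled = [-1.5, -1.0, 1.0, 1.5, 0.1]
w, b, prior, costs = train_transductive(labeled, unlabeled, cost_factor=1.0)
print(w, b, prior, costs)
```

Under these assumptions, points the model is unsure about carry little cost in the next round, so they influence the decision boundary less until their expected labels firm up.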

Current Assignee: Kofax Incorporated
Original Assignee: Kofax Incorporated
Inventors: Borrey, Roland; Schmidtler, Mauritius A. R.
Primary Examiner(s): Fernandez Rivas, Omar F
Application Number: US11/752,719
Publication Number: US 20080086433A1
Time in Patent Office: 1,441 Days
Field of Search: 706/12, 706/15, 706/20, 706/21, 706/45, 706/52, 706/62
US Class Current: 706/20
CPC Class Codes:
- G06F 16/353: into predefined classes
- G06N 20/00: Machine learning
- G06N 20/10: using kernel methods, e.g. ...
