Methods and apparatus for automated image classification

US 8,671,112 B2
Filed: 06/12/2008
Issued: 03/11/2014
Est. Priority Date: 06/12/2008
Status: Active Grant

Abstract

A system for automated classification of an image of an electronic document such as a facsimile document. The image is converted to a textual representation, and at least some of the terms in the textual representation may be associated with one or more predefined classification types, thereby enabling the document to be classified, and for multi-page documents, determining boundaries used to split the document into sections. The development of associations between terms and classification types may result from providing, to the system, a training set of manually-classified documents. A training module analyzes the training set to calculate probabilities that particular terms may appear in documents of a particular classification type. Probabilities established during training are used during automated document processing to assign a classification type to the document. A confidence score associated with the assigned classification type provides a metric for assessing the accuracy of the automated process.
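
The training step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `train`, the whitespace tokenizer, the add-one smoothing, and the sample classification labels are all hypothetical and do not come from the patent text.

```python
from collections import Counter, defaultdict


def train(labeled_docs):
    """Build term/classification association values from manually classified text.

    labeled_docs: iterable of (text, classification) pairs (hypothetical format).
    Returns (assoc, priors), where assoc[cls][term] estimates P(term | cls)
    and priors[cls] is the fraction of training documents labeled cls.
    """
    term_counts = defaultdict(Counter)  # cls -> term -> occurrences
    doc_counts = Counter()              # cls -> number of documents
    vocab = set()

    for text, cls in labeled_docs:
        terms = text.lower().split()    # assumption: whitespace tokenization
        term_counts[cls].update(terms)
        doc_counts[cls] += 1
        vocab.update(terms)

    total_docs = sum(doc_counts.values())
    priors = {cls: n / total_docs for cls, n in doc_counts.items()}

    assoc = {}
    for cls, counts in term_counts.items():
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing (assumption)
        assoc[cls] = {t: (counts[t] + 1) / denom for t in vocab}
    return assoc, priors


training_set = [
    ("lab result glucose result", "lab_result"),
    ("prescription refill request", "prescription"),
    ("lab order glucose", "lab_result"),
]
assoc, priors = train(training_set)
```

Here `priors` plays the role of the "percentage of historical documents" for each classification type, and `assoc` holds the per-term probabilities that the later classification step looks up.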

16 Claims

1. A method of processing an unclassified electronic document image comprising information associated with a healthcare entity, the method comprising:
- converting the image to a textual representation;
  
  identifying at least one term in the textual representation, wherein the at least one term is represented in training data indicating a degree of association between the at least one term and a plurality of document classifications, wherein the training data includes information about a plurality of terms extracted from a plurality of documents during training;
  
  determining, with at least one computer processor, for a first document classification of the plurality of document classifications, a first probability that the unclassified electronic document image belongs to the first document classification, wherein determining the first probability comprises for each term of the at least one term, multiplying a value from the training data indicating the degree of association between the term and the first document classification by an initial probability representing a percentage of historical documents that were classified using the first document classification, wherein the initial probability is scaled by a number of times the term appears in the textual representation;
  
  assigning to the unclassified electronic document image, a document classification to produce a classified electronic document image, wherein the document classification is assigned based, at least in part, on the determined first probability;
  
  associating a confidence score with the classified electronic document image, wherein the confidence score is determined based, at least in part, on a plurality of classification type probability values each indicating a likelihood that the classified electronic document image is associated with one of the plurality of document classifications; and
  
  determining that the classified electronic document image was accurately classified if the confidence score exceeds a predetermined threshold value.
- Dependent claims (2, 3, 4):
  - 2. The method of claim 1, wherein converting the unclassified electronic document image to a textual representation comprises applying an optical character recognition algorithm to the unclassified electronic document image.
  - 3. The method of claim 1, wherein the unclassified electronic document image is a multi-page document, and each page of the multi-page document is assigned a separate document classification.
  - 4. The method of claim 1, wherein the document classification is further based on at least one predetermined document classification category.
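
One possible reading of the scoring and confidence steps in claim 1 is sketched below. The claim does not say how per-term products are combined or how the confidence score is derived from the per-classification values, so summing the products, normalizing across classifications, and using the top normalized value as the confidence score are assumptions made here for illustration. The names `classify`, `assoc`, `priors`, and the threshold of 0.5 are likewise hypothetical.

```python
from collections import Counter


def classify(text, assoc, priors, threshold=0.5):
    """Score each classification, pick the best, and attach a confidence score.

    assoc[cls][term]: association value from training (hypothetical layout).
    priors[cls]: share of historical documents given classification cls.
    """
    counts = Counter(text.lower().split())  # assumption: whitespace tokens
    scores = {}
    for cls, prior in priors.items():
        score = 0.0
        for term, n in counts.items():
            if term in assoc[cls]:  # only terms represented in training data
                # association value times the prior scaled by the term count
                score += assoc[cls][term] * (prior * n)
        scores[cls] = score

    total = sum(scores.values())
    if total == 0:
        return None, 0.0, False  # no known terms: nothing to assign
    # Assumption: normalize so the per-classification values sum to 1.
    probs = {cls: s / total for cls, s in scores.items()}
    best = max(probs, key=probs.get)
    confidence = probs[best]
    return best, confidence, confidence > threshold


assoc = {
    "lab_result": {"glucose": 0.30, "refill": 0.05},
    "prescription": {"glucose": 0.05, "refill": 0.40},
}
priors = {"lab_result": 0.5, "prescription": 0.5}
label, confidence, accepted = classify("glucose glucose refill", assoc, priors)
```

The third return value corresponds to the final step of the claim: the classification is treated as accurate only when the confidence score exceeds the predetermined threshold.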

5. A method of classifying an image of an unclassified electronic document comprising information associated with a healthcare entity, the method comprising:
- generating, from the image, a data structure comprising at least one term appearing in the image;
  
  determining a term frequency for the at least one term, wherein the term frequency indicates a number of times that the at least one term appears in the image of the unclassified electronic document;
  
  determining, for a first document classification of a plurality of document classifications, a first probability that the unclassified electronic document image belongs to the first document classification, wherein determining the first probability comprises for each term of the at least one term, multiplying a value from training data indicating the degree of association between the term and the first document classification by an initial probability representing a percentage of historical documents that were classified using the first document classification, wherein the initial probability is scaled by the term frequency for the term;
  
  determining, for a second document classification of the plurality of document classifications, a second probability that the unclassified electronic document image belongs to the second document classification, wherein determining the second probability comprises for each term of the at least one term, multiplying a value from training data indicating the degree of association between the term and the second document classification by an initial probability representing a percentage of historical documents that were classified using the second document classification, wherein the initial probability is scaled by the term frequency for the term;
  
  determining at least one classification probability for the image of the unclassified electronic document, the at least one classification probability determined based, at least in part, on the determined first probability and the determined second probability;
  
  selecting a classification based, at least in part, on the first probability and the second probability; and
  
  assigning the classification to the image of the unclassified electronic document to produce a classified electronic document image.
- Dependent claims (6, 7, 8, 9, 10, 11):
  - 6. The method of claim 5, wherein the data structure is generated by processing the image with at least one optical character recognition algorithm.
  - 7. The method of claim 5, wherein the term frequency is the number of occurrences of the term in the data structure.
  - 8. The method of claim 5, wherein the unclassified electronic document is a facsimile document.
  - 9. The method of claim 5, wherein the healthcare entity is selected from the group consisting of a healthcare provider and a patient of a healthcare provider.
  - 10. The method of claim 5, further comprising determining a confidence value associated with the classification.
  - 11. The method of claim 10, wherein determining a confidence value comprises determining a difference between the first probability and the second probability.
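
The term-frequency data structure of claim 7 and the difference-based confidence value of claim 11 might look like the following sketch. The OCR output, the association values, the priors, the classification names, and the helper `class_probability` are hypothetical, and summing per-term products is an assumption rather than something the claims specify.

```python
from collections import Counter


def class_probability(term_freq, assoc_for_class, prior):
    """Sum, over known terms, of association value x (prior x term frequency)."""
    return sum(
        assoc_for_class[term] * prior * freq
        for term, freq in term_freq.items()
        if term in assoc_for_class
    )


# Hypothetical OCR output and training values, for illustration only.
ocr_terms = ["refill", "refill", "pharmacy", "glucose"]
term_freq = Counter(ocr_terms)  # claim 7: occurrences of each term

first = class_probability(term_freq, {"refill": 0.4, "pharmacy": 0.3}, prior=0.25)
second = class_probability(term_freq, {"glucose": 0.5, "refill": 0.1}, prior=0.75)

selected = "prescription" if first >= second else "lab_result"
confidence = abs(first - second)  # claim 11: difference between the two
```

A small difference signals that the two classifications were nearly tied, which is when a manual review step would be most useful.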

12. A computer-readable storage medium, encoded with a series of instructions, that when executed on a computer, perform a method of classifying an unclassified electronic document image according to a plurality of document classifications, wherein the unclassified electronic document image comprises information associated with a healthcare entity, the method comprising:
- parsing the unclassified electronic document image to identify a first term and a second term in the unclassified electronic document image;
  
  determining, for a first document classification of the plurality of document classifications, a first probability that the unclassified electronic document image belongs to the first document classification, wherein determining the first probability comprises multiplying a value from training data indicating the degree of association between the first term and the first document classification by an initial probability representing a percentage of historical documents that were classified using the first document classification, wherein the initial probability is scaled by a number of times the first term appears in the textual representation;
  
  determining, for the first document classification of the plurality of document classifications, a second probability that the unclassified electronic document image belongs to the first document classification, wherein determining the second probability comprises multiplying a value from training data indicating the degree of association between the second term and the first document classification by an initial probability representing a percentage of historical documents that were classified using the first document classification, wherein the initial probability is scaled by a number of times the second term appears in the textual representation;
  
  determining a combined probability based, at least in part, on the first probability and the second probability; and
  
  assigning a classification to the unclassified electronic document image to produce a classified electronic document image, wherein the classification is assigned based, at least in part, on the determined combined probability.
- Dependent claims (13, 14, 15, 16):
  - 13. The computer-readable medium of claim 12, wherein parsing the unclassified electronic document image comprises processing the unclassified electronic document image using an optical character recognition algorithm.
  - 14. The computer-readable medium of claim 12, wherein the unclassified electronic document image is a facsimile document.
  - 15. The computer-readable medium of claim 12, wherein determining the first probability comprises accessing at least one data structure, the at least one data structure comprising the first term and the first probability that the at least one term is associated with the at least one of the plurality of classification types.
  - 16. The computer-readable medium of claim 15, wherein the at least one data structure is at least one dictionary generated by a training module.
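
Claims 12, 15, and 16 describe looking up per-term values in a dictionary produced by training and combining the resulting per-term probabilities. The sketch below assumes the combination is a product, computed as a sum of logarithms to avoid underflow with many terms; that choice is a common numerical technique swapped in here, not something stated in the claims. The dictionary contents, the priors, and the function names are hypothetical.

```python
import math

# Hypothetical dictionary of the kind a training module might emit:
# classification -> term -> association value.
dictionary = {
    "referral": {"referral": 0.20, "specialist": 0.10},
    "invoice": {"referral": 0.01, "specialist": 0.02},
}
priors = {"referral": 0.4, "invoice": 0.6}


def term_probability(term, count, cls):
    """Association value multiplied by the prior scaled by the term count."""
    return dictionary[cls][term] * (priors[cls] * count)


def combined_log_probability(term_counts, cls):
    """Combine per-term probabilities as a product, computed as a sum of logs."""
    return sum(
        math.log(term_probability(term, count, cls))
        for term, count in term_counts.items()
    )


term_counts = {"referral": 3, "specialist": 1}  # first and second term
scores = {cls: combined_log_probability(term_counts, cls) for cls in dictionary}
assigned = max(scores, key=scores.get)
```

Because the prior is scaled by a raw term count, the per-term values are better read as scores than as strict probabilities; only their relative order across classifications matters for the assignment.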

Current Assignee: Athenahealth Incorporated
Original Assignee: Athenahealth Incorporated
Inventors: Amar, Anshul; Sallaska Nye, JoRel
Primary Examiner(s): Jalil, Neveen Abel
Assistant Examiner(s): Buckingham, Kellye Dee
Application Number: US12/138,181
Publication Number: US 20090313194A1
Time in Patent Office: 2,098 Days
Field of Search: None
US Class Current: 707/780
CPC Class Codes: G06F 16/353 into predefined classes
