Methods and apparatus for automated image classification
First Claim
1. A method of processing an unclassified electronic document image comprising information associated with a healthcare entity, the method comprising:
converting the image to a textual representation;
identifying at least one term in the textual representation, wherein the at least one term is represented in training data indicating a degree of association between the at least one term and a plurality of document classifications, wherein the training data includes information about a plurality of terms extracted from a plurality of documents during training;
determining, with at least one computer processor, for a first document classification of the plurality of document classifications, a first probability that the unclassified electronic document image belongs to the first document classification, wherein determining the first probability comprises for each term of the at least one term, multiplying a value from the training data indicating the degree of association between the term and the first document classification by an initial probability representing a percentage of historical documents that were classified using the first document classification, wherein the initial probability is scaled by a number of times the term appears in the textual representation;
assigning to the unclassified electronic document image, a document classification to produce a classified electronic document image, wherein the document classification is assigned based, at least in part, on the determined first probability;
associating a confidence score with the classified electronic document image, wherein the confidence score is determined based, at least in part, on a plurality of classification type probability values each indicating a likelihood that the classified electronic document image is associated with one of the plurality of document classifications; and
determining that the classified electronic document image was accurately classified if the confidence score exceeds a predetermined threshold value.
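The scoring, assignment, and confidence steps recited above can be sketched in Python. This is only an illustration under stated assumptions, not the patented implementation: the claim does not say how per-term products are combined or how the confidence score is derived, so the sketch assumes the products are summed per classification and that the confidence score is the winning classification's share of the total. The names `score_classifications`, `classify`, `association`, and `priors` are hypothetical.

```python
from collections import Counter


def score_classifications(term_counts, association, priors):
    """Return a raw score for every document classification.

    Assumption: each term contributes association * (prior * count),
    and the contributions are summed across terms.
    """
    scores = {}
    for cls, prior in priors.items():
        total = 0.0
        for term, count in term_counts.items():
            if term in association:  # only terms represented in the training data
                total += association[term].get(cls, 0.0) * (prior * count)
        scores[cls] = total
    return scores


def classify(term_counts, association, priors, threshold=0.8):
    """Assign a classification and report whether it clears the threshold.

    Assumption: the confidence score is the normalized score of the
    winning classification.
    """
    scores = score_classifications(term_counts, association, priors)
    total = sum(scores.values())
    if total == 0:
        return None, 0.0, False  # no known terms, nothing to assign
    probabilities = {cls: s / total for cls, s in scores.items()}
    best = max(probabilities, key=probabilities.get)
    confidence = probabilities[best]
    return best, confidence, confidence > threshold


# Hypothetical training data and term counts, for illustration only.
association = {
    "referral": {"Referral": 0.9, "LabResult": 0.1},
    "glucose": {"Referral": 0.2, "LabResult": 0.8},
}
priors = {"Referral": 0.5, "LabResult": 0.5}
term_counts = Counter({"referral": 3, "glucose": 1})

label, confidence, accepted = classify(term_counts, association, priors, threshold=0.7)
```

A document whose confidence score does not exceed the threshold could, for example, be routed to manual review; the claim itself only states the condition under which the classification is treated as accurate.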
9 Assignments
0 Petitions
Abstract
A system for automated classification of an image of an electronic document such as a facsimile document. The image is converted to a textual representation, and at least some of the terms in the textual representation may be associated with one or more predefined classification types, thereby enabling the document to be classified, and for multi-page documents, determining boundaries used to split the document into sections. The development of associations between terms and classification types may result from providing, to the system, a training set of manually-classified documents. A training module analyzes the training set to calculate probabilities that particular terms may appear in documents of a particular classification type. Probabilities established during training are used during automated document processing to assign a classification type to the document. A confidence score associated with the assigned classification type provides a metric for assessing the accuracy of the automated process.
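The training step described in the abstract can be illustrated with a short Python sketch. It assumes, purely for illustration, that the probability of a term appearing in a classification type is estimated as the fraction of manually classified training documents of that type containing the term, and that the initial probability of a type is its share of the training set. The function name `train` and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict


def train(labeled_documents):
    """Estimate term/classification associations from a training set.

    labeled_documents: iterable of (list_of_terms, classification) pairs
    taken from manually classified documents.
    """
    docs_per_class = Counter()
    docs_with_term = defaultdict(Counter)  # term -> classification -> document count
    for terms, cls in labeled_documents:
        docs_per_class[cls] += 1
        for term in set(terms):  # count each term once per document
            docs_with_term[term][cls] += 1

    total_docs = sum(docs_per_class.values())
    # Initial probability: share of training documents in each classification.
    priors = {cls: n / total_docs for cls, n in docs_per_class.items()}
    # Association: fraction of a classification's documents that contain the term.
    association = {
        term: {cls: counts[cls] / docs_per_class[cls] for cls in docs_per_class}
        for term, counts in docs_with_term.items()
    }
    return association, priors


# Hypothetical manually classified training set.
training_set = [
    (["referral", "patient"], "Referral"),
    (["referral", "physician"], "Referral"),
    (["glucose", "patient"], "LabResult"),
    (["glucose", "referral"], "LabResult"),
]
association, priors = train(training_set)
```

The resulting `association` and `priors` tables are the kind of training data that a later classification pass would look up for each term found in an incoming document.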
35 Citations
16 Claims
1. A method of processing an unclassified electronic document image comprising information associated with a healthcare entity, the method comprising:
converting the image to a textual representation;
identifying at least one term in the textual representation, wherein the at least one term is represented in training data indicating a degree of association between the at least one term and a plurality of document classifications, wherein the training data includes information about a plurality of terms extracted from a plurality of documents during training;
determining, with at least one computer processor, for a first document classification of the plurality of document classifications, a first probability that the unclassified electronic document image belongs to the first document classification, wherein determining the first probability comprises for each term of the at least one term, multiplying a value from the training data indicating the degree of association between the term and the first document classification by an initial probability representing a percentage of historical documents that were classified using the first document classification, wherein the initial probability is scaled by a number of times the term appears in the textual representation;
assigning to the unclassified electronic document image, a document classification to produce a classified electronic document image, wherein the document classification is assigned based, at least in part, on the determined first probability;
associating a confidence score with the classified electronic document image, wherein the confidence score is determined based, at least in part, on a plurality of classification type probability values each indicating a likelihood that the classified electronic document image is associated with one of the plurality of document classifications; and
determining that the classified electronic document image was accurately classified if the confidence score exceeds a predetermined threshold value.
Dependent claims: 2, 3, 4
5. A method of classifying an image of an unclassified electronic document comprising information associated with a healthcare entity, the method comprising:
generating, from the image, a data structure comprising at least one term appearing in the image;
determining a term frequency for the at least one term, wherein the term frequency indicates a number of times that the at least one term appears in the image of the unclassified electronic document;
determining, for a first document classification of a plurality of document classifications, a first probability that the unclassified electronic document image belongs to the first document classification, wherein determining the first probability comprises for each term of the at least one term, multiplying a value from training data indicating the degree of association between the term and the first document classification by an initial probability representing a percentage of historical documents that were classified using the first document classification, wherein the initial probability is scaled by the term frequency for the term;
determining, for a second document classification of the plurality of document classifications, a second probability that the unclassified electronic document image belongs to the second document classification, wherein determining the second probability comprises for each term of the at least one term, multiplying a value from training data indicating the degree of association between the term and the second document classification by an initial probability representing a percentage of historical documents that were classified using the second document classification, wherein the initial probability is scaled by the term frequency for the term;
determining at least one classification probability for the image of the unclassified electronic document, the at least one classification probability determined based, at least in part, on the determined first probability and the determined second probability;
selecting a classification based, at least in part, on the first probability and the second probability; and
assigning the classification to the image of the unclassified electronic document to produce a classified electronic document image.
Dependent claims: 6, 7, 8, 9, 10, 11
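The term-frequency data structure and the choice between two classifications in claim 5 can be sketched as follows. This is a hedged illustration, not the claimed method itself: the input `text` stands in for whatever the image-to-text step produces, the tokenization rule is an assumption, and per-term products are assumed to be summed. The names `term_frequencies`, `class_probability`, and `select_classification` are hypothetical.

```python
import re
from collections import Counter


def term_frequencies(text, vocabulary):
    """Build a term -> count mapping for terms known from training."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t in vocabulary)


def class_probability(frequencies, association, prior, cls):
    """Sum, over terms, association * (initial probability * term frequency)."""
    return sum(association[term][cls] * (prior * n) for term, n in frequencies.items())


def select_classification(frequencies, association, priors, first, second):
    """Compare two classifications and return the one with the higher value."""
    p_first = class_probability(frequencies, association, priors[first], first)
    p_second = class_probability(frequencies, association, priors[second], second)
    chosen = first if p_first >= p_second else second
    return chosen, p_first, p_second


# Hypothetical training data and recognized text, for illustration only.
association = {
    "glucose": {"Referral": 0.2, "LabResult": 0.8},
    "referral": {"Referral": 0.9, "LabResult": 0.1},
}
priors = {"Referral": 0.5, "LabResult": 0.5}
text = "Glucose panel: glucose 5.4 mmol/L. Referral not required."

frequencies = term_frequencies(text, vocabulary=association.keys())
chosen, p_first, p_second = select_classification(
    frequencies, association, priors, "Referral", "LabResult"
)
```

Using a `Counter` keeps the term frequency lookup and the list of terms in one structure, which is one simple way to satisfy both the "data structure" and "term frequency" steps.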
12. A computer-readable storage medium, encoded with a series of instructions, that when executed on a computer, perform a method of classifying an unclassified electronic document image according to a plurality of document classifications, wherein the unclassified electronic document image comprises information associated with a healthcare entity, the method comprising:
parsing the unclassified electronic document image to identify a first term and a second term in the unclassified electronic document image;
determining, for a first document classification of the plurality of document classifications, a first probability that the unclassified electronic document image belongs to the first document classification, wherein determining the first probability comprises multiplying a value from training data indicating the degree of association between the first term and the first document classification by an initial probability representing a percentage of historical documents that were classified using the first document classification, wherein the initial probability is scaled by a number of times the first term appears in the textual representation;
determining, for the first document classification of the plurality of document classifications, a second probability that the unclassified electronic document image belongs to the first document classification, wherein determining the second probability comprises multiplying a value from training data indicating the degree of association between the second term and the first document classification by an initial probability representing a percentage of historical documents that were classified using the first document classification, wherein the initial probability is scaled by a number of times the second term appears in the textual representation;
determining a combined probability based, at least in part, on the first probability and the second probability; and
assigning a classification to the unclassified electronic document image to produce a classified electronic document image, wherein the classification is assigned based, at least in part, on the determined combined probability.
Dependent claims: 13, 14, 15, 16
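The two per-term probabilities in claim 12 can be written compactly. The notation here is an assumption introduced for illustration: a(t, c) is the training value for the association between term t and classification c, π(c) is the initial probability of c, and n_1, n_2 are the numbers of times the first and second terms appear. The claim only requires that the combined probability be based at least in part on both values; the sum shown is one possible combination rule, not one stated in the claim.

```latex
\begin{aligned}
p_1 &= a(t_1, c)\,\bigl(n_1\,\pi(c)\bigr) \\
p_2 &= a(t_2, c)\,\bigl(n_2\,\pi(c)\bigr) \\
P_{\text{combined}} &= p_1 + p_2 \quad \text{(assumed combination rule)}
\end{aligned}
```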
Specification