Methods and systems for classifying data using a hierarchical taxonomy
Abstract
A method and system for classifying documents are provided. A set of document classifiers is generated by applying a classification algorithm to a trusted corpus that includes a set of training documents representing a taxonomy. One or more of the generated document classifiers are executed against a plurality of input documents to create a plurality of classified documents. Each classified document is associated with a classification within the taxonomy and a classification confidence level. One or more classified documents that are associated with a classification confidence level below a predetermined threshold value are selected to create a set of low-confidence documents. The low-confidence documents are disassociated from each of the associated classifications. A user is prompted to enter a classification within the taxonomy for at least one low-confidence document. The low-confidence document is associated with the entered classification and with a predetermined confidence level to create a newly classified document.
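The selection, disassociation, and re-labeling steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the `ClassifiedDocument` type, the function names, the 0.0 to 1.0 confidence scale, and the `ask_user` callback standing in for the user prompt are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ClassifiedDocument:
    """Hypothetical record for one classified document."""
    text: str
    classification: Optional[str]  # taxonomy label, or None once disassociated
    confidence: float              # assumed scale: 0.0 (lowest) to 1.0 (highest)


def select_low_confidence(docs: List[ClassifiedDocument],
                          threshold: float) -> List[ClassifiedDocument]:
    """Return the documents whose confidence level is below the threshold."""
    return [d for d in docs if d.confidence < threshold]


def disassociate(docs: List[ClassifiedDocument]) -> None:
    """Remove the automatically assigned classification from each document."""
    for d in docs:
        d.classification = None


def relabel(docs: List[ClassifiedDocument],
            ask_user: Callable[[str], str],
            predetermined_confidence: float) -> None:
    """Attach the user-entered classification and a predetermined confidence."""
    for d in docs:
        d.classification = ask_user(d.text)  # stand-in for prompting a user
        d.confidence = predetermined_confidence


if __name__ == "__main__":
    docs = [
        ClassifiedDocument("engine stalls at idle", "Powertrain/Engine", 0.92),
        ClassifiedDocument("rattle near the dashboard", "Powertrain/Engine", 0.31),
    ]
    low = select_low_confidence(docs, threshold=0.6)
    disassociate(low)
    # A fixed answer replaces the interactive prompt in this example.
    relabel(low, ask_user=lambda text: "Interior/Trim", predetermined_confidence=1.0)
    for d in docs:
        print(d)
```

The strict `<` comparison mirrors the wording "below a predetermined threshold value"; how a document exactly at the threshold is treated is a choice the abstract does not spell out.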
20 Claims
1. A computer-implemented method including executing instructions stored on a computer-readable medium, the method comprising:
generating a set of document classifiers by applying a classification algorithm to a trusted corpus, wherein the trusted corpus includes a set of training documents representing a hierarchical taxonomy, the hierarchical taxonomy including a hierarchical tree structure of domain specific issues that includes multiple levels of issue categories, subcategories, and sub issues of each issue, the trusted corpus further includes previously classified documents associated with a classification confidence level above a predetermined confidence level threshold;
executing one or more of the generated document classifiers against a first plurality of input documents to create a first plurality of classified documents, wherein each classified document is associated with a classification within the taxonomy and a classification confidence level;
selecting one or more classified documents that are associated with a classification confidence level below the predetermined confidence level threshold to create a set of low-confidence documents;
disassociating the low-confidence documents from each of the associated classifications;
prompting a user to enter a new classification within the hierarchical taxonomy for at least one low-confidence document, wherein the low-confidence document is associated with the entered classification and with a predetermined confidence level to create a newly classified document in at least one of the multiple levels of issue categories, subcategories, and sub issues of each issue of the hierarchical taxonomy;
applying a highest classification confidence level to the newly classified document;
including the newly classified document in the trusted corpus to create an updated trusted corpus; and
executing one or more of the generated document classifiers, by applying the classification algorithm to the updated trusted corpus against a second plurality of input documents to create a second plurality of classified documents, wherein each classified document is associated with a classification within the taxonomy and a classification confidence level.
(Dependent claims: 2, 3, 4, 5, 6, 7)
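Claim 1 recites a hierarchical tree structure with multiple levels of issue categories, subcategories, and sub issues. One way to picture such a structure is the Python sketch below. The `TaxonomyNode` class, its methods, and the sample issue names are hypothetical and do not come from the patent; the sketch only shows how a user-entered classification at any level could be checked against the tree.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TaxonomyNode:
    """Hypothetical node in a hierarchical taxonomy of issues."""
    name: str
    children: Dict[str, "TaxonomyNode"] = field(default_factory=dict)

    def add_path(self, path: List[str]) -> None:
        """Create any missing nodes along category -> subcategory -> sub issue."""
        node = self
        for name in path:
            if name not in node.children:
                node.children[name] = TaxonomyNode(name)
            node = node.children[name]

    def find(self, path: List[str]) -> Optional["TaxonomyNode"]:
        """Return the node at the given path, or None if the path is not in the tree."""
        node: Optional[TaxonomyNode] = self
        for name in path:
            node = node.children.get(name)
            if node is None:
                return None
        return node


def is_valid_classification(root: TaxonomyNode, path: List[str]) -> bool:
    """A classification is accepted at any level, as long as the path exists."""
    return bool(path) and root.find(path) is not None


# Illustrative taxonomy; the issue names are invented for this example.
root = TaxonomyNode("issues")
root.add_path(["Billing", "Refunds", "Duplicate charge"])
root.add_path(["Billing", "Invoices", "Missing invoice"])
root.add_path(["Hardware", "Battery", "Does not charge"])

if __name__ == "__main__":
    print(is_valid_classification(root, ["Billing"]))             # category level
    print(is_valid_classification(root, ["Billing", "Refunds"]))  # subcategory level
    print(is_valid_classification(root, ["Billing", "Battery"]))  # not in the tree
```

Storing children in a dictionary keyed by name keeps lookups at each level simple; a real system might instead keep the taxonomy in a database or configuration file.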
8. A computer system comprising:
a memory for storing a trusted corpus, wherein the trusted corpus includes a set of training documents representing a hierarchical taxonomy, the hierarchical taxonomy including a hierarchical tree structure of domain specific issues that includes multiple levels of issue categories, subcategories, and sub issues of each issue, the trusted corpus further includes previously classified documents associated with a classification confidence level above a predetermined confidence level threshold; and
a processor coupled to the memory and programmed to:
generate a set of document classifiers by applying a classification algorithm to the trusted corpus;
execute one or more of the generated document classifiers against a first plurality of input documents to create a first plurality of classified documents, wherein each classified document is associated with a classification within the taxonomy and a classification confidence level;
select one or more classified documents that are associated with a classification confidence level below the predetermined confidence level threshold to create a set of low-confidence documents;
disassociate the low-confidence documents from each of the associated classifications;
prompt a user to enter a classification within the hierarchical taxonomy for at least one low-confidence document, wherein the low-confidence document is associated with the entered classification and with a predetermined confidence level to create a newly classified document;
apply a highest classification confidence level to the newly classified document;
include the newly classified document in the trusted corpus to create an updated trusted corpus; and
execute one or more of the generated document classifiers, by applying the classification algorithm to the updated trusted corpus against a second plurality of input documents to create a second plurality of classified documents, wherein each classified document is associated with a classification within the taxonomy and a classification confidence level.
(Dependent claims: 9, 10, 11, 12, 13, 14)
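The effect of including a newly classified document in the trusted corpus can be sketched in Python as follows. The claims do not name a particular classification algorithm, so a crude word-overlap scorer is swapped in here purely as a stand-in; the function names, the tuple layout, and the sample documents are assumptions. The sketch also assumes one reading of the final step, namely that classifiers are regenerated from the updated trusted corpus before the second batch of documents is processed.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical layout of a trusted-corpus entry: (text, classification, confidence)
Labeled = Tuple[str, str, float]


def generate_classifier(trusted_corpus: List[Labeled]) -> Dict[str, Counter]:
    """Build one word-count profile per classification (stand-in algorithm)."""
    profiles: Dict[str, Counter] = defaultdict(Counter)
    for text, label, _confidence in trusted_corpus:
        profiles[label].update(text.lower().split())
    return dict(profiles)


def classify(profiles: Dict[str, Counter], text: str) -> Tuple[str, float]:
    """Return (classification, confidence) for one input document.

    Confidence is the fraction of the document's words that appear in the
    best-matching profile, which is only a rough illustrative measure.
    """
    words = text.lower().split()
    total = len(words) or 1  # avoid division by zero for empty input
    scores = {
        label: sum(1 for w in words if w in profile) / total
        for label, profile in profiles.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    trusted = [
        ("refund for duplicate charge", "Billing/Refunds", 1.0),
        ("battery does not charge", "Hardware/Battery", 1.0),
    ]
    profiles = generate_classifier(trusted)
    # First batch: no known words, so the confidence is low.
    print(classify(profiles, "screen flickers when scrolling"))

    # After user review, the document joins the trusted corpus with the
    # highest confidence level, and the classifier is regenerated.
    updated = trusted + [("screen flickers when scrolling", "Hardware/Display", 1.0)]
    profiles = generate_classifier(updated)
    # Second batch: a similar document now matches the new classification.
    print(classify(profiles, "screen flickers at night"))
```

A production system would use a real text classifier and a calibrated confidence measure; the point of the sketch is only the feedback loop between user review and the trusted corpus.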
15. One or more non-transitory computer-readable media having computer-executable instructions embodied thereon, wherein when executed by a computing device, the computer-executable instructions cause the computing device to:
generate a set of document classifiers by applying a classification algorithm to a trusted corpus, wherein the trusted corpus includes a set of training documents representing a hierarchical taxonomy, the hierarchical taxonomy including a hierarchical tree structure of domain specific issues that includes multiple levels of issue categories, subcategories, and sub issues of each issue, the trusted corpus further includes previously classified documents associated with a classification confidence level above a predetermined confidence level threshold;
execute one or more of the generated document classifiers against a first plurality of input documents to create a first plurality of classified documents, wherein each classified document is associated with a classification within the taxonomy and a classification confidence level;
select one or more classified documents that are associated with a classification confidence level below the predetermined confidence level threshold to create a set of low-confidence documents;
disassociate the low-confidence documents from each of the associated classifications;
prompt a user to enter a classification within the hierarchical taxonomy for at least one low-confidence document, wherein the low-confidence document is associated with the entered classification and with a predetermined confidence level to create a newly classified document;
apply a highest classification confidence level to the newly classified document;
include the newly classified document in the trusted corpus to create an updated trusted corpus; and
execute one or more of the generated document classifiers, by applying the classification algorithm to the updated trusted corpus against a second plurality of input documents to create a second plurality of classified documents, wherein each classified document is associated with a classification within the taxonomy and a classification confidence level.
(Dependent claims: 16, 17, 18, 19, 20)
Specification