Unified text analytics annotator development life cycle combining rule-based and machine learning based techniques
First Claim
1. A method for developing a text analytics program for extracting at least one target concept comprising:
- utilizing at least one processor to execute computer code that performs the steps of:
initiating a development tool that accepts user input to develop rules for extraction of features of the at least one target concept within a dataset comprising textual information;
developing, using the rules for feature extraction, an evaluation dataset comprising at least one document annotated with the at least one target concept to be extracted by the text analytics program;
creating, using the rules for feature extraction, a rule-based annotator to extract the at least one target concept;
training, using the evaluation dataset, a machine-learning annotator to extract the at least one target concept within the dataset;
evaluating each of the rule-based annotator and the machine-learning annotator against the evaluation dataset and comparing the extraction results, of each of the rule-based annotator and the machine-learning annotator, from the evaluation against a threshold for accuracy;
combining, responsive to determining each of the rule-based annotator and the machine-learning annotator meet the threshold for accuracy, the rule-based annotator and the machine-learning annotator to form a combined annotator having features from both of the rule-based annotator and the machine-learning annotator;
evaluating, using the evaluation dataset, extraction performance of the combined annotator against a predetermined threshold; and
publishing, when the extraction performance of the combined annotator exceeds the predetermined threshold, the combined annotator for use in an application that extracts the at least one target concept from a plurality of datasets.
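The two evaluation gates in the claim (each annotator must meet an accuracy threshold before combination, and the combined annotator must exceed a predetermined threshold before publication) can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the claim does not name a metric or a combination strategy, so exact-match F1 and a union of extracted spans are assumptions, and every name here (`f1_score`, `combine`, `develop`, `Span`, `Annotator`, `EvalSet`) is hypothetical.

```python
from typing import Callable, List, Optional, Set, Tuple

Span = Tuple[int, int]                    # (start, end) character offsets
Annotator = Callable[[str], Set[Span]]    # document text -> extracted spans
EvalSet = List[Tuple[str, Set[Span]]]     # (document text, annotated gold spans)


def f1_score(annotator: Annotator, eval_set: EvalSet) -> float:
    """Micro-averaged exact-match F1 (an assumed accuracy metric)."""
    tp = fp = fn = 0
    for text, gold in eval_set:
        predicted = annotator(text)
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def combine(rule_based: Annotator, learned: Annotator) -> Annotator:
    """Assumed combination strategy: union of both annotators' spans."""
    return lambda text: rule_based(text) | learned(text)


def develop(
    rule_based: Annotator,
    learned: Annotator,
    eval_set: EvalSet,
    accuracy_threshold: float,
    publish_threshold: float,
) -> Optional[Annotator]:
    """Return the combined annotator if it may be published, else None."""
    # Gate 1: each annotator must individually meet the accuracy threshold.
    for annotator in (rule_based, learned):
        if f1_score(annotator, eval_set) < accuracy_threshold:
            return None
    combined = combine(rule_based, learned)
    # Gate 2: the combined annotator must exceed the predetermined threshold.
    if f1_score(combined, eval_set) > publish_threshold:
        return combined
    return None
```

Returning `None` stands in for "go back and refine the rules or retrain"; the claim itself does not say what happens when a gate fails.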
Abstract
One embodiment provides a method for developing a text analytics program for extracting at least one target concept including: utilizing at least one processor to execute computer code that performs the steps of: initiating a development tool that accepts user input to develop rules for extraction of features of the at least one target concept within a dataset comprising textual information; developing, using the rules for feature extraction, an evaluation dataset comprising at least one document annotated with the at least one target concept to be extracted by the text analytics program; creating, using the rules for feature extraction, a rule-based annotator to extract the at least one target concept; training, using the evaluation dataset, a machine-learning annotator to extract the at least one target concept within the dataset; combining the rule-based annotator and the machine learning annotator to form a combined annotator; evaluating, using the evaluation dataset, extraction performance of the combined annotator against a predetermined threshold; and publishing, when the extraction performance of the combined annotator exceeds the predetermined threshold, the combined annotator for use in an application that extracts the at least one target concept from a plurality of datasets.
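The early steps of the abstract (rules producing both a rule-based annotator and an annotated evaluation dataset, which then trains a second annotator) might look like the sketch below. It rests on several assumptions that the source does not state: rules are modeled as regular expressions, the evaluation dataset is built by pre-annotating documents with the rules and passing them through an optional reviewer callback, and the "trained" annotator is a toy memorizing learner used purely as a stand-in for a real machine-learning model. All function names are hypothetical.

```python
import re
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int]                    # (start, end) character offsets
Annotator = Callable[[str], Set[Span]]    # document text -> extracted spans
EvalSet = List[Tuple[str, Set[Span]]]     # (document text, annotated spans)


def make_rule_annotator(patterns: List[str]) -> Annotator:
    """Build a rule-based annotator from user-written regex rules (assumed form)."""
    compiled = [re.compile(p) for p in patterns]

    def annotate(text: str) -> Set[Span]:
        return {m.span() for rx in compiled for m in rx.finditer(text)}

    return annotate


def build_eval_set(
    docs: List[str],
    rule_annotator: Annotator,
    review: Callable[[str, Set[Span]], Set[Span]] = lambda text, spans: spans,
) -> EvalSet:
    """Pre-annotate documents with the rules; `review` is a hypothetical
    hook where a person could correct the annotations."""
    return [(doc, set(review(doc, rule_annotator(doc)))) for doc in docs]


def train_memorizing_annotator(eval_set: EvalSet) -> Annotator:
    """Toy stand-in for model training: memorize annotated surface strings."""
    vocab = {text[s:e] for text, spans in eval_set for s, e in spans if s < e}

    def annotate(text: str) -> Set[Span]:
        return {
            m.span()
            for term in vocab
            for m in re.finditer(re.escape(term), text)
        }

    return annotate
```

A real system would replace the memorizing learner with an actual trained model; it is kept trivial here so that the data flow from rules to dataset to trained annotator stays visible.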
19 Claims
1. A method for developing a text analytics program for extracting at least one target concept comprising:
- utilizing at least one processor to execute computer code that performs the steps of:
initiating a development tool that accepts user input to develop rules for extraction of features of the at least one target concept within a dataset comprising textual information;
developing, using the rules for feature extraction, an evaluation dataset comprising at least one document annotated with the at least one target concept to be extracted by the text analytics program;
creating, using the rules for feature extraction, a rule-based annotator to extract the at least one target concept;
training, using the evaluation dataset, a machine-learning annotator to extract the at least one target concept within the dataset;
evaluating each of the rule-based annotator and the machine-learning annotator against the evaluation dataset and comparing the extraction results, of each of the rule-based annotator and the machine-learning annotator, from the evaluation against a threshold for accuracy;
combining, responsive to determining each of the rule-based annotator and the machine-learning annotator meet the threshold for accuracy, the rule-based annotator and the machine-learning annotator to form a combined annotator having features from both of the rule-based annotator and the machine-learning annotator;
evaluating, using the evaluation dataset, extraction performance of the combined annotator against a predetermined threshold; and
publishing, when the extraction performance of the combined annotator exceeds the predetermined threshold, the combined annotator for use in an application that extracts the at least one target concept from a plurality of datasets.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. An apparatus for developing a text analytics program for extracting at least one target concept, comprising:
- at least one processor, and a computer readable storage medium having computer readable program code embodied therewith and executable by the at least one processor, the computer readable program code comprising:
computer readable program code that initiates a development tool that accepts user input to develop rules for extraction of features of the at least one target concept within a dataset comprising textual information;
computer readable program code that develops, using the rules for feature extraction, an evaluation dataset comprising at least one document annotated with the at least one target concept to be extracted by the text analytics program;
computer readable program code that creates, using the rules for feature extraction, a rule-based annotator to extract the at least one target concept;
computer readable program code that trains, using the evaluation dataset, a machine-learning annotator to extract the at least one target concept within the dataset;
computer readable program code that evaluates each of the rule-based annotator and the machine-learning annotator against the evaluation dataset and compares the extraction results, of each of the rule-based annotator and the machine-learning annotator, from the evaluation against a threshold for accuracy;
computer readable program code that combines, responsive to determining each of the rule-based annotator and the machine-learning annotator meet the threshold for accuracy, the rule-based annotator and the machine-learning annotator to form a combined annotator having features from both of the rule-based annotator and the machine-learning annotator;
computer readable program code that evaluates, using the evaluation dataset, extraction performance of the combined annotator against a predetermined threshold; and
computer readable program code that publishes, when the extraction performance of the combined annotator exceeds the predetermined threshold, the combined annotator for use in an application that extracts the at least one target concept from a plurality of datasets.
12. A computer program product comprising:
- a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code executable by a processor and comprising:
computer readable program code that initiates a development tool that accepts user input to develop rules for extraction of features of the at least one target concept within a dataset comprising textual information;
computer readable program code that develops, using the rules for feature extraction, an evaluation dataset comprising at least one document annotated with the at least one target concept to be extracted by the text analytics program;
computer readable program code that creates, using the rules for feature extraction, a rule-based annotator to extract the at least one target concept;
computer readable program code that trains, using the evaluation dataset, a machine-learning annotator to extract the at least one target concept within the dataset;
computer readable program code that evaluates each of the rule-based annotator and the machine-learning annotator against the evaluation dataset and compares the extraction results, of each of the rule-based annotator and the machine-learning annotator, from the evaluation against a threshold for accuracy;
computer readable program code that combines, responsive to determining each of the rule-based annotator and the machine-learning annotator meet the threshold for accuracy, the rule-based annotator and the machine-learning annotator to form a combined annotator having features from both of the rule-based annotator and the machine-learning annotator;
computer readable program code that evaluates, using the evaluation dataset, extraction performance of the combined annotator against a predetermined threshold; and
computer readable program code that publishes, when the extraction performance of the combined annotator exceeds the predetermined threshold, the combined annotator for use in an application that extracts the at least one target concept from a plurality of datasets.
Dependent claims: 13, 14, 15, 16, 17, 18, 19
Specification