Unified text analytics annotator development life cycle combining rule-based and machine learning based techniques

US 10,289,963 B2
Filed: 02/27/2017
Issued: 05/14/2019
Est. Priority Date: 02/27/2017
Status: Active Grant

Abstract

One embodiment provides a method for developing a text analytics program for extracting at least one target concept including: utilizing at least one processor to execute computer code that performs the steps of: initiating a development tool that accepts user input to develop rules for extraction of features of the at least one target concept within a dataset comprising textual information; developing, using the rules for feature extraction, an evaluation dataset comprising at least one document annotated with the at least one target concept to be extracted by the text analytics program; creating, using the rules for feature extraction, a rule-based annotator to extract the at least one target concept; training, using the evaluation dataset, a machine-learning annotator to extract the at least one target concept within the dataset; combining the rule-based annotator and the machine learning annotator to form a combined annotator; evaluating, using the evaluation dataset, extraction performance of the combined annotator against a predetermined threshold; and publishing, when the extraction performance of the combined annotator exceeds the predetermined threshold, the combined annotator for use in an application that extracts the at least one target concept from a plurality of datasets.
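
The combine, evaluate, and publish steps of the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration, not something the patent specifies: annotators are modeled as functions from text to a set of character spans, extraction performance is measured as micro-averaged F1, and the combined annotator is the union of both outputs. The independent claims additionally require each component annotator to meet an accuracy threshold before combining, so the sketch includes that gate as well.

```python
from typing import Callable, Dict, Optional, Set, Tuple

Span = Tuple[int, int]                   # (start, end) character offsets
Annotator = Callable[[str], Set[Span]]   # text -> extracted spans
EvalSet = Dict[str, Set[Span]]           # document text -> annotated (gold) spans


def f1_score(annotator: Annotator, eval_set: EvalSet) -> float:
    """Micro-averaged F1 over the evaluation dataset (assumed metric)."""
    tp = fp = fn = 0
    for text, gold in eval_set.items():
        predicted = annotator(text)
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def combine(first: Annotator, second: Annotator) -> Annotator:
    """Assumed combination strategy: union of both annotators' spans."""
    def combined(text: str) -> Set[Span]:
        return first(text) | second(text)
    return combined


def develop(rule_based: Annotator, machine_learned: Annotator, eval_set: EvalSet,
            accuracy_threshold: float, publish_threshold: float) -> Optional[Annotator]:
    """Return the combined annotator if it may be published, else None."""
    # Gate from the independent claims: each component must meet the threshold.
    for component in (rule_based, machine_learned):
        if f1_score(component, eval_set) < accuracy_threshold:
            return None
    combined = combine(rule_based, machine_learned)
    # Publish only when the combined performance exceeds the threshold.
    if f1_score(combined, eval_set) > publish_threshold:
        return combined
    return None
```

A `None` result corresponds to the case where the annotator is not published and further refinement would be needed.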

19 Claims

1. A method for developing a text analytics program for extracting at least one target concept comprising:
- utilizing at least one processor to execute computer code that performs the steps of:
  
  initiating a development tool that accepts user input to develop rules for extraction of features of the at least one target concept within a dataset comprising textual information;
  
  developing, using the rules for feature extraction, an evaluation dataset comprising at least one document annotated with the at least one target concept to be extracted by the text analytics program;
  
  creating, using the rules for feature extraction, a rule-based annotator to extract the at least one target concept;
  
  training, using the evaluation dataset, a machine-learning annotator to extract the at least one target concept within the dataset;
  
  evaluating each of the rule-based annotator and the machine-learning annotator against the evaluation dataset and comparing the extraction results, of each of the rule-based annotator and the machine-learning annotator, from the evaluation against a threshold for accuracy;
  
  combining, responsive to determining each of the rule-based annotator and the machine-learning annotator meet the threshold for accuracy, the rule-based annotator and the machine-learning annotator to form a combined annotator having features from both of the rule-based annotator and the machine-learning annotator;
  
  evaluating, using the evaluation dataset, extraction performance of the combined annotator against a predetermined threshold; and
  
  publishing, when the extraction performance of the combined annotator exceeds the predetermined threshold, the combined annotator for use in an application that extracts the at least one target concept from a plurality of datasets.
  - 2. The method of claim 1, wherein, in response to the extraction performance of the combined annotator being below the predetermined threshold, performing one of the following:
    - iterating through one or more of the initiating, developing, creating, training, combining, and evaluating to form a refined, combined annotator.
  - 3. The method of claim 1, wherein one or more of the rule-based annotator and the machine learning annotator are based on one or more pre-existing annotators that is available from a catalog of annotators stored in memory.
  - 4. The method of claim 1, wherein the training dataset is annotated using one or more pre-existing annotators available from a catalog of annotators stored in memory.
  - 5. The method of claim 1, wherein the evaluation dataset comprises a testing dataset and a training dataset.
  - 6. The method of claim 5, wherein the training a machine-learning annotator comprises using the training dataset.
  - 7. The method of claim 5, wherein the testing dataset is reserved for testing an annotator selected from the group consisting of the rule-based annotator and the machine learning annotator.
  - 8. The method of claim 1, wherein the user input comprises at least one rule selected from the group consisting of:
    - a regular expression, a dictionary of terms, parts of speech, and simple patterns.
  - 9. The method of claim 1, comprising providing a user interface for creating the combined annotator.
  - 10. The method of claim 9, wherein the user interface comprises graphical display elements facilitating formation of the combined annotator.
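
As an illustration of the rule types listed in claim 8, the following Python sketch builds a rule-based annotator from a regular expression and a dictionary of terms. The helper names (`regex_rule`, `dictionary_rule`, `rule_based_annotator`) and the "date" target concept are hypothetical choices for this example; the claims do not prescribe any particular implementation.

```python
import re
from typing import Callable, Iterable, Set, Tuple

Span = Tuple[int, int]               # (start, end) character offsets
Rule = Callable[[str], Set[Span]]    # text -> spans matched by one rule


def regex_rule(pattern: str) -> Rule:
    """Rule that marks every match of a regular expression."""
    compiled = re.compile(pattern)
    return lambda text: {m.span() for m in compiled.finditer(text)}


def dictionary_rule(terms: Iterable[str]) -> Rule:
    """Rule that marks whole-word, case-insensitive dictionary terms."""
    terms = list(terms)
    if not terms:
        return lambda text: set()
    # Longest terms first so longer entries win over their prefixes.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    compiled = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return lambda text: {m.span() for m in compiled.finditer(text)}


def rule_based_annotator(rules: Iterable[Rule]) -> Rule:
    """Annotator that returns the union of the spans found by all rules."""
    rules = list(rules)

    def annotate(text: str) -> Set[Span]:
        spans: Set[Span] = set()
        for rule in rules:
            spans |= rule(text)
        return spans

    return annotate


# Hypothetical target concept: date mentions.
annotate_dates = rule_based_annotator([
    regex_rule(r"\d{4}-\d{2}-\d{2}"),
    dictionary_rule(["today", "tomorrow"]),
])
```

Calling `annotate_dates("Ship today or on 2017-02-27.")` yields one span for the dictionary term and one for the ISO-style date.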

11. An apparatus for developing a text analytics program for extracting at least one target concept, comprising:
- at least one processor, and a computer readable storage medium having computer readable program code embodied therewith and executable by the at least one processor, the computer readable program code comprising:
  
  computer readable program code that initiates a development tool that accepts user input to develop rules for extraction of features of the at least one target concept within a dataset comprising textual information;
  
  computer readable program code that develops, using the rules for feature extraction, an evaluation dataset comprising at least one document annotated with the at least one target concept to be extracted by the text analytics program;
  
  computer readable program code that creates, using the rules for feature extraction, a rule-based annotator to extract the at least one target concept;
  
  computer readable program code that trains, using the evaluation dataset, a machine-learning annotator to extract the at least one target concept within the dataset;
  
  computer readable program code that evaluates each of the rule-based annotator and the machine-learning annotator against the evaluation dataset and compares the extraction results, of each of the rule-based annotator and the machine-learning annotator, from the evaluation against a threshold for accuracy;
  
  computer readable program code that combines, responsive to determining each of the rule-based annotator and the machine-learning annotator meet the threshold for accuracy, the rule-based annotator and the machine-learning annotator to form a combined annotator having features from both of the rule-based annotator and the machine-learning annotator;
  
  computer readable program code that evaluates, using the evaluation dataset, extraction performance of the combined annotator against a predetermined threshold; and
  
  computer readable program code that publishes, when the extraction performance of the combined annotator exceeds the predetermined threshold, the combined annotator for use in an application that extracts the at least one target concept from a plurality of datasets.

12. A computer program product comprising:
- a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code executable by a processor and comprising:
  
  computer readable program code that initiates a development tool that accepts user input to develop rules for extraction of features of the at least one target concept within a dataset comprising textual information;
  
  computer readable program code that develops, using the rules for feature extraction, an evaluation dataset comprising at least one document annotated with the at least one target concept to be extracted by the text analytics program;
  
  computer readable program code that creates, using the rules for feature extraction, a rule-based annotator to extract the at least one target concept;
  
  computer readable program code that trains, using the evaluation dataset, a machine-learning annotator to extract the at least one target concept within the dataset;
  
  computer readable program code that evaluates each of the rule-based annotator and the machine-learning annotator against the evaluation dataset and compares the extraction results, of each of the rule-based annotator and the machine-learning annotator, from the evaluation against a threshold for accuracy;
  
  computer readable program code that combines, responsive to determining each of the rule-based annotator and the machine-learning annotator meet the threshold for accuracy, the rule-based annotator and the machine-learning annotator to form a combined annotator having features from both of the rule-based annotator and the machine-learning annotator;
  
  computer readable program code that evaluates, using the evaluation dataset, extraction performance of the combined annotator against a predetermined threshold; and
  
  computer readable program code that publishes, when the extraction performance of the combined annotator exceeds the predetermined threshold, the combined annotator for use in an application that extracts the at least one target concept from a plurality of datasets.
  - 13. The computer program product of claim 12, wherein, responsive to the extraction performance of the combined annotator being below the predetermined threshold, performing one of the following:
    - iterating through one or more of the initiating, developing, creating, training, combining, and evaluating to form a refined, combined annotator.
  - 14. The computer program product of claim 12, wherein one or more of the rule-based annotator and the machine learning annotator are based on one or more pre-existing annotators available from a catalog of annotators stored in memory.
  - 15. The computer program product of claim 12, wherein the training dataset is annotated using one or more pre-existing annotators that is available from a catalog of annotators stored in memory.
  - 16. The computer program product of claim 12, wherein the evaluation dataset comprises a testing dataset and a training dataset and wherein the training a machine-learning annotator comprises using the training dataset.
  - 17. The computer program product of claim 12, wherein the evaluation dataset comprises a testing dataset and a training dataset and wherein the testing dataset is reserved for testing an annotator selected from the group consisting of the rule-based annotator and the machine learning annotator.
  - 18. The computer program product of claim 12, wherein the user input comprises at least one rule selected from the group consisting of:
    - a regular expression, a dictionary of terms, parts of speech, and simple patterns.
  - 19. The computer program product of claim 12, comprising providing a user interface for creating the combined annotator.
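
Claims 16 and 17 split the evaluation dataset into a training dataset, used to train the machine-learning annotator, and a testing dataset reserved for testing. A minimal Python sketch of that split follows. The token-memorizing learner is a deliberately simple stand-in for a real machine-learning model, and the function names, the 25% test fraction, and the exact-match score are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Callable, List, Set, Tuple

Document = Tuple[str, Set[str]]   # (text, annotated target-concept tokens)


def split_evaluation_dataset(documents: List[Document],
                             test_fraction: float = 0.25,
                             seed: int = 0) -> Tuple[List[Document], List[Document]]:
    """Shuffle reproducibly and hold out a testing dataset."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]   # (training, testing)


def train_token_annotator(training: List[Document]) -> Callable[[str], Set[str]]:
    """Toy learner: keep tokens annotated as targets in most occurrences."""
    positive: Counter = Counter()
    total: Counter = Counter()
    for text, gold in training:
        for token in text.lower().split():
            total[token] += 1
            if token in gold:
                positive[token] += 1
    learned = {token for token in total if positive[token] * 2 > total[token]}

    def annotate(text: str) -> Set[str]:
        return {token for token in text.lower().split() if token in learned}

    return annotate


def exact_match_rate(annotator: Callable[[str], Set[str]],
                     testing: List[Document]) -> float:
    """Fraction of test documents whose extraction matches the annotation."""
    hits = sum(1 for text, gold in testing if annotator(text) == gold)
    return hits / len(testing)
```

Because the annotator is trained only on the training portion, the score on the held-out testing portion reflects performance on documents the learner has not seen.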

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Chiticariu, Laura; Kreulen, Jeffrey Thomas; Krishnamurthy, Rajasekar; Sen, Prithviraj; Vaithyanathan, Shivakumar
Primary Examiner(s): Huynh, Cong-Lac
Application Number: US15/444,051
Publication Number: US 20180246867A1
Time in Patent Office: 806 Days
Field of Search: 715/200, 715/230

CPC Class Codes
G06F 16/35   Clustering; Classification
G06F 16/367   Ontology
G06F 40/169   Annotation, e.g. comment da...
G06N 20/00   Machine learning
G06N 5/025   Extracting rules from data
