Corpus quality analysis
Abstract
A mechanism is provided in a data processing system for corpus quality analysis. The mechanism applies at least one filter to a candidate corpus to determine a degree to which the candidate corpus supplements existing corpora for performing a natural language processing (NLP) operation. Responsive to a determination to add the candidate corpus to the existing corpora based on a result of applying the at least one filter, the mechanism adds the candidate corpus to the existing corpora to form modified corpora. The mechanism performs the NLP operation using the modified corpora.
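The flow in the abstract (score a candidate corpus with one or more filters, decide whether to add it, then use the modified corpora) can be sketched roughly as below. This is only an illustrative sketch under stated assumptions: the names `Corpus`, `CorpusFilter`, `novelty_filter`, and `analyze_and_merge`, the representation of a corpus as a list of document strings, the example filter, and the `threshold` decision rule are all hypothetical and do not come from the patent text.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a corpus is modeled as a plain list of document strings.
Corpus = List[str]
# Assumption: each filter returns a degree in [0, 1] to which the candidate
# supplements the existing corpora.
CorpusFilter = Callable[[Corpus, Sequence[Corpus]], float]


def novelty_filter(candidate: Corpus, existing: Sequence[Corpus]) -> float:
    """Hypothetical example filter: fraction of candidate documents not already held."""
    seen = {doc for corpus in existing for doc in corpus}
    if not candidate:
        return 0.0
    return sum(doc not in seen for doc in candidate) / len(candidate)


def analyze_and_merge(
    candidate: Corpus,
    existing: Sequence[Corpus],
    filters: Sequence[CorpusFilter],
    threshold: float = 0.5,  # hypothetical decision rule, not from the source
) -> Tuple[List[Corpus], List[float]]:
    """Apply every filter; add the candidate only if all scores reach the threshold."""
    scores = [f(candidate, existing) for f in filters]
    accept = bool(scores) and min(scores) >= threshold
    modified = list(existing) + [candidate] if accept else list(existing)
    return modified, scores
```

A downstream NLP operation, such as question answering, would then run against the returned `modified` corpora rather than the original list.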
20 Claims
1. A method, in a data processing system comprising a processor and a memory, the memory comprising instructions executed by the processor to specifically configure the processor to implement a corpus quality analysis system for corpus quality analysis, the method comprising:
- applying, by the corpus quality analysis system, at least one filter to a candidate corpus to determine a degree to which the candidate corpus supplements existing corpora for performing a natural language processing (NLP) operation, wherein the at least one filter comprises a first filter to determine whether documents in the candidate corpus contain NLP features known to be helpful for performing the NLP operation, wherein applying the first filter comprises:
  - extracting a set of the most frequent NLP features from a combination of the current corpora and candidate corpus;
  - examining evidence and candidate answers for questions answered correctly and incorrectly using the combination of the current corpora and candidate corpus;
  - determining a set of most effective features from the set of the most frequent NLP features using a machine learning model based on the evidence and candidate answers;
  - determining a number of the set of most effective features that are present in the candidate corpus; and
  - comparing the number of the set of most effective features that are present in the candidate corpus to the set of prerequisites for adding the candidate corpus to the existing corpora;
- responsive to a determination to add the candidate corpus to the existing corpora based on a result of applying the at least one filter, adding, by the corpus quality analysis system, the candidate corpus to the existing corpora to form modified corpora; and
- performing, by a question answering system executing in the data processing system, the NLP operation using the modified corpora.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
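The five sub-steps of the first filter recited in claim 1 can be illustrated with a small sketch. It is not the patented implementation, and several pieces are assumptions made only for illustration: NLP features are approximated by lowercase whitespace tokens, the claim's unspecified machine learning model is replaced by a smoothed log-odds weight (a naive-Bayes-style score) over evidence from correctly and incorrectly answered questions, and the "set of prerequisites" is reduced to a single hypothetical `min_present` count. The function names and the parameters `k`, `n`, and `min_present` are likewise hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence, Set, Tuple


def features(text: str) -> List[str]:
    """Assumption: an 'NLP feature' is just a lowercase whitespace token."""
    return text.lower().split()


def most_frequent_features(docs: Iterable[str], k: int) -> List[str]:
    """Step 1: the k most frequent features across the combined corpora."""
    counts = Counter(f for doc in docs for f in features(doc))
    return [f for f, _ in counts.most_common(k)]


def most_effective_features(
    frequent: Sequence[str],
    qa_records: Sequence[Tuple[str, bool]],  # (evidence text, answered correctly?)
    n: int,
) -> Set[str]:
    """Steps 2-3: rank frequent features by how strongly they mark correct answers.

    Stand-in for the claim's machine learning model: add-one smoothed log-odds.
    """
    right = [set(features(t)) for t, ok in qa_records if ok]
    wrong = [set(features(t)) for t, ok in qa_records if not ok]

    def weight(f: str) -> float:
        p_right = (sum(f in s for s in right) + 1) / (len(right) + 2)
        p_wrong = (sum(f in s for s in wrong) + 1) / (len(wrong) + 2)
        return math.log(p_right / p_wrong)

    ranked = sorted(frequent, key=weight, reverse=True)
    # Keep at most n features, and only those with a positive weight.
    return {f for f in ranked[:n] if weight(f) > 0}


def first_filter(
    existing_docs: Sequence[str],
    candidate_docs: Sequence[str],
    qa_records: Sequence[Tuple[str, bool]],
    k: int = 50,
    n: int = 10,
    min_present: int = 2,  # hypothetical prerequisite
) -> Tuple[bool, int]:
    """Steps 4-5: count effective features in the candidate and compare to the prerequisite."""
    frequent = most_frequent_features(list(existing_docs) + list(candidate_docs), k)
    effective = most_effective_features(frequent, qa_records, n)
    candidate_features = {f for doc in candidate_docs for f in features(doc)}
    present = len(effective & candidate_features)
    return present >= min_present, present
```

In a real system the feature extractor would more plausibly emit entities, relations, or parse patterns, and the weights would come from a trained model; the structure of the five steps would stay the same.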
9. A computer program product comprising a computer readable storage medium having a computer readable program stored therein, wherein the computer readable program, when executed on a question answering system, causes the question answering system to implement a corpus quality analysis system for corpus quality analysis, wherein the computer readable program causes the data processing system to:
- apply, by the corpus quality analysis system, at least one filter to a candidate corpus to determine a degree to which the candidate corpus supplements existing corpora for performing a natural language processing (NLP) operation, wherein the at least one filter comprises a first filter to determine whether documents in the candidate corpus contain NLP features known to be helpful for performing the NLP operation, wherein applying the first filter comprises:
  - extracting a set of the most frequent NLP features from a combination of the current corpora and candidate corpus;
  - examining evidence and candidate answers for questions answered correctly and incorrectly using the combination of the current corpora and candidate corpus;
  - determining a set of most effective features from the set of the most frequent NLP features using a machine learning model based on the evidence and candidate answers;
  - determining a number of the set of most effective features that are present in the candidate corpus; and
  - comparing the number of the set of most effective features that are present in the candidate corpus to the set of prerequisites for adding the candidate corpus to the existing corpora;
- responsive to a determination to add the candidate corpus to the existing corpora based on a result of applying the at least one filter, add, by the corpus quality analysis system, the candidate corpus to the existing corpora to form modified corpora; and
- perform, by a question answering system executing in the data processing system, the NLP operation using the modified corpora.
Dependent claims: 10, 11, 13, 14, 15
12. An apparatus comprising:
- a processor; and
- a memory coupled to the processor, wherein the memory comprises instructions which, when executed by the processor, cause the processor to implement a corpus quality analysis system for corpus quality analysis, wherein the instructions cause the processor to:
  - apply, by the corpus quality analysis system, at least one filter to a candidate corpus to determine a degree to which the candidate corpus supplements existing corpora for performing a natural language processing (NLP) operation, wherein the at least one filter comprises a first filter to determine whether documents in the candidate corpus contain NLP features known to be helpful for performing the NLP operation, wherein applying the first filter comprises:
    - extracting a set of the most frequent NLP features from a combination of the current corpora and candidate corpus;
    - examining evidence and candidate answers for questions answered correctly and incorrectly using the combination of the current corpora and candidate corpus;
    - determining a set of most effective features from the set of the most frequent NLP features using a machine learning model based on the evidence and candidate answers;
    - determining a number of the set of most effective features that are present in the candidate corpus; and
    - comparing the number of the set of most effective features that are present in the candidate corpus to the set of prerequisites for adding the candidate corpus to the existing corpora;
  - responsive to a determination to add the candidate corpus to the existing corpora based on a result of applying the at least one filter, add, by the corpus quality analysis system, the candidate corpus to the existing corpora to form modified corpora; and
  - perform, by a question answering system executing in the data processing system, the NLP operation using the modified corpora.
Dependent claims: 16, 17, 18, 19, 20
Specification