Corpus quality analysis
Abstract
A mechanism is provided in a data processing system for corpus quality analysis. The mechanism applies at least one filter to a candidate corpus to determine a degree to which the candidate corpus supplements existing corpora for performing a natural language processing (NLP) operation. Responsive to a determination to add the candidate corpus to the existing corpora based on a result of applying the at least one filter, the mechanism adds the candidate corpus to the existing corpora to form modified corpora. The mechanism performs the NLP operation using the modified corpora.
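The flow described in the abstract (apply filters, decide, merge, then run the NLP operation) can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `CorpusFilter`, `maybe_add_corpus`, `adds_new_documents`, and `count_mentions` are hypothetical, documents are modeled as plain strings, and the rule "add only if every filter passes" is one possible reading of "based on a result of applying the at least one filter".

```python
from typing import Callable, Iterable, List

Document = str
Corpus = List[Document]
# Hypothetical filter signature: (candidate, existing corpora) -> True if the filter passes.
CorpusFilter = Callable[[Corpus, List[Corpus]], bool]


def maybe_add_corpus(candidate: Corpus,
                     existing: List[Corpus],
                     filters: Iterable[CorpusFilter]) -> List[Corpus]:
    """Return modified corpora if every filter passes, else the existing corpora."""
    filters = list(filters)
    if not filters:
        raise ValueError("at least one filter is required")
    # Assumed decision rule: all filters must pass.
    if all(f(candidate, existing) for f in filters):
        return existing + [candidate]
    return existing


def adds_new_documents(candidate: Corpus, existing: List[Corpus]) -> bool:
    """Toy filter: at least half of the candidate's documents are not already present."""
    if not candidate:
        return False
    known = {doc for corpus in existing for doc in corpus}
    novel = sum(1 for doc in candidate if doc not in known)
    return novel / len(candidate) >= 0.5


def count_mentions(corpora: List[Corpus], term: str) -> int:
    """Stand-in for 'the NLP operation': count documents mentioning a term."""
    return sum(term.lower() in doc.lower() for corpus in corpora for doc in corpus)


existing = [["aspirin treats pain"]]
candidate = ["ibuprofen treats pain", "aspirin treats pain", "ibuprofen reduces fever"]
modified = maybe_add_corpus(candidate, existing, [adds_new_documents])
mentions = count_mentions(modified, "ibuprofen")
```

Here the candidate is accepted because two of its three documents are new, so the stand-in operation runs over two corpora instead of one. A real system would replace the toy filter and the counting function with its own filters and question answering pipeline.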
20 Claims
1. A method, in a data processing system, for corpus quality analysis, the method comprising:
- applying at least one filter to a candidate corpus to determine a degree to which the candidate corpus supplements existing corpora for performing a natural language processing (NLP) operation, wherein the at least one filter comprises a first filter to determine whether the candidate corpus contains documents having attributes that match a set of evidence documents that are known to provide high-confidence evidence and contains documents that cover a set of questions not sufficiently covered by the current corpora, wherein applying the first filter comprises:
  - determining desired features associated with high-confidence evidence documents within the current corpora;
  - determining misleading features associated with misleading documents within the current corpora, wherein the misleading documents provide evidence for low-confidence or incorrect answers;
  - determining a fraction of documents in the candidate corpus that match the desired features and a fraction of documents in the candidate corpus that match the misleading features; and
  - comparing the fraction of documents in the candidate corpus that match the desired features and the fraction of documents in the candidate corpus that match the misleading features to a set of prerequisites for adding the candidate corpus to the existing corpora;
- responsive to a determination to add the candidate corpus to the existing corpora based on a result of applying the at least one filter, adding the candidate corpus to the existing corpora to form modified corpora; and
- performing the NLP operation using the modified corpora.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
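The sub-steps of the first filter in claim 1 (derive desired and misleading features, measure the fraction of candidate documents matching each, compare against prerequisites) might look like the sketch below. Everything specific here is an assumption made for illustration: documents are reduced to sets of attribute labels, a feature is "determined" by appearing in at least half of a reference group, a document "matches" when it shares at least one feature, and the prerequisites are two hypothetical thresholds, `min_desired` and `max_misleading`. The question-coverage part of the first filter is not modeled.

```python
from collections import Counter
from typing import FrozenSet, List, Set

# Hypothetical representation: a document reduced to a set of attribute labels.
Document = FrozenSet[str]


def common_features(docs: List[Document], min_share: float = 0.5) -> Set[str]:
    """Attributes present in at least min_share of docs (assumed feature-selection rule)."""
    if not docs:
        return set()
    counts = Counter(attr for doc in docs for attr in doc)
    return {attr for attr, n in counts.items() if n / len(docs) >= min_share}


def matching_fraction(corpus: List[Document], features: Set[str]) -> float:
    """Fraction of documents that share at least one attribute with features."""
    if not corpus:
        return 0.0
    return sum(1 for doc in corpus if doc & features) / len(corpus)


def first_filter(candidate: List[Document],
                 high_confidence: List[Document],
                 misleading: List[Document],
                 min_desired: float = 0.6,
                 max_misleading: float = 0.25) -> bool:
    """Compare both fractions against assumed threshold prerequisites."""
    desired = common_features(high_confidence)
    # Attributes common to both groups do not discriminate, so drop them.
    bad = common_features(misleading) - desired
    return (matching_fraction(candidate, desired) >= min_desired
            and matching_fraction(candidate, bad) <= max_misleading)


high_confidence = [frozenset({"peer_reviewed", "cited"}), frozenset({"peer_reviewed"})]
misleading = [frozenset({"anonymous", "forum"}), frozenset({"anonymous"})]
candidate = [
    frozenset({"peer_reviewed"}),
    frozenset({"cited", "forum"}),
    frozenset({"peer_reviewed", "cited"}),
    frozenset({"blog"}),
    frozenset({"peer_reviewed"}),
]
accepted = first_filter(candidate, high_confidence, misleading)
```

In this toy data, four of five candidate documents match a desired feature and one of five matches a misleading feature, so both assumed prerequisites are met. The thresholds are placeholders; the claim only requires that the two fractions be compared to some set of prerequisites.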
13. A computer program product comprising a computer readable storage medium having a computer readable program stored therein, wherein the computer readable program, when executed on a question answering system, causes the question answering system to:
- apply at least one filter to a candidate corpus to determine a degree to which the candidate corpus supplements existing corpora for performing a natural language processing (NLP) operation, wherein the at least one filter comprises a first filter to determine whether the candidate corpus contains documents having attributes that match a set of evidence documents that are known to provide high-confidence evidence and contains documents that cover a set of questions not sufficiently covered by the current corpora, wherein applying the first filter comprises:
  - determining desired features associated with high-confidence evidence documents within the current corpora;
  - determining misleading features associated with misleading documents within the current corpora, wherein the misleading documents provide evidence for low-confidence or incorrect answers;
  - determining a fraction of documents in the candidate corpus that match the desired features and a fraction of documents in the candidate corpus that match the misleading features; and
  - comparing the fraction of documents in the candidate corpus that match the desired features and the fraction of documents in the candidate corpus that match the misleading features to a set of prerequisites for adding the candidate corpus to the existing corpora;
- responsive to a determination to add the candidate corpus to the existing corpora based on a result of applying the at least one filter, add the candidate corpus to the existing corpora to form modified corpora; and
- perform the NLP operation using the modified corpora.
Dependent claims: 14, 15, 16, 17, 18
19. An apparatus comprising:
- a processor; and
- a memory coupled to the processor, wherein the memory comprises instructions which, when executed by the processor, cause the processor to:
  - apply at least one filter to a candidate corpus to determine a degree to which the candidate corpus supplements existing corpora for performing a natural language processing (NLP) operation, wherein the at least one filter comprises a second filter to determine whether the candidate corpus contains documents having attributes that match a set of evidence documents that are known to provide high-confidence evidence and contains documents that cover a set of questions not sufficiently covered by the current corpora, and wherein applying the second filter comprises:
    - determining desired features associated with high-confidence evidence documents within the current corpora;
    - determining misleading features associated with misleading documents within the current corpora, wherein the misleading documents provide evidence for low-confidence or incorrect answers;
    - determining a fraction of documents in the candidate corpus that match the desired features and a fraction of documents in the candidate corpus that match the misleading features; and
    - comparing the fraction of documents in the candidate corpus that match the desired features and the fraction of documents in the candidate corpus that match the misleading features to a set of prerequisites for adding the candidate corpus to the existing corpora;
  - responsive to a determination to add the candidate corpus to the existing corpora based on a result of applying the at least one filter, add the candidate corpus to the existing corpora to form modified corpora; and
  - perform the NLP operation using the modified corpora.
Dependent claims: 20
Specification