Corpus quality analysis

US 9,754,207 B2
Filed: 07/28/2014
Issued: 09/05/2017
Est. Priority Date: 07/28/2014
Status: Active Grant

Abstract

A mechanism is provided in a data processing system for corpus quality analysis. The mechanism applies at least one filter to a candidate corpus to determine a degree to which the candidate corpus supplements existing corpora for performing a natural language processing (NLP) operation. Responsive to a determination to add the candidate corpus to the existing corpora based on a result of applying the at least one filter, the mechanism adds the candidate corpus to the existing corpora to form modified corpora. The mechanism performs the NLP operation using the modified corpora.
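
The flow summarized in the abstract can be sketched as a simple gate. This is a minimal illustration rather than the patented implementation: the names `maybe_add_corpus` and `has_enough_documents`, the representation of a corpus as a list of strings, and the choice to stop at the first failing filter are all assumptions made for the example.

```python
from typing import Callable, List, Sequence

Document = str
Corpus = List[Document]
# A filter inspects the candidate (and, if it wishes, the existing corpora)
# and reports whether the candidate passes.
Filter = Callable[[Corpus, Sequence[Corpus]], bool]


def maybe_add_corpus(candidate: Corpus,
                     existing: List[Corpus],
                     filters: Sequence[Filter]) -> List[Corpus]:
    """Return modified corpora if every filter passes, else the existing corpora."""
    for passes in filters:
        if not passes(candidate, existing):
            return existing  # rejected: existing corpora are left unchanged
    return existing + [candidate]  # accepted: a new list, the input is not mutated


# Hypothetical filter used only for illustration: require at least two documents.
def has_enough_documents(candidate: Corpus, existing: Sequence[Corpus]) -> bool:
    return len(candidate) >= 2
```

The NLP operation itself would then run against whatever list `maybe_add_corpus` returns.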

20 Claims

1. A method, in a data processing system, for corpus quality analysis, the method comprising:
- applying at least one filter to a candidate corpus to determine a degree to which the candidate corpus supplements existing corpora for performing a natural language processing (NLP) operation, wherein the at least one filter comprises a first filter to determine whether the candidate corpus contains documents having attributes that match a set of evidence documents that are known to provide high-confidence evidence and contains documents that cover a set of questions not sufficiently covered by the current corpora, wherein applying the first filter comprises:
  
  determining desired features associated with high-confidence evidence documents within the current corpora;
  
  determining misleading features associated with misleading documents within the current corpora, wherein the misleading documents provide evidence for low-confidence or incorrect answers;
  
  determining a fraction of documents in the candidate corpus that match the desired features and a fraction of documents in the candidate corpus that match the misleading features; and
  
  comparing the fraction of documents in the candidate corpus that match the desired features and the fraction of documents in the candidate corpus that match the misleading features to a set of prerequisites for adding the candidate corpus to the existing corpora;
  
  responsive to a determination to add the candidate corpus to the existing corpora based on a result of applying the at least one filter, adding the candidate corpus to the existing corpora to form modified corpora; and
  
  performing the NLP operation using the modified corpora.
  - 2. The method of claim 1, wherein the at least one filter comprises a second filter to determine whether information can be extracted accurately from the new corpus based on general quality metrics of the candidate corpus.
  - 3. The method of claim 2, wherein applying the second filter comprises:
    - collecting the general quality metrics from the candidate corpus, wherein the general quality metrics comprise at least one of a number of good quality sentences, a number of acronyms, cluster of data types in an interested space, accuracy of an English Slot Grammar parser, a volume of data, or document structures; and
      
      comparing the general quality metrics to a set of prerequisites for adding the candidate corpus to the existing corpora.
  - 4. The method of claim 1, wherein determining features associated with high-confidence evidence documents within the current corpora comprises labeling desired documents in the current corpora that show up as evidence for answers with high confidence and extracting features from metadata of the labeled desired documents to form the desired features;
    - and wherein determining misleading features associated with misleading documents within the current corpora comprises labeling misleading documents in the current corpora that show up as evidence for incorrect answers and extracting features from metadata of the labeled misleading documents to form the misleading features.
  - 5. The method of claim 1, wherein applying the first filter further comprises:
    - collecting statistics about scope or type of incorrectly answered questions using the current corpora;
      
      generating the set of questions based on the statistics;
      
      determining a fraction of documents in the candidate corpus that cover the set of questions; and
      
      comparing the fraction of documents in the candidate corpus that cover the set of questions to the set of prerequisites for adding the candidate corpus to the existing corpora.
  - 6. The method of claim 1, wherein applying the first filter further comprises determining a fraction of documents in the candidate corpus that match the desired features and cover the set of questions and comparing the fraction of documents in the candidate corpus that match the desired features and cover the set of questions to the set of prerequisites for adding the candidate corpus to the existing corpora.
  - 7. The method of claim 1, wherein the at least one filter comprises a third filter to determine whether documents in the candidate corpus contain NLP features known to be helpful for performing the NLP operation.
  - 8. The method of claim 7, wherein applying the third filter comprises:
    - extracting a set of the most frequent NLP features from a combination of the current corpora and candidate corpus;
      
      examining evidence and candidate answers for questions answered correctly and incorrectly using the combination of the current corpora and candidate corpus;
      
      determining a set of most effective features from the set of the most frequent NLP features using a machine learning model based on the evidence and candidate answers;
      
      determining a number of the set of most effective features that are present in the candidate corpus; and
      
      comparing the number of the set of most effective features that are present in the candidate corpus to the set of prerequisites for adding the candidate corpus to the existing corpora.
  - 9. The method of claim 1, wherein the at least one filter comprises a fourth filter to determine whether information can be extracted accurately from the new corpus based on general quality metrics of the candidate corpus, a fifth filter to determine whether the candidate corpus contains documents having attributes that match a set of evidence documents that are known to provide high-confidence evidence and contains documents that cover a set of questions not sufficiently covered by the current corpora, and a sixth filter to determine whether documents in the candidate corpus contain NLP features known to be helpful for performing the NLP operation.
  - 10. The method of claim 9, wherein applying the at least one filter comprises:
    - applying the fourth filter to the candidate corpus; and
      
      responsive to the candidate corpus not passing the fourth filter, determining not to add the candidate corpus to the existing corpora.
  - 11. The method of claim 10, wherein applying the at least one filter further comprises:
    - responsive to the candidate corpus passing the fourth filter, applying the fifth filter to the candidate corpus; and
      
      responsive to the candidate corpus not passing the fifth filter, determining not to add the candidate corpus to the existing corpora.
  - 12. The method of claim 11, wherein applying the at least one filter further comprises:
    - responsive to the candidate corpus passing the fifth filter, applying the sixth filter to the candidate corpus; and
      
      responsive to the candidate corpus not passing the sixth filter, determining not to add the candidate corpus to the existing corpora.
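
The first filter recited in claim 1 compares two fractions against prerequisites. A minimal sketch follows, under stated assumptions: documents are modeled as metadata dictionaries, a feature is an (attribute, value) pair, a document "matches" a feature set if it carries at least one of its features, and the prerequisites are a minimum desired fraction and a maximum misleading fraction. None of these choices, nor the default threshold values, come from the claims.

```python
from typing import Dict, List, Set, Tuple

Document = Dict[str, str]      # assumed: metadata attribute -> value
Feature = Tuple[str, str]      # assumed: (attribute, value) pair


def matches(doc: Document, features: Set[Feature]) -> bool:
    """True if the document carries at least one of the given features."""
    return any(doc.get(attr) == value for attr, value in features)


def fraction_matching(corpus: List[Document], features: Set[Feature]) -> float:
    """Fraction of documents in the corpus that match the feature set."""
    if not corpus:
        return 0.0
    return sum(matches(doc, features) for doc in corpus) / len(corpus)


def first_filter(candidate: List[Document],
                 desired: Set[Feature],
                 misleading: Set[Feature],
                 min_desired: float = 0.5,       # hypothetical prerequisite
                 max_misleading: float = 0.2) -> bool:  # hypothetical prerequisite
    """Pass only if enough documents look desired and few enough look misleading."""
    desired_fraction = fraction_matching(candidate, desired)
    misleading_fraction = fraction_matching(candidate, misleading)
    return desired_fraction >= min_desired and misleading_fraction <= max_misleading


candidate = [
    {"source": "journal"},
    {"source": "journal"},
    {"source": "journal"},
    {"source": "forum"},
]
desired = {("source", "journal")}
misleading = {("source", "forum")}
```

With this sample, three of four documents match the desired features and one of four matches the misleading features, so the outcome depends entirely on where the prerequisite thresholds are set.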

13. A computer program product comprising a computer readable storage medium having a computer readable program stored therein, wherein the computer readable program, when executed on a question answering system, causes the question answering system to:
- apply at least one filter to a candidate corpus to determine a degree to which the candidate corpus supplements existing corpora for performing a natural language processing (NLP) operation, wherein the at least one filter comprises a first filter to determine whether the candidate corpus contains documents having attributes that match a set of evidence documents that are known to provide high-confidence evidence and contains documents that cover a set of questions not sufficiently covered by the current corpora, wherein applying the first filter comprises:
  
  determining desired features associated with high-confidence evidence documents within the current corpora;
  
  determining misleading features associated with misleading documents within the current corpora, wherein the misleading documents provide evidence for low-confidence or incorrect answers;
  
  determining a fraction of documents in the candidate corpus that match the desired features and a fraction of documents in the candidate corpus that match the misleading features; and
  
  comparing the fraction of documents in the candidate corpus that match the desired features and the fraction of documents in the candidate corpus that match the misleading features to a set of prerequisites for adding the candidate corpus to the existing corpora;
  
  responsive to a determination to add the candidate corpus to the existing corpora based on a result of applying the at least one filter, add the candidate corpus to the existing corpora to form modified corpora; and
  
  perform the NLP operation using the modified corpora.
  - 14. The computer program product of claim 13, wherein the at least one filter comprises a second filter to determine whether information can be extracted accurately from the new corpus based on general quality metrics of the candidate corpus, and wherein applying the second filter comprises:
    - collecting the general quality metrics from the candidate corpus, wherein the general quality metrics comprise at least one of a number of good quality sentences, a number of acronyms, cluster of data types in an interested space, accuracy of an English Slot Grammar parser, a volume of data, or document structures; and
      
      comparing the general quality metrics to a set of prerequisites for adding the candidate corpus to the existing corpora.
  - 15. The computer program product of claim 13, wherein applying the first filter further comprises:
    - collecting statistics about scope or type of incorrectly answered questions using the current corpora;
      
      generating the set of questions based on the statistics;
      
      determining a fraction of documents in the candidate corpus that cover the questions; and
      
      comparing the fraction of documents in the candidate corpus that cover the set of questions to the set of prerequisites for adding the candidate corpus to the existing corpora.
  - 16. The computer program product of claim 13, wherein the at least one filter comprises a third filter to determine whether documents in the candidate corpus contain NLP features known to be helpful for performing the NLP operation and wherein applying the third filter comprises:
    - extracting a set of the most frequent NLP features from a combination of the current corpora and candidate corpus;
      
      examining evidence and candidate answers for questions answered correctly and incorrectly using the combination of the current corpora and candidate corpus;
      
      determining a set of most effective features from the set of the most frequent NLP features using a machine learning model based on the evidence and candidate answers;
      
      determining a number of the set of most effective features that are present in the candidate corpus; and
      
      comparing the number of the set of most effective features that are present in the candidate corpus to the set of prerequisites for adding the candidate corpus to the existing corpora.
  - 17. The computer program product of claim 13, wherein determining features associated with high-confidence evidence documents within the current corpora comprises labeling desired documents in the current corpora that show up as evidence for answers with high confidence and extracting features from metadata of the labeled desired documents to form the desired features;
    - and wherein determining misleading features associated with misleading documents within the current corpora comprises labeling misleading documents in the current corpora that show up as evidence for incorrect answers and extracting features from metadata of the labeled misleading documents to form the misleading features.
  - 18. The computer program product of claim 13, wherein applying the first filter further comprises determining a fraction of documents in the candidate corpus that match the desired features and cover the set of questions and comparing the fraction of documents in the candidate corpus that match the desired features and cover the set of questions to the set of prerequisites for adding the candidate corpus to the existing corpora.
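
The second filter of claim 14 collects general quality metrics and compares them to prerequisites. The sketch below covers three of the listed metrics (volume of data, number of acronyms, number of good quality sentences). The claims do not define how these are measured, so the regular expressions, the "at least four words" heuristic for a good sentence, and the inclusive (minimum, maximum) form of the prerequisites are assumptions made for illustration only.

```python
import re
from typing import Dict, List, Tuple

ACRONYM = re.compile(r"\b[A-Z]{2,}\b")          # assumed: runs of 2+ capitals
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")   # split after ., ! or ?


def collect_metrics(corpus: List[str]) -> Dict[str, int]:
    """Collect a few general quality metrics from a corpus of plain-text documents."""
    sentences = [s for doc in corpus
                 for s in SENTENCE_BREAK.split(doc.strip()) if s]
    # Hypothetical heuristic: a "good" sentence has 4+ words and ends in punctuation.
    good = [s for s in sentences if len(s.split()) >= 4 and s[-1] in ".!?"]
    return {
        "volume_chars": sum(len(doc) for doc in corpus),
        "acronyms": sum(len(ACRONYM.findall(doc)) for doc in corpus),
        "good_sentences": len(good),
    }


def second_filter(corpus: List[str],
                  prerequisites: Dict[str, Tuple[float, float]]) -> bool:
    """Pass if every named metric lies within its inclusive (min, max) range."""
    metrics = collect_metrics(corpus)
    return all(low <= metrics[name] <= high
               for name, (low, high) in prerequisites.items())


corpus = [
    "NLP systems answer questions well. Short one.",
    "The QA pipeline uses IBM tools!",
]
```

Expressing each prerequisite as a range avoids assuming whether a higher or lower value of a given metric is preferable, which the claim language leaves open.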

19. An apparatus comprising:
- a processor; and
  
  a memory coupled to the processor, wherein the memory comprises instructions which, when executed by the processor, cause the processor to:
  
  apply at least one filter to a candidate corpus to determine a degree to which the candidate corpus supplements existing corpora for performing a natural language processing (NLP) operation, wherein the at least one filter comprises a second filter to determine whether the candidate corpus contains documents having attributes that match a set of evidence documents that are known to provide high-confidence evidence and contains documents that cover a set of questions not sufficiently covered by the current corpora and wherein applying the second filter comprises:
  
  determining desired features associated with high-confidence evidence documents within the current corpora;
  
  determining misleading features associated with misleading documents within the current corpora, wherein the misleading documents provide evidence for low-confidence or incorrect answers;
  
  determining a fraction of documents in the candidate corpus that match the desired features and a fraction of documents in the candidate corpus that match the misleading features; and
  
  comparing the fraction of documents in the candidate corpus that match the desired features and the fraction of documents in the candidate corpus that match the misleading features to a set of prerequisites for adding the candidate corpus to the existing corpora;
  
  responsive to a determination to add the candidate corpus to the existing corpora based on a result of applying the at least one filter, add the candidate corpus to the existing corpora to form modified corpora; and
  
  perform the NLP operation using the modified corpora.
  - 20. The apparatus of claim 19, wherein applying the first filter further comprises:
    - collecting statistics about scope or type of incorrectly answered questions using the current corpora;
      
      generating the set of questions based on the statistics;
      
      determining a fraction of documents in the candidate corpus that cover the set of questions; and
      
      comparing the fraction of documents in the candidate corpus that cover the set of questions to the set of prerequisites for adding the candidate corpus to the existing corpora.
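
The question-coverage steps recited in claim 20 can be sketched as follows. This is an illustration under assumptions, not the claimed implementation: the `Question` record, the use of per-topic counts as the "statistics", the rule that a document covers a question when it contains all of the question's keywords, and the `min_fraction` prerequisite are all hypothetical.

```python
from collections import Counter
from typing import FrozenSet, List, NamedTuple


class Question(NamedTuple):
    text: str
    topic: str                 # assumed stand-in for "scope or type"
    keywords: FrozenSet[str]   # assumed: lowercase terms a covering document must contain


def question_set(incorrect: List[Question], top_n: int = 1) -> List[Question]:
    """Keep the incorrectly answered questions from the most frequently missed topics."""
    counts = Counter(q.topic for q in incorrect)          # the collected statistics
    weak_topics = {topic for topic, _ in counts.most_common(top_n)}
    return [q for q in incorrect if q.topic in weak_topics]


def covers(doc: str, question: Question) -> bool:
    """Hypothetical coverage rule: every keyword appears as a word in the document."""
    return question.keywords <= set(doc.lower().split())


def coverage_fraction(candidate: List[str], questions: List[Question]) -> float:
    """Fraction of candidate documents that cover at least one question in the set."""
    if not candidate:
        return 0.0
    covered = sum(any(covers(doc, q) for q in questions) for doc in candidate)
    return covered / len(candidate)


def passes_coverage(candidate: List[str], questions: List[Question],
                    min_fraction: float = 0.5) -> bool:  # hypothetical prerequisite
    return coverage_fraction(candidate, questions) >= min_fraction


incorrect = [
    Question("What is the aspirin dose?", "dosage", frozenset({"aspirin", "dose"})),
    Question("What is the ibuprofen dose?", "dosage", frozenset({"ibuprofen", "dose"})),
    Question("When was Rome founded?", "history", frozenset({"rome"})),
]
candidate = [
    "recommended aspirin dose for adults",
    "history of rome",
    "ibuprofen dose limits",
    "weather report",
]
```

In this sample the "dosage" topic is missed most often, so only the two dosage questions form the set, and two of the four candidate documents cover one of them.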

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Allen, Corville O.; Freed, Andrew R.; Salmon, Richard A.; Strack, Beata J.
Primary Examiner(s): BURKE, JEFF A
Application Number: US14/444,690
Publication Number: US 20160026634A1
Time in Patent Office: 1,135 Days

CPC Class Codes
G06F 40/20   Natural language analysis s...
G06N 20/00   Machine learning
G06N 5/02   Knowledge representation; S...
