Corpus quality analysis

  • US 9,754,207 B2
  • Filed: 07/28/2014
  • Issued: 09/05/2017
  • Est. Priority Date: 07/28/2014
  • Status: Active Grant
First Claim

1. A method, in a data processing system, for corpus quality analysis, the method comprising:

  • applying at least one filter to a candidate corpus to determine a degree to which the candidate corpus supplements existing corpora for performing a natural language processing (NLP) operation, wherein the at least one filter comprises a first filter to determine whether the candidate corpus contains documents having attributes that match a set of evidence documents that are known to provide high-confidence evidence and contains documents that cover a set of questions not sufficiently covered by the current corpora, wherein applying the first filter comprises:

    determining desired features associated with high-confidence evidence documents within the current corpora;

    determining misleading features associated with misleading documents within the current corpora, wherein the misleading documents provide evidence for low-confidence or incorrect answers;

    determining a fraction of documents in the candidate corpus that match the desired features and a fraction of documents in the candidate corpus that match the misleading features; and

    comparing the fraction of documents in the candidate corpus that match the desired features and the fraction of documents in the candidate corpus that match the misleading features to a set of prerequisites for adding the candidate corpus to the existing corpora;

  • responsive to a determination to add the candidate corpus to the existing corpora based on a result of applying the at least one filter, adding the candidate corpus to the existing corpora to form modified corpora; and

  • performing the NLP operation using the modified corpora.
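
The first filter recited in the claim can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the patented implementation: the claim does not say how features are represented, what "match" means, or what the prerequisites are. Here a document is assumed to be a dict of attributes, a feature is a hypothetical predicate over a document, a document "matches" a feature set if it satisfies at least one predicate, and the prerequisites are a hypothetical pair of thresholds. All names (`Prerequisites`, `matching_fraction`, `first_filter`, `add_if_qualified`) are invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Assumption: a document is a plain dict of attributes.
Document = Dict[str, object]
# Assumption: a feature is a predicate that a document either satisfies or not.
Feature = Callable[[Document], bool]


@dataclass
class Prerequisites:
    """Hypothetical thresholds; the claim only recites 'a set of prerequisites'."""
    min_desired_fraction: float
    max_misleading_fraction: float


def matching_fraction(docs: Sequence[Document], features: Sequence[Feature]) -> float:
    """Fraction of docs satisfying at least one feature (an assumed matching rule)."""
    if not docs:
        return 0.0
    hits = sum(1 for doc in docs if any(feature(doc) for feature in features))
    return hits / len(docs)


def first_filter(
    candidate: Sequence[Document],
    desired: Sequence[Feature],
    misleading: Sequence[Feature],
    prereq: Prerequisites,
) -> bool:
    """Compare the two fractions against the prerequisites."""
    desired_fraction = matching_fraction(candidate, desired)
    misleading_fraction = matching_fraction(candidate, misleading)
    return (
        desired_fraction >= prereq.min_desired_fraction
        and misleading_fraction <= prereq.max_misleading_fraction
    )


def add_if_qualified(
    existing: List[List[Document]],
    candidate: List[Document],
    desired: Sequence[Feature],
    misleading: Sequence[Feature],
    prereq: Prerequisites,
) -> List[List[Document]]:
    """Return the modified corpora if the filter passes, else the existing corpora."""
    if first_filter(candidate, desired, misleading, prereq):
        return existing + [candidate]
    return existing


# Toy data: two of four documents look like high-confidence evidence,
# one of four looks like a misleading source.
candidate_corpus = [
    {"source": "journal"},
    {"source": "journal"},
    {"source": "forum"},
    {"source": "blog"},
]
desired_features = [lambda doc: doc["source"] == "journal"]
misleading_features = [lambda doc: doc["source"] == "forum"]
thresholds = Prerequisites(min_desired_fraction=0.4, max_misleading_fraction=0.3)

modified_corpora = add_if_qualified(
    [], candidate_corpus, desired_features, misleading_features, thresholds
)
```

In this toy run the desired fraction is 0.5 and the misleading fraction is 0.25, so the candidate corpus satisfies the assumed thresholds and is added. In practice the desired and misleading features would be derived from the existing corpora, as the claim recites, rather than written by hand.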
