
Corpus quality analysis

  • US 10,169,706 B2
  • Filed: 08/31/2017
  • Issued: 01/01/2019
  • Est. Priority Date: 07/28/2014
  • Status: Active Grant
First Claim

1. A method, in a data processing system comprising a processor and a memory, the memory comprising instructions executed by the processor to specifically configure the processor to implement a corpus quality analysis system for corpus quality analysis, the method comprising:

  • applying, by the corpus quality analysis system, at least one filter to a candidate corpus to determine a degree to which the candidate corpus supplements existing corpora for performing a natural language processing (NLP) operation, wherein the at least one filter comprises a first filter to determine whether documents in the candidate corpus contain NLP features known to be helpful for performing the NLP operation, wherein applying the first filter comprises:

    extracting a set of the most frequent NLP features from a combination of the current corpora and candidate corpus;

    examining evidence and candidate answers for questions answered correctly and incorrectly using the combination of the current corpora and candidate corpus;

    determining a set of most effective features from the set of the most frequent NLP features using a machine learning model based on the evidence and candidate answers;

    determining a number of the set of most effective features that are present in the candidate corpus; and

    comparing the number of the set of most effective features that are present in the candidate corpus to the set of prerequisites for adding the candidate corpus to the existing corpora;

  • responsive to a determination to add the candidate corpus to the existing corpora based on a result of applying the at least one filter, adding, by the corpus quality analysis system, the candidate corpus to the existing corpora to form modified corpora; and

  • performing, by a question answering system executing in the data processing system, the NLP operation using the modified corpora.
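
The first filter recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation, and every name in it is hypothetical. It assumes that each document is represented as a set of NLP feature labels, that each question-answering record pairs the features found in its evidence with a flag saying whether the question was answered correctly, and that the "set of prerequisites" reduces to a single minimum count. A simple per-feature precision score stands in for the machine learning model, which the claim does not specify.

```python
from collections import Counter


def most_frequent_features(docs, top_k):
    """Return up to top_k features ranked by how many documents contain them."""
    counts = Counter(f for doc in docs for f in set(doc))
    return [f for f, _ in counts.most_common(top_k)]


def most_effective_features(frequent, qa_records, min_precision=0.5):
    """Keep frequent features that mostly co-occur with correct answers.

    ASSUMPTION: this precision score is a stand-in for the machine
    learning model named in the claim, which is not described here.
    """
    correct, total = Counter(), Counter()
    for evidence_features, answered_correctly in qa_records:
        for f in set(evidence_features):
            total[f] += 1
            if answered_correctly:
                correct[f] += 1
    return {f for f in frequent
            if total[f] and correct[f] / total[f] > min_precision}


def first_filter(existing, candidate, qa_records, top_k, min_effective):
    """Apply the first filter; True means the candidate meets the prerequisite."""
    frequent = most_frequent_features(existing + candidate, top_k)
    effective = most_effective_features(frequent, qa_records)
    present = {f for doc in candidate for f in doc}
    # ASSUMPTION: the prerequisite is a minimum number of effective features.
    return len(effective & present) >= min_effective


def maybe_add(existing, candidate, qa_records, top_k=10, min_effective=2):
    """Return the modified corpora if the filter passes, else the original."""
    if first_filter(existing, candidate, qa_records, top_k, min_effective):
        return existing + candidate
    return existing


# Hypothetical toy data: feature labels are illustrative only.
existing = [{"date", "person"}, {"person", "location"}]
candidate = [{"date", "drug"}, {"drug", "dosage"}]
qa_records = [
    ({"drug", "date"}, True),
    ({"drug"}, True),
    ({"person"}, False),
    ({"dosage"}, True),
    ({"location"}, False),
]
modified = maybe_add(existing, candidate, qa_records)
```

In this toy data the candidate corpus carries three features that the stand-in score treats as effective, so it clears the assumed threshold of two and is appended to the existing corpora. A question answering system would then run against `modified`.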
