×

Systems and methods for intelligently curating machine learning training data and improving machine learning model performance

  • US 10,679,100 B2
  • Filed: 04/10/2019
  • Issued: 06/09/2020
  • Est. Priority Date: 03/26/2018
  • Status: Active Grant
First Claim
Patent Images

1. A system for intelligently identifying machine learning training data for implementing a machine learning-based dialogue service, the system comprising:

  • one or more sources of machine learning training data;

    one or more hardware computing servers implementing a machine learning-based dialogue service that;

    constructs a corpora of machine learning test corpus that comprise a plurality of historical queries and/or historical commands test-sampled from one or more production logs of a deployed dialogue system, the corpora of machine learning test corpus relates to a baseline set of user queries and/or user commands used in calculating efficacy metrics of raw machine learning data;

    configures one or more training data sourcing parameters to source a corpora of raw machine learning training data from the one or more sources of machine learning training data;

    obtains, from the one or more sources of machine learning training data, the corpora of raw machine learning training data based on the one or more training data sourcing parameters;

    calculates, using the one or more hardware computing servers, efficacy metrics of the corpora of raw machine learning training data, wherein calculating the efficacy metrics includes;

    using the corpora of machine learning test corpus to calculate a coverage metric value that indicates a degree to which the corpora of raw machine learning training data represents possibilities of expressing a target classification intent of the corpora of machine learning test corpus;

    calculating the coverage metric value for each of a plurality of distinct corpus of machine learning training data within the corpora of raw machine learning training data, wherein calculating the coverage metric value for each of the plurality of distinct corpus of machine learning training data includes;

    [i] selecting a subject test corpus datum from within a subject distinct machine learning test corpus of the corpora of machine learning test corpus;

    [ii] constructing a plurality of diversity pairwise comprising the subject test corpus datum and each training data within a subject distinct corpus of machine learning training data of the corpora of raw machine learning training data;

    [iii] calculating a semantic similarities value of each of the plurality of diversity pairwise involving the subject test corpus training datum;

    [iv] identifying a minimum diversity metric value for the subject test corpus datum based on the semantic similarities value of each of the plurality of diversity pairwise involving the subject test corpus training datum;

    [v] calculating a minimum diversity metric value for each remaining test corpus datum within the subject distinct machine learning test corpus; and

    [vi] calculating the coverage metric value for the subject distinct corpus of machine learning training data based on the minimum diversity metric value for the subject test corpus datum and for each of the remaining test corpus datum of the subject distinct machine learning test corpus;

    calculating the coverage metric value for the corpora of raw machine learning training data based on the coverage metric value for each of the plurality of distinct corpus of machine learning training data within the corpora;

    identifies whether to train at least one machine learning classifier of the machine learning-based dialogue system using the corpora of raw machine learning training data based on whether the calculated coverage metric value satisfies a predetermined coverage metric threshold.

View all claims
  • 1 Assignment
Timeline View
Assignment View
    ×
    ×