Systems and methods for intelligently curating machine learning training data and improving machine learning model performance

US 10,679,100 B2
Filed: 04/10/2019
Issued: 06/09/2020
Est. Priority Date: 03/26/2018
Status: Active Grant

First Claim

Patent Images

1. A system for intelligently identifying machine learning training data for implementing a machine learning-based dialogue service, the system comprising:

one or more sources of machine learning training data;

one or more hardware computing servers implementing a machine learning-based dialogue service that;

constructs a corpora of machine learning test corpus that comprise a plurality of historical queries and/or historical commands test-sampled from one or more production logs of a deployed dialogue system, the corpora of machine learning test corpus relates to a baseline set of user queries and/or user commands used in calculating efficacy metrics of raw machine learning data;

configures one or more training data sourcing parameters to source a corpora of raw machine learning training data from the one or more sources of machine learning training data;

obtains, from the one or more sources of machine learning training data, the corpora of raw machine learning training data based on the one or more training data sourcing parameters;

calculates, using the one or more hardware computing servers, efficacy metrics of the corpora of raw machine learning training data, wherein calculating the efficacy metrics includes;

using the corpora of machine learning test corpus to calculate a coverage metric value that indicates a degree to which the corpora of raw machine learning training data represents possibilities of expressing a target classification intent of the corpora of machine learning test corpus;

calculating the coverage metric value for each of a plurality of distinct corpus of machine learning training data within the corpora of raw machine learning training data, wherein calculating the coverage metric value for each of the plurality of distinct corpus of machine learning training data includes;

[i] selecting a subject test corpus datum from within a subject distinct machine learning test corpus of the corpora of machine learning test corpus;

[ii] constructing a plurality of diversity pairwise comprising the subject test corpus datum and each training data within a subject distinct corpus of machine learning training data of the corpora of raw machine learning training data;

[iii] calculating a semantic similarities value of each of the plurality of diversity pairwise involving the subject test corpus training datum;

[iv] identifying a minimum diversity metric value for the subject test corpus datum based on the semantic similarities value of each of the plurality of diversity pairwise involving the subject test corpus training datum;

[v] calculating a minimum diversity metric value for each remaining test corpus datum within the subject distinct machine learning test corpus; and

[vi] calculating the coverage metric value for the subject distinct corpus of machine learning training data based on the minimum diversity metric value for the subject test corpus datum and for each of the remaining test corpus datum of the subject distinct machine learning test corpus;

calculating the coverage metric value for the corpora of raw machine learning training data based on the coverage metric value for each of the plurality of distinct corpus of machine learning training data within the corpora;

identifies whether to train at least one machine learning classifier of the machine learning-based dialogue system using the corpora of raw machine learning training data based on whether the calculated coverage metric value satisfies a predetermined coverage metric threshold.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

Systems and methods of intelligent formation and acquisition of machine learning training data for implementing an artificially intelligent dialogue system includes constructing a corpora of machine learning test corpus that comprise a plurality of historical queries and commands sampled from production logs of a deployed dialogue system; configuring training data sourcing parameters to source a corpora of raw machine learning training data from remote sources of machine learning training data; calculating efficacy metrics of the corpora of raw machine learning training data, wherein calculating the efficacy metrics includes calculating one or more of a coverage metric value and a diversity metric value of the corpora of raw machine learning training data; using the corpora of raw machine learning training data to train the at least one machine learning classifier if the calculated coverage metric value of the corpora of machine learning training data satisfies a minimum coverage metric threshold.

57 Citations

View as Search Results

17 Claims

1. A system for intelligently identifying machine learning training data for implementing a machine learning-based dialogue service, the system comprising:
- one or more sources of machine learning training data;
  
  one or more hardware computing servers implementing a machine learning-based dialogue service that;
  
  constructs a corpora of machine learning test corpus that comprise a plurality of historical queries and/or historical commands test-sampled from one or more production logs of a deployed dialogue system, the corpora of machine learning test corpus relates to a baseline set of user queries and/or user commands used in calculating efficacy metrics of raw machine learning data;
  
  configures one or more training data sourcing parameters to source a corpora of raw machine learning training data from the one or more sources of machine learning training data;
  
  obtains, from the one or more sources of machine learning training data, the corpora of raw machine learning training data based on the one or more training data sourcing parameters;
  
  calculates, using the one or more hardware computing servers, efficacy metrics of the corpora of raw machine learning training data, wherein calculating the efficacy metrics includes;
  
  using the corpora of machine learning test corpus to calculate a coverage metric value that indicates a degree to which the corpora of raw machine learning training data represents possibilities of expressing a target classification intent of the corpora of machine learning test corpus;
  
  calculating the coverage metric value for each of a plurality of distinct corpus of machine learning training data within the corpora of raw machine learning training data, wherein calculating the coverage metric value for each of the plurality of distinct corpus of machine learning training data includes;
  
  [i] selecting a subject test corpus datum from within a subject distinct machine learning test corpus of the corpora of machine learning test corpus;
  
  [ii] constructing a plurality of diversity pairwise comprising the subject test corpus datum and each training data within a subject distinct corpus of machine learning training data of the corpora of raw machine learning training data;
  
  [iii] calculating a semantic similarities value of each of the plurality of diversity pairwise involving the subject test corpus training datum;
  
  [iv] identifying a minimum diversity metric value for the subject test corpus datum based on the semantic similarities value of each of the plurality of diversity pairwise involving the subject test corpus training datum;
  
  [v] calculating a minimum diversity metric value for each remaining test corpus datum within the subject distinct machine learning test corpus; and
  
  [vi] calculating the coverage metric value for the subject distinct corpus of machine learning training data based on the minimum diversity metric value for the subject test corpus datum and for each of the remaining test corpus datum of the subject distinct machine learning test corpus;
  
  calculating the coverage metric value for the corpora of raw machine learning training data based on the coverage metric value for each of the plurality of distinct corpus of machine learning training data within the corpora;
  
  identifies whether to train at least one machine learning classifier of the machine learning-based dialogue system using the corpora of raw machine learning training data based on whether the calculated coverage metric value satisfies a predetermined coverage metric threshold.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
- - 2. The system according to claim 1, whereincalculating the efficacy metrics includes calculating a diversity metric value of the corpora of raw machine learning training data.
  - 3. The system according to claim 1, wherein the machine learning-based dialogue service further:
    - uses the corpora of raw machine learning training data, as machine learning training input, to train the at least one machine learning classifier if a calculated coverage metric value of the corpora of machine learning training data satisfies a minimum coverage metric threshold.
  - 4. The system according to claim 3, wherein the machine learning-based dialogue service further:
    - responsive to training the at least one machine learning classifier using the corpora of raw machine learning training data, deploys the at least one machine learning classifier into a live implementation of the artificially intelligent dialogue system.
  - 5. The system according to claim 1, whereinthe machine learning-based dialogue service calculates the coverage metric value of the corpora of raw machine learning training data according to the following equations:
  - 6. The system according to claim 1, whereinanalyzing the efficacy metrics of the corpora of raw machine learning training data includes:
    - calculating a diversity metric value for each of a plurality of distinct corpus of machine learning training data within the corpora of raw machine learning training data, wherein the diversity metric value relates to a measure indicating a level of heterogeneity among machine learning data within a distinct corpus of machine learning training data; and
      
      calculating an aggregated diversity metric value for the corpora of raw machine learning training data based on the diversity metric value for each of the plurality of distinct corpus of machine learning training data within the corpora.
  - 7. The system according to claim 6, whereinthe machine learning-based dialogue service calculates the diversity metric value of the corpora of raw machine learning training data according to the following equations:
  - 8. The system according to claim 1, wherein:
    - the corpora of machine learning test corpus is defined by a plurality of distinct machine learning test corpus,each of the plurality of distinct machine learning test corpus is associated with a distinct intent classification task of the machine learning-based dialogue service, andeach of the plurality of distinct machine learning test corpus includes at least one subset of the plurality of historical queries and/or historical commands obtained from a deployed dialogue system.
  - 9. The system according to claim 1, whereinthe machine learning-based dialogue service further constructs the corpora of machine learning test corpus using a plurality of engineered queries and/or engineered commands, andeach of the plurality of engineered queries and/or engineered commands is artificially generated for one or more identified intent classification tasks.
  - 10. The system according to claim 1, whereinconfiguring the one or more training data sourcing parameters includes:
    - generating a plurality of distinct sets of prompts for sourcing raw machine learning training data for each of a plurality of intent classification tasks of the machine learning-based dialogue service.
  - 11. The system according to claim 10, wherein:
    - generating the plurality of distinct sets of prompts is based on a plurality of historical user queries and/or a plurality of historical user commands,generating the plurality of distinct sets of prompts includes;
      
      test sampling by the machine learning-based dialogue service the plurality of historical user queries and/or the plurality of historical user commands from one or more production logs of a deployed dialogue system, andconverting the plurality of historical user queries and/or the plurality of historical user commands into the set of prompts for sourcing raw machine learning training data.
  - 12. The system according to claim 10, whereinthe plurality of distinct sets of prompts comprises a combination of:
    - a plurality of scenario-driven prompts, wherein each of the plurality of scenario-driven prompts describes a real-world circumstance for which a response is required; and
      
      a plurality of paraphrasing requests, wherein each of the plurality of paraphrasing requests includes an instruction to rephrase and/or paraphrase a given prompt or a given statement.
  - 13. The system according to claim 12, wherein:
    - the machine learning-based dialogue service constructs a composition of the plurality of distinct sets of prompts to include a predetermined ratio of scenario-driven prompts to paraphrasing requests, anda number of the plurality of scenario-driven prompts in the predetermined ratio of scenario-driven prompts is greater than a number of the plurality of the paraphrasing requests.
  - 14. The system according to claim 1, further comprising:
    - calculating by the machine learning-based dialogue service;
      
      an aggregated coverage metric value for the corpora of raw machine learning training data, wherein calculating the aggregated diversity metric includes;
      
      calculating an average coverage metric value by calculating a sum of the coverage metric value for each of the plurality of distinct corpus of machine learning training data that defines the corpora and dividing the sum by a number of the distinct corpus of machine learning training data within the corpora.
  - 15. The system according to claim 7, whereincalculating the diversity metric value for each of the plurality of distinct corpus of machine learning training data includes:
    - [i] selecting a subject training datum from training data within a subject distinct corpus of machine learning training data of the plurality of distinct corpus of machine learning training data;
      
      [ii] constructing a plurality of diversity pairwise comprising the subject training datum and each of a remaining training data within the subject distinct corpus of machine learning training data;
      
      [iii] calculating a semantic difference value of each of the plurality of diversity pairwise involving the subject training datum;
      
      [iv] calculating a specific diversity metric value for the subject training datum based on an average of the semantic difference value of each of the plurality of diversity pairwise involving the subject training datum;
      
      [v] calculating a specific diversity metric value for each of the remaining training data within the subject distinct corpus of machine learning training data; and
      
      [vi] calculating the diversity metric value for the subject distinct corpus of machine learning training data based on the specific diversity metric value for the subject training datum and for each of the remaining training data of the subject distinct corpus of the machine learning training data.

16. A method for intelligently curating machine learning training data for implementing a machine learning-based dialogue service, the method comprising:
- a machine learning-based dialogue service implemented by one or more hardware computing servers;
  
  constructing a corpora of machine learning test corpus that comprise a plurality of historical queries and/or historical commands test-sampled from one or more production logs of a deployed dialogue system, the corpora of machine learning test corpus relates to a baseline set of user queries and/or user commands used in calculating efficacy metrics of raw machine learning data;
  
  configuring one or more training data sourcing parameters to source a corpora of raw machine learning training data from the one or more sources of machine learning training data;
  
  obtaining, from the one or more sources of machine learning training data, the corpora of raw machine learning training data based on the one or more training data sourcing parameters;
  
  calculating, using the one or more hardware computing servers, efficacy metrics of the corpora of raw machine learning training data, wherein calculating the efficacy metrics includes;
  
  using the corpora of machine learning test corpus to calculate a coverage metric value that indicates a degree to which the corpora of raw machine learning training data represents possibilities of expressing a target classification intent of the corpora of machine learning test corpus;
  
  calculating the coverage metric value for each of a plurality of distinct corpus of machine learning training data within the corpora of raw machine learning training data;
  
  calculating the coverage metric value for the corpora of raw machine learning training data based on the coverage metric value for each of the plurality of distinct corpus of machine learning training data within the corpora; and
  
  wherein the machine learning-based dialogue service calculates the coverage metric value of the corpora of raw machine learning training data according to;

17. A method for intelligently curating machine learning training data for implementing a machine learning-based dialogue service, the method comprising:
- a machine learning-based dialogue service implemented by distributed network of computers;
  
  configuring one or more training data sourcing parameters to source a corpora of raw machine learning training data from the one or more sources of machine learning training data;
  
  obtaining, from the one or more sources of machine learning training data, the corpora of raw machine learning training data based on the one or more training data sourcing parameters;
  
  calculating, using the one or more hardware computing servers, efficacy metrics of the corpora of raw machine learning training data, wherein calculating the efficacy metrics includes using a corpora of machine learning test corpus to calculate a coverage metric value that indicates a degree to which the corpora of raw machine learning training data represents possibilities of expressing a target classification intent of the corpora of machine learning test corpus;
  
  calculating the coverage metric value for each of a plurality of distinct corpus of machine learning training data within the corpora of raw machine learning training data;
  
  calculating the coverage metric value for the corpora of raw machine learning training data based on the coverage metric value for each of the plurality of distinct corpus of machine learning training data within the corpora; and
  
  wherein the machine learning-based dialogue service calculates the coverage metric value of the corpora of raw machine learning training data according to;

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Clinc, Inc.
Original Assignee
Clinc, Inc.
Inventors
Kang, Yiping, Zhang, Yunqi, Kummerfeld, Jonathan K., Hill, Parker, Hauswald, Johann, Laurenzano, Michael A., Tang, Lingjia, Mars, Jason
Primary Examiner(s)
Afshar, Kamran
Assistant Examiner(s)
Chen, Ying Yu

Application Number

US16/379,978
Publication Number

US 20190294925A1
Time in Patent Office

426 Days
Field of Search
US Class Current
CPC Class Codes

G06F 16/3329   Natural language query form...

G06F 18/214   Generating training pattern...

G06N 20/00   Machine learning

G06N 20/10   using kernel methods, e.g. ...

G06N 20/20   Ensemble learning

G06N 3/044   Recurrent networks, e.g. Ho...

G06N 3/045   Combinations of networks

G06N 3/047   Probabilistic or stochastic...

G06N 3/084   Backpropagation, e.g. using...

G06N 3/088   Non-supervised learning, e....

G06N 5/01   Dynamic search techniques; ...

G06N 5/025   Extracting rules from data

G06N 5/04   Inference or reasoning models

G06N 7/01   Probabilistic graphical mod...

Systems and methods for intelligently curating machine learning training data and improving machine learning model performance

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

57 Citations

17 Claims

Specification

Solutions

Use Cases

Quick Links

Systems and methods for intelligently curating machine learning training data and improving machine learning model performance

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

57 Citations

17 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links