Target variable distribution-based acceptance of machine learning test data sets

US 10,726,356 B1
Filed: 08/01/2016
Issued: 07/28/2020
Est. Priority Date: 08/01/2016
Status: Active Grant

Abstract

Respective statistical distributions of a target variable within a proposed training data set and a proposed test data set for a machine learning model are obtained. A metric indicative of the difference between the two statistical distributions is computed. The difference metric is used to determine whether the proposed test data set is acceptable to evaluate the machine learning model.
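
As a rough illustration of the flow summarized in the abstract, the sketch below builds the target variable's distribution for a training set and a test set, computes a difference metric, and accepts the test set only when the metric stays under a threshold. The function names, the choice of total variation distance as the metric, and the 0.1 threshold are assumptions made for this example; the abstract does not specify any of them.

```python
from collections import Counter


def distribution(values):
    """Return the relative frequency of each distinct target value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def total_variation(p, q):
    """Half the L1 distance between two discrete distributions (a stand-in metric)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def is_test_set_acceptable(train_targets, test_targets, threshold=0.1):
    """Accept the test set if its target distribution is close to the training one.

    The threshold is a hypothetical acceptance criterion, not one from the patent.
    """
    metric = total_variation(distribution(train_targets), distribution(test_targets))
    return metric <= threshold, metric


train = ["spam"] * 30 + ["ham"] * 70
good_test = ["spam"] * 3 + ["ham"] * 7      # same class proportions as train
skewed_test = ["spam"] * 8 + ["ham"] * 2    # proportions roughly reversed

accepted_good, metric_good = is_test_set_acceptable(train, good_test)
accepted_skewed, metric_skewed = is_test_set_acceptable(train, skewed_test)
print(accepted_good, round(metric_good, 3))
print(accepted_skewed, round(metric_skewed, 3))
```

In this sketch a rejected test set would simply be reported back to the caller; how a service would respond (for example, by recommending a different split) is left out.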

21 Claims

1. A system, comprising:
- one or more computing devices of a machine learning service of a provider network, wherein the one or more computing devices are configured to:
  
  identify, with respect to a particular machine learning model to be trained on behalf of a client to predict values of a target variable, a proposed training data set and a proposed test data set, wherein the target variable is an output variable of the particular machine learning model;
  
  determine that the proposed test data set meets a triggering criterion for execution of a selected target variable distribution comparison algorithm;
  
  obtain, based on an examination of at least a portion of the proposed training data set, a first statistical distribution of the target variable within the proposed training data set in accordance with the selected target variable distribution algorithm;
  
  obtain, based on an examination of at least a portion of the proposed test data set, a second statistical distribution of the target variable within the proposed test data set;
  
  compute a metric indicative of a difference between the first statistical distribution and the second statistical distribution;
  
  determine an acceptance criterion for evaluating the particular machine learning model, wherein said evaluating is to be performed after the particular machine learning model has been trained using the proposed training data set;
  
  determine, based at least in part on the metric, that the proposed test data set meets the acceptance criterion for evaluating the particular machine learning model; and
  
  provide, to the client, an indication of a prediction quality metric of the particular machine learning model, wherein the prediction quality metric is obtained using the proposed test data set.
- Dependent claims (2, 3, 4, 5):
  - 2. The system as recited in claim 1, wherein the target variable comprises a categorical variable, and wherein to obtain the second statistical distribution, the one or more computing devices are configured to:
    - generate a histogram comprising a plurality of buckets, wherein a particular bucket of the plurality of buckets corresponds to a particular value of the categorical variable;
      
      and wherein the metric indicative of the difference between the first statistical distribution and the second statistical distribution comprises a variant of a Kullback-Leibler divergence.
  - 3. The system as recited in claim 1, wherein the machine learning model comprises a linear regression model, and wherein to obtain the second statistical distribution, the one or more computing devices are configured to:
    - generate an approximate quantile summary of the observed values of the target variable in the proposed test data set;
      
      and wherein the metric indicative of the difference between the first statistical distribution and the second statistical distribution comprises a variant of a Kolmogorov-Smirnov statistic.
  - 4. The system as recited in claim 1, wherein the one or more computing devices are configured to:
    - transmit, via a graphical programmatic interface to the client, a representation of the difference between the first statistical distribution and the second statistical distribution.
  - 5. The system as recited in claim 1, wherein the one or more computing devices are configured to:
    - determine, based at least in part on using the selected target variable distribution comparison algorithm, that a second proposed test data set is not acceptable for evaluating a second machine learning model on behalf of the client; and
      
      transmit, via the programmatic interface to the client, a recommendation to utilize a particular split algorithm to generate a different test data set.
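
The histogram-based comparison recited for categorical target variables (claim 2) might look something like the following sketch. Each bucket corresponds to one category value, and the metric is a Kullback-Leibler divergence with additive smoothing. The smoothing constant `alpha`, the direction of the divergence (training relative to test), and all function names are assumptions for illustration; the claim says only "a variant of a Kullback-Leibler divergence".

```python
import math
from collections import Counter


def histogram(values, categories):
    """Count observations per category, one bucket per distinct category value."""
    counts = Counter(values)
    return [counts.get(c, 0) for c in categories]


def smoothed_kl(train_counts, test_counts, alpha=1.0):
    """KL(train || test) with additive smoothing (an assumed variant).

    Smoothing keeps the result finite when a bucket is empty in the test set.
    """
    k = len(train_counts)
    n_train = sum(train_counts) + alpha * k
    n_test = sum(test_counts) + alpha * k
    kl = 0.0
    for a, b in zip(train_counts, test_counts):
        p = (a + alpha) / n_train
        q = (b + alpha) / n_test
        kl += p * math.log(p / q)
    return kl


categories = ["red", "green", "blue"]
train_hist = histogram(["red"] * 50 + ["green"] * 30 + ["blue"] * 20, categories)
similar_hist = histogram(["red"] * 5 + ["green"] * 3 + ["blue"] * 2, categories)
missing_hist = histogram(["red"] * 10, categories)  # two categories never appear

kl_similar = smoothed_kl(train_hist, similar_hist)
kl_missing = smoothed_kl(train_hist, missing_hist)
print(kl_similar < kl_missing)
```

A test set that omits whole categories yields a much larger divergence than one that mirrors the training proportions, which is the signal an acceptance check would threshold on.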

6. A method, comprising:
- performing, by one or more computing devices:
  
  obtaining, based on an examination of at least a portion of a proposed training data set for a machine learning model, a first statistical distribution of a target variable within the proposed training data set, wherein the target variable is an output variable of the machine learning model, and wherein values of the target variable are to be predicted by the machine learning model;
  
  obtaining, based on an examination of at least a portion of a proposed test data set for the machine learning model, a second statistical distribution of the target variable within the proposed test data set;
  
  determining an acceptance criterion for evaluating the machine learning model;
  
  determining, based at least in part on a metric indicative of a difference between the first statistical distribution and the second statistical distribution, that the proposed test data set fails to meet the acceptance criterion for the machine learning model; and
  
  providing, via a programmatic interface, an indication that a test data set other than the proposed test data set should be utilized to evaluate the machine learning model.
- Dependent claims (7, 8, 9, 10, 11, 12, 13, 14, 15):
  - 7. The method as recited in claim 6, wherein the target variable comprises a categorical variable, wherein said obtaining the second statistical distribution comprises:
    - constructing a histogram comprising a plurality of buckets, wherein a particular bucket of the plurality of buckets corresponds to a particular value of the categorical variable;
      
      and wherein the metric indicative of the difference between the first statistical distribution and the second statistical distribution comprises a variant of a Kullback-Leibler divergence.
  - 8. The method as recited in claim 6, wherein the machine learning model comprises a linear regression model, wherein said obtaining the second statistical distribution comprises:
    - generating an approximate quantile summary of the observed values of the target variable in the proposed test data set;
      
      and wherein the metric indicative of the difference between the first statistical distribution and the second statistical distribution comprises a variant of a Kolmogorov-Smirnov statistic.
  - 9. The method as recited in claim 8, wherein said generating the approximate quantile summary comprises executing a Greenwald-Khanna algorithm.
  - 10. The method as recited in claim 6, further comprising performing, by the one or more computing devices:
    - computing a first value representing a difference between (a) a cumulative distribution function associated with a first quantile of the target variable distribution in the proposed training data set and (b) a cumulative distribution function associated with the first quantile of the target variable distribution in the proposed test data set;
      
      computing a second value representing a difference between (a) a cumulative distribution function associated with a second quantile of the target variable distribution in the proposed training data set and (b) a cumulative distribution function associated with the second quantile of the target variable distribution in the proposed test data set, wherein the second value is smaller than the first value;
      
      utilizing the second value as the metric of the difference between the first statistical distribution and the second statistical distribution.
  - 11. The method as recited in claim 6, further comprising performing, by the one or more computing devices:
    - transmitting, via a graphical programmatic interface to a client, a representation of the difference between the first statistical distribution and the second statistical distribution.
  - 12. The method as recited in claim 6, further comprising performing, by the one or more computing devices:
    - transmitting, via the programmatic interface to a client, a recommendation to utilize a particular split algorithm to generate a different test data set.
  - 13. The method as recited in claim 6, further comprising performing, by the one or more computing devices:
    - determining that a particular data set comprising a plurality of observation records is to be split into a first training data set and a first test data set on behalf of a client;
      
      splitting the particular data set using a selected split algorithm; and
      
      verifying, prior to providing an indication to the client that the particular data set has been split, that the first test data set meets an acceptance criterion, wherein said verifying comprises using a selected target variable distribution comparison algorithm.
  - 14. The method as recited in claim 6, further comprising performing, by the one or more computing devices:
    - determining, based at least in part on an execution of a selected target variable distribution comparison algorithm, that a second proposed data set does not meet an acceptance criterion with respect to a second machine learning model;
      
      introducing one or more synthetic observation records into the second proposed test data set;
      
      verifying, using the selected target variable distribution comparison algorithm, that the modified version of the second proposed test data set meets the acceptance criterion with respect to the second machine learning model, wherein the modified version comprises the one or more synthetic observation records; and
      
      utilizing the modified version of the second proposed test data set to evaluate the second machine learning model.
  - 15. The method as recited in claim 6, further comprising performing, by the one or more computing devices:
    - determining that predictions for target variable values generated by the machine learning model with respect to a particular set of post-evaluation observation records do not meet a quality goal, wherein the post-evaluation observation records are provided as input to the machine learning model after the machine learning model has been trained using a second training data set which differs from the proposed training data set;
      
      initiating an execution of an algorithm to compare distributions of one or more variables of the second training data set and the particular set of post-evaluation observation records; and
      
      based at least in part on a result of the algorithm, initiating a re-training of the machine learning model using a third training data set.
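
For the quantile-based comparison mentioned in claims 8 through 10, a minimal sketch is shown below. It evaluates the empirical cumulative distribution functions of the training and test target values at a small set of quantile points and takes the largest gap, which is the classic two-sample Kolmogorov-Smirnov statistic restricted to those points. The quantile summary here is computed exactly from sorted values as a stand-in; it is not the Greenwald-Khanna algorithm named in claim 9, and the function names and the default of 10 quantiles are assumptions.

```python
from bisect import bisect_right


def quantile_summary(values, num_quantiles=10):
    """Exact stand-in for an approximate summary: evenly spaced order statistics."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / num_quantiles)] for i in range(num_quantiles + 1)]


def empirical_cdf(ordered, x):
    """Fraction of observations less than or equal to x (input must be sorted)."""
    return bisect_right(ordered, x) / len(ordered)


def ks_statistic(train_values, test_values, num_quantiles=10):
    """Largest CDF gap, checked only at the summary points of both data sets."""
    train_sorted = sorted(train_values)
    test_sorted = sorted(test_values)
    points = (quantile_summary(train_values, num_quantiles)
              + quantile_summary(test_values, num_quantiles))
    return max(
        abs(empirical_cdf(train_sorted, x) - empirical_cdf(test_sorted, x))
        for x in points
    )


train = list(range(100))
similar_test = list(range(0, 100, 5))                  # same range, fewer records
shifted_test = [v + 50 for v in range(0, 100, 5)]      # target values shifted upward

ks_similar = ks_statistic(train, similar_test)
ks_shifted = ks_statistic(train, shifted_test)
print(ks_similar < ks_shifted)
```

Checking only the summary points keeps the comparison cheap for large data sets, at the cost of possibly missing the exact location of the largest gap.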

16. A non-transitory computer-accessible storage medium storing program instructions that when executed on one or more processors:
- identify (a) a proposed training data set for a machine learning model and (b) a proposed test data set for the machine learning model wherein values of a target variable are to be predicted by the machine learning model, and wherein the target variable is an output variable of the machine learning model;
  
  determine an acceptance criterion for evaluating the machine learning model;
  
  determine, based at least in part on a metric indicative of a difference between (a) a first statistical distribution of the target variable within the proposed training data set and (b) a second statistical distribution of the target variable within the proposed test data set, that the proposed test data set meets the acceptance criterion for the machine learning model; and
  
  provide, via a programmatic interface, an indication of approval of the proposed test data set for an evaluation of the machine learning model.
- Dependent claims (17, 18, 19, 20, 21):
  - 17. The non-transitory computer-accessible storage medium as recited in claim 16, wherein the target variable comprises a categorical variable, wherein the instructions when executed on the one or more processors:
    - construct a histogram comprising a plurality of buckets, wherein a particular bucket of the plurality of buckets corresponds to a particular value of the categorical variable;
      
      and wherein the metric indicative of the difference between the first statistical distribution and the second statistical distribution comprises a variant of a Kullback-Leibler divergence.
  - 18. The non-transitory computer-accessible storage medium as recited in claim 16, wherein the target variable comprises a continuous variable, wherein the instructions when executed on the one or more processors:
    - generate an approximate quantile summary of the observed values of the target variable in the proposed test data set;
      
      and wherein the metric indicative of the difference between the first statistical distribution and the second statistical distribution comprises a variant of a Kolmogorov-Smirnov statistic.
  - 19. The non-transitory computer-accessible storage medium as recited in claim 16, wherein the instructions when executed on the one or more processors:
    - transmit, via a graphical programmatic interface to a client, a representation of the difference between the first statistical distribution and the second statistical distribution.
  - 20. The non-transitory computer-accessible storage medium as recited in claim 16, wherein the instructions when executed on the one or more processors:
    - select, from a plurality of algorithms, based at least in part on a size of a second proposed test data set for a second machine learning model, a particular algorithm to be used to verify acceptability of the second proposed test data set.
  - 21. The non-transitory computer-accessible storage medium as recited in claim 16, wherein the instructions when executed on the one or more processors:
    - obtain (a) a third statistical distribution of a non-target variable within the proposed training data set and (b) a fourth statistical distribution of the non-target variable within the proposed test data set; and
      
      provide, to a client, an indication of one or more of:
      
      the third statistical distribution or the fourth statistical distribution.
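
Claim 20 recites choosing a verification algorithm based at least in part on the size of the proposed test data set. One way this could be sketched is a simple size cut-off that switches from examining every record to examining a random sample. The strategy names, the 10,000-record cut-off, the sample size, and the fixed seed are all hypothetical; the claim does not state which algorithms are available or how the size is used.

```python
import random


def select_algorithm(test_set_size, exact_limit=10_000):
    """Choose a comparison strategy from the test set size (cut-off is hypothetical)."""
    return "exact" if test_set_size <= exact_limit else "sampled"


def values_to_examine(values, algorithm, sample_size=1_000, seed=0):
    """Return every value for the exact strategy, or a random sample otherwise."""
    if algorithm == "exact":
        return list(values)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(list(values), min(sample_size, len(values)))


small = list(range(500))
large = list(range(50_000))

small_choice = select_algorithm(len(small))
large_choice = select_algorithm(len(large))
examined_small = values_to_examine(small, small_choice)
examined_large = values_to_examine(large, large_choice)
print(small_choice, len(examined_small))
print(large_choice, len(examined_large))
```

The examined values would then be passed to whichever distribution comparison is appropriate for the target variable's type.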

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Zarandioon, Saman; Steele, Robert Matthias
Primary Examiner(s): Cassity, Robert A
Assistant Examiner(s): Metwalli, Nader
Application Number: US15/225,545
Time in Patent Office: 1,457 Days
CPC Class Codes:
G06N 20/00 Machine learning
G06N 7/01 Probabilistic graphical mod...
