Concurrent binning of machine learning data

US 9,672,474 B2
Filed: 09/17/2014
Issued: 06/06/2017
Est. Priority Date: 06/30/2014
Status: Active Grant

Abstract

Variables of observation records to be used to generate a machine learning model are identified as candidates for quantile binning transformations. In accordance with a particular concurrent binning plan generated for a particular variable, a plurality of quantile binning transformations are applied to the particular variable, including a first transformation with a first bin count and a second transformation with a different bin count. The first and second transformations result in the inclusion of respective parameters or weights for binned features in a parameter vector of the model. In a post-training phase run of the model, at least one parameter corresponding to a binned feature is used to generate a prediction.
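
As a rough illustration of the concurrent binning idea described in the abstract, the sketch below applies two quantile binning transformations with different bin counts to the same variable. The function names, the sample values, and the bin counts 4 and 10 are assumptions made for this example; the patent does not prescribe this implementation.

```python
from bisect import bisect_right


def quantile_boundaries(values, bin_count):
    """Return bin_count - 1 cut points that split the values into
    bins of roughly equal population."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bin_count] for i in range(1, bin_count)]


def concurrent_bin(values, bin_counts):
    """Apply one quantile binning per bin count to the same variable.

    Returns a dict mapping each bin count to the list of bin indexes
    assigned to the input values (hypothetical output layout).
    """
    result = {}
    for count in bin_counts:
        cuts = quantile_boundaries(values, count)
        # bisect_right gives the number of cut points <= value,
        # which serves as the zero-based bin index.
        result[count] = [bisect_right(cuts, v) for v in values]
    return result


# Hypothetical variable with 100 distinct observations.
values = list(range(1, 101))
binned = concurrent_bin(values, bin_counts=(4, 10))
```

Each observation ends up with one bin index per transformation, so both a coarse and a fine binned feature are available to the model during training.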

20 Claims

1. A system, comprising:
- one or more computing devices configured to:
  
  receive, at a machine learning service of a provider network, an indication of a data source comprising observation records to be used to generate a model;
  
  identify one or more variables of the observation records as candidates for quantile binning transformations;
  
  determine a particular concurrent binning plan for at least a particular variable of the one or more variables, wherein, in accordance with the particular concurrent binning plan, a plurality of quantile binning transformations are applied to the particular variable during a training phase of the model, wherein the plurality of quantile binning transformations include a first quantile binning transformation with a first bin count and a second quantile binning transformation with a different bin count;
  
  generate, during the training phase, a parameter vector comprising respective initial weight values corresponding to a plurality of binned features obtained as a result of an implementation of the particular concurrent binning plan, including a first binned feature obtained using the first quantile binning transformation and a second binned feature obtained using the second quantile binning transformation;
  
  reduce, during the training phase, at least one weight value corresponding to a particular binned feature of the plurality of binned features in accordance with a selected optimization strategy; and
  
  obtain, during a post-training-phase prediction run of the model, a particular prediction using at least one of: the first binned feature or the second binned feature.
- Dependent claims (2, 3, 4, 5)
  - 2. The system as recited in claim 1, wherein the one or more variables identified as candidates comprise a plurality of variables, wherein the one or more computing devices are further configured to:
    - in accordance with a second concurrent binning plan for a group of variables of the plurality of variables, wherein the group includes a first variable and a second variable, apply a first multi-variable quantile binning transformation to at least the first variable and the second variable, wherein in accordance with the first multi-variable quantile binning transformation, a particular observation record is placed in a first bin based at least in part on a first combination of bin counts selected for the first and second variables; and
      
      apply a second multi-variable quantile binning transformation to at least the first variable and the second variable, wherein in accordance with the second multi-variable quantile binning transformation, the particular observation record is placed in a second bin based at least in part on a different combination of bin counts selected for the first and second variables.
  - 3. The system as recited in claim 1, wherein the selected optimization strategy comprises regularization.
  - 4. The system as recited in claim 1, wherein the one or more computing devices are further configured to:
    - select a particular binned feature for removal from the parameter vector based at least in part on an estimate of a quantile boundary for weights assigned to a plurality of features of the model, wherein the estimate is obtained without sorting the weights.
  - 5. The system as recited in claim 1, wherein the one or more computing devices are further configured to:
    - store, in an artifact repository of the machine learning service, a particular recipe formatted in accordance with a recipe language for feature transformations implemented at the machine learning service, wherein the particular recipe comprises an indication of the first quantile binning transformation and an indication of the second quantile binning transformation.
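
The weight reduction in claim 1 and the regularization named in claim 3 can be pictured with a small sketch. It assumes L1-style soft-thresholding as the regularization step and a fixed magnitude threshold for removal; both choices, along with the feature names and weight values, are hypothetical and are not taken from the patent.

```python
def l1_shrink(weights, penalty):
    """Soft-thresholding: move each weight toward zero by `penalty`,
    clamping at zero so the sign never flips."""
    shrunk = {}
    for name, w in weights.items():
        magnitude = max(abs(w) - penalty, 0.0)
        shrunk[name] = magnitude if w >= 0 else -magnitude
    return shrunk


def prune(weights, threshold):
    """Keep only the features whose weight magnitude is at least `threshold`."""
    return {name: w for name, w in weights.items() if abs(w) >= threshold}


# Hypothetical parameter vector: one weight per binned feature.
weights = {
    "x_q10_bin3": 0.8,
    "x_q10_bin7": -0.4,
    "x_q100_bin37": 0.05,
}
shrunk = l1_shrink(weights, penalty=0.1)
kept = prune(shrunk, threshold=0.01)
```

In this sketch the fine-grained feature `x_q100_bin37` is driven to zero by the penalty and then dropped, while the two coarser features survive with reduced weights.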

6. A method, comprising:
- performing, by one or more computing devices:
  
  implementing a respective concurrent binning plan for one or more variables of observation records to be used to generate a machine learning model, wherein, in accordance with a particular concurrent binning plan, a plurality of quantile binning transformations are applied to at least a particular variable of the one or more variables, wherein the plurality of quantile binning transformations include a first quantile binning transformation with a first bin count and a second quantile binning transformation with a different bin count;
  
  determining respective parameter values associated with a plurality of binned features, including a first binned feature obtained using the first quantile binning transformation and a second binned feature obtained using the second quantile binning transformation; and
  
  generating, during a post-training-phase prediction run of the machine learning model, a particular prediction using a parameter value corresponding to at least one of: the first binned feature or the second binned feature.
- Dependent claims (7, 8, 9, 10, 11, 12, 13, 14, 15)
  - 7. The method as recited in claim 6, further comprising performing, by the one or more computing devices:
    - in accordance with a second concurrent binning plan generated for a group of variables of the observation records, wherein the group includes a first variable and a second variable, applying a first multi-variable quantile binning transformation to at least the first variable and the second variable, wherein in accordance with the first multi-variable quantile binning transformation, a particular observation record is placed in a first bin based at least in part on a first combination of bin counts selected for the first and second variables; and
      
      applying a second multi-variable quantile binning transformation to at least the first variable and the second variable, wherein in accordance with the second multi-variable quantile binning transformation, the particular observation record is placed in a second bin based at least in part on a different combination of bin counts selected for the first and second variables.
  - 8. The method as recited in claim 6, further comprising performing, by the one or more computing devices:
    - generating a k-dimensional tree (k-d tree) representation of at least a subset of the observation records, based at least in part on respective values of a selected group of variables of the observation records; and
      
      determining one or more attributes of a concurrent quantile binning transformation to be applied to at least one variable of the one or more variables, based at least in part on an analysis of the k-dimensional tree.
  - 9. The method as recited in claim 6, further comprising performing, by the one or more computing devices:
    - removing, subsequent to said determining the respective parameter values and prior to said post-training-phase prediction run, a parameter corresponding to at least one binned feature from a parameter vector generated for the machine learning model.
  - 10. The method as recited in claim 9, wherein the parameter vector comprises a respective weight corresponding to one or more individual features of a plurality of features identified for the machine learning model, further comprising performing, by the one or more computing devices:
    - utilizing regularization to adjust a value of a particular weight assigned to a particular binned feature; and
      
      selecting the particular binned feature as a pruning target whose weight is to be removed from the parameter vector based at least in part on a determination that an adjusted value of the particular weight is below a threshold.
  - 11. The method as recited in claim 9, further comprising performing, by the one or more computing devices:
    - selecting a particular binned feature as a pruning target whose weight is to be removed from the parameter vector based at least in part on determining an estimate of a quantile boundary for weights included in the parameter vector, wherein said determining the estimate is performed without sorting the weights.
  - 12. The method as recited in claim 6, further comprising performing, by the one or more computing devices:
    - determining at least one of: (a) the first bin count or (b) the different bin count based at least in part on a problem domain of the machine learning model.
  - 13. The method as recited in claim 6, wherein said implementing the respective concurrent binning plan is performed in response to receiving a model generation request via a programmatic interface of a machine learning service implemented at a provider network.
  - 14. The method as recited in claim 6, further comprising performing, by the one or more computing devices:
    - storing, in an artifact repository of a machine learning service implemented at a provider network, a particular recipe formatted in accordance with a recipe language implemented at the machine learning service, wherein the particular recipe comprises an indication of the first quantile binning transformation and an indication of the second quantile binning transformation.
  - 15. The method as recited in claim 6, wherein the machine learning model comprises one or more of:
    - a supervised learning model, or an unsupervised learning model.
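
Claim 11 refers to estimating a quantile boundary for the weights without sorting them, but the claim text does not say how the estimate is made. One possible approach, shown purely as an assumption, is a single-pass histogram over weight magnitudes; the function name, the bucket count, and the sample weights are all hypothetical.

```python
def estimate_quantile_boundary(weights, fraction, buckets=1000):
    """Estimate the magnitude below which roughly `fraction` of the
    weights fall, using a fixed-width histogram instead of a sort."""
    magnitudes = [abs(w) for w in weights]
    top = max(magnitudes)
    if top == 0.0:
        return 0.0
    counts = [0] * buckets
    for m in magnitudes:
        # Map the magnitude to a bucket; the largest value goes in the last one.
        idx = min(int(m / top * buckets), buckets - 1)
        counts[idx] += 1
    target = fraction * len(magnitudes)
    seen = 0
    for idx, count in enumerate(counts):
        seen += count
        if seen >= target:
            # Upper edge of the bucket where the running count reaches the target.
            return (idx + 1) * top / buckets
    return top


# Hypothetical weights 0.01, 0.02, ..., 1.00.
weights = [i / 100 for i in range(1, 101)]
boundary = estimate_quantile_boundary(weights, fraction=0.2)
# Weights whose magnitude is below the estimated boundary are pruning candidates.
pruned = [w for w in weights if abs(w) >= boundary]
```

The estimate is only as precise as the bucket width, which is the trade-off for avoiding a full sort of a potentially very large parameter vector.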

16. A non-transitory computer-accessible storage medium storing program instructions that when executed on one or more processors implements a model generator of a machine learning service, wherein the model generator is configured to:
- identify one or more variables of observation records to be used to generate a machine learning model as candidates for quantile binning transformations;
  
  determine a respective concurrent binning plan for the one or more variables, wherein, in accordance with a particular concurrent binning plan for at least a particular variable, a plurality of quantile binning transformations are applied to the particular variable, wherein the plurality of quantile binning transformations include a first quantile binning transformation with a first bin count and a second quantile binning transformation with a different bin count; and
  
  include, within a parameter vector of the machine learning model, respective parameters for a plurality of binned features, including a first parameter for a first binned feature obtained from the first quantile binning transformation and a second parameter for a second binned feature obtained from the first quantile binning feature, wherein at least one binned feature of the first and second binned features is used to generate a prediction in a post-training-phase execution of the machine learning model.
- Dependent claims (17, 18, 19, 20)
  - 17. The non-transitory computer-accessible storage medium as recited in claim 16, wherein the model generator is further configured to:
    - in accordance with a second concurrent binning plan for a group of variables of the observation records, wherein the group includes a first variable and a second variable, apply a first multi-variable quantile binning transformation to at least the first variable and the second variable, wherein in accordance with the first multi-variable quantile binning transformation, a particular observation record is placed in a first bin based at least in part on a first combination of bin counts selected for the first and second variables; and
      
      apply a second multi-variable quantile binning transformation to at least the first variable and the second variable, wherein in accordance with the second multi-variable quantile binning transformation, the particular observation record is placed in a second bin based at least in part on a different combination of bin counts selected for the first and second variables.
  - 18. The non-transitory computer-accessible storage medium as recited in claim 16, wherein the model generator is further configured to:
    - adjust a value of a particular weight assigned to the first binned feature; and
      
      select the first binned feature for removal from the parameter vector based at least in part on a determination that an adjusted value of the particular weight is below a threshold.
  - 19. The non-transitory computer-accessible storage medium as recited in claim 16, wherein the model generator is further configured to:
    - select the first binned feature for removal from the parameter vector based at least in part on an estimate of a quantile boundary for weights assigned to a plurality of features identified for the machine learning model, wherein the estimate is obtained without sorting the weights.
  - 20. The non-transitory computer-accessible storage medium as recited in claim 16, wherein the machine learning model comprises a generalized linear model.
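
To make the post-training prediction step concrete for the generalized linear model mentioned in claim 20, the sketch below scores one observation with a logistic model whose parameter vector is keyed by binned feature name. The naming scheme, cut points, weights, and the choice of a logistic link are assumptions for illustration only.

```python
import math
from bisect import bisect_right


def active_features(value, cuts_by_count, variable="x"):
    """Name the binned feature that `value` activates under each
    quantile binning transformation (hypothetical naming scheme)."""
    return [
        f"{variable}_q{count}_bin{bisect_right(cuts, value)}"
        for count, cuts in cuts_by_count.items()
    ]


def predict(value, cuts_by_count, weights, bias=0.0):
    """Logistic prediction; features absent from `weights` (for example,
    pruned after training) contribute nothing to the score."""
    names = active_features(value, cuts_by_count)
    score = bias + sum(weights.get(name, 0.0) for name in names)
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical cut points learned during training for bin counts 2 and 4.
cuts_by_count = {2: [50], 4: [25, 50, 75]}
# Hypothetical parameter vector after pruning: only one binned feature remains.
weights = {"x_q2_bin1": 1.0}

p_high = predict(60, cuts_by_count, weights)
p_low = predict(10, cuts_by_count, weights)
```

Here the value 60 activates both `x_q2_bin1` and `x_q4_bin2`, but only the first still has a weight, so the prediction uses at least one of the binned features as the claims describe.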

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Dirac, Leo Parker; Brueckner, Michael; Herbrich, Ralf
Primary Examiner(s): Smith, Paulinho E
Application Number: US14/489,449
Publication Number: US 20150379428A1
Time in Patent Office: 993 Days
Field of Search: None
US Class Current:
CPC Class Codes: G06N 20/00 Machine learning
