Consistent randomized record-level splitting of machine learning data

  • US 10,366,053 B1
  • Filed: 11/24/2015
  • Issued: 07/30/2019
  • Est. Priority Date: 11/24/2015
  • Status: Active Grant
First Claim

1. A system, comprising:

  • one or more computing devices of a machine learning service of a provider network, wherein the one or more computing devices are configured to:

    receive a request via a programmatic interface to generate, corresponding to a data set comprising a plurality of files collectively containing a plurality of observation records, one or more split subsets using a record-level splitting strategy;

    assign a respective ordinal number to individual ones of the plurality of files;

    generate, corresponding to a particular observation record of the plurality of observation records, wherein the particular observation record is stored in a particular file of the plurality of files, a pseudo-random value based at least in part on (a) the ordinal number assigned to the particular file, (b) an offset of the particular observation record within the file, and (c) a seed associated with the data set;

    map the pseudo-random value to a numeric value within a target range of numeric values associated with the request;

    assign, based at least in part on the numeric value, the particular observation record to a first split subset of the one or more split subsets; and

    transmit, to a destination associated with the first split subset, an indication of the assignment of the particular observation record to the first split subset, wherein the indication of assignment is used by the machine learning service to access the first split subset.
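
The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the use of SHA-256 as the pseudo-random source, the sorted-name ordering used to assign ordinals, and the subset boundaries are all assumptions made for illustration. The point it shows is that, because each value depends only on the file ordinal, the record offset, and the seed, the same record lands in the same subset on every run, regardless of the order in which files are read.

```python
import hashlib


def assign_ordinals(file_names):
    """Assign a stable ordinal to each file.

    Assumption: ordinals follow sorted file-name order.
    """
    return {name: i for i, name in enumerate(sorted(file_names))}


def pseudo_random_value(ordinal, offset, seed):
    """Derive a repeatable 64-bit value from (ordinal, offset, seed).

    Assumption: a SHA-256 digest stands in for the pseudo-random generator.
    """
    key = f"{seed}:{ordinal}:{offset}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def map_to_range(value, low, high):
    """Map a value into the half-open target range [low, high).

    The modulo introduces a negligible bias for small ranges.
    """
    return low + value % (high - low)


def assign_subset(numeric, boundaries):
    """Return the first subset whose exclusive upper bound exceeds numeric."""
    for name, upper in boundaries:
        if numeric < upper:
            return name
    raise ValueError(f"{numeric} is outside all subset boundaries")


def split_records(files, seed, boundaries, low=0, high=100):
    """Assign every record to a subset.

    files maps a file name to the offsets of the records it contains.
    Returns {(file_name, offset): subset_name}.
    """
    ordinals = assign_ordinals(files)
    assignment = {}
    for name, offsets in files.items():
        for offset in offsets:
            value = pseudo_random_value(ordinals[name], offset, seed)
            numeric = map_to_range(value, low, high)
            assignment[(name, offset)] = assign_subset(numeric, boundaries)
    return assignment


if __name__ == "__main__":
    # Hypothetical data set: two files with record byte offsets.
    files = {"b.csv": [0, 57, 130], "a.csv": [0, 42]}
    # Hypothetical 80/20 split over the target range [0, 100).
    boundaries = [("train", 80), ("test", 100)]
    for record, subset in sorted(split_records(files, "seed-1", boundaries).items()):
        print(record, subset)
```

The sketch stops at producing the assignment. Transmitting an indication of each assignment to a destination, as the final claim element recites, is left out.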
