Consistent randomized record-level splitting of machine learning data
First Claim
1. A system, comprising:
- one or more computing devices of a machine learning service of a provider network, wherein the one or more computing devices are configured to:
receive a request via a programmatic interface to generate, corresponding to a data set comprising a plurality of files collectively containing a plurality of observation records, one or more split subsets using a record-level splitting strategy;
assign a respective ordinal number to individual ones of the plurality of files;
generate, corresponding to a particular observation record of the plurality of observation records, wherein the particular observation record is stored in a particular file of the plurality of files, a pseudo-random value based at least in part on (a) the ordinal number assigned to the particular file, (b) an offset of the particular observation record within the file, and (c) a seed associated with the data set;
map the pseudo-random value to a numeric value within a target range of numeric values associated with the request;
assign, based at least in part on the numeric value, the particular observation record to a first split subset of the one or more split subsets; and
transmit, to a destination associated with the first split subset, an indication of the assignment of the particular observation record to the first split subset, wherein the indication of assignment is used by the machine learning service to access the first split subset.
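The steps recited in this claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not name a pseudo-random function, so SHA-256 is an assumed stand-in, and the function names, the 0 to 99 target range, the 80/20 boundary, and the file names are all hypothetical.

```python
import hashlib


def record_value(file_ordinal: int, offset: int, seed: str) -> int:
    """Derive a repeatable pseudo-random integer for one record.

    Assumption: SHA-256 over (seed, file ordinal, record offset) stands in
    for the unspecified pseudo-random function of the claim.
    """
    material = f"{seed}:{file_ordinal}:{offset}".encode("utf-8")
    digest = hashlib.sha256(material).digest()
    return int.from_bytes(digest[:8], "big")


def map_to_range(value: int, range_size: int = 100) -> int:
    """Map the pseudo-random value into the target range 0..range_size-1."""
    return value % range_size


def assign_subset(file_ordinal: int, offset: int, seed: str,
                  train_upper: int = 80) -> str:
    """Assign a record to a subset from its mapped numeric value."""
    slot = map_to_range(record_value(file_ordinal, offset, seed))
    return "train" if slot < train_upper else "test"


# Ordinals are assigned from a sorted listing so they do not depend on
# the order in which the files happen to be discovered.
file_names = ["part-b.csv", "part-a.csv"]
ordinals = {name: i for i, name in enumerate(sorted(file_names))}

first = assign_subset(ordinals["part-a.csv"], 0, "demo-seed")
again = assign_subset(ordinals["part-a.csv"], 0, "demo-seed")
print(first == again)  # the same inputs always give the same assignment
```

Because the value depends only on the file ordinal, the record offset, and the seed, any worker can compute a record's subset on its own, and repeating the split with the same seed reproduces it.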
Abstract
A request to split a data set comprising observation records located in a group of storage objects is received. With respect to a particular observation record, a token is generated based on an identifier of the record's storage object and a key value of the record. A numeric value is calculated using the token, and the observation record is assigned to a split subset using the numeric value. An indication of the assignment is provided to a destination associated with the split subset.
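As a rough illustration of the token-based approach summarized in the abstract, the sketch below builds a token from a storage object identifier and a record key, turns it into a numeric value in [0, 1), and shows that the assignment does not depend on the order in which records are visited. The hash function, the separator character, the 0.8 threshold, and all names are assumptions made for the example, not details taken from the patent.

```python
import hashlib
import random


def make_token(storage_object_id: str, record_key: str) -> str:
    """Combine the storage object identifier and the record key.

    The unit-separator character is an arbitrary choice that keeps
    ("ab", "c") and ("a", "bc") from producing the same token.
    """
    return f"{storage_object_id}\x1f{record_key}"


def numeric_value(token: str) -> float:
    """Turn a token into a number in [0, 1) using an assumed hash."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    top53 = int.from_bytes(digest[:8], "big") >> 11  # keep 53 bits
    return top53 / 2**53  # exact in a float and always below 1.0


def subset_for(storage_object_id: str, record_key: str,
               train_fraction: float = 0.8) -> str:
    """Pick a subset from the numeric value derived from the token."""
    value = numeric_value(make_token(storage_object_id, record_key))
    return "train" if value < train_fraction else "test"


records = [("bucket/obj-1", f"row-{i}") for i in range(5)]
in_order = {r: subset_for(*r) for r in records}

shuffled = records[:]
random.shuffle(shuffled)
reordered = {r: subset_for(*r) for r in shuffled}

print(in_order == reordered)  # visiting order does not change the result
```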
48 Citations
21 Claims
-
1. A system, comprising:
one or more computing devices of a machine learning service of a provider network, wherein the one or more computing devices are configured to:
receive a request via a programmatic interface to generate, corresponding to a data set comprising a plurality of files collectively containing a plurality of observation records, one or more split subsets using a record-level splitting strategy;
assign a respective ordinal number to individual ones of the plurality of files;
generate, corresponding to a particular observation record of the plurality of observation records, wherein the particular observation record is stored in a particular file of the plurality of files, a pseudo-random value based at least in part on (a) the ordinal number assigned to the particular file, (b) an offset of the particular observation record within the file, and (c) a seed associated with the data set;
map the pseudo-random value to a numeric value within a target range of numeric values associated with the request;
assign, based at least in part on the numeric value, the particular observation record to a first split subset of the one or more split subsets; and
transmit, to a destination associated with the first split subset, an indication of the assignment of the particular observation record to the first split subset, wherein the indication of assignment is used by the machine learning service to access the first split subset.
Dependent claims: 2, 3, 4, 5
-
6. A method, comprising:
performing, by one or more computing devices of a machine learning service:
determining that a data set comprising a plurality of observation records is to be split using a record-level splitting algorithm in accordance with a split specification, wherein the plurality of observation records is collectively stored in one or more storage objects;
generating, with respect to a particular observation record of the plurality of observation records, wherein the particular observation record is stored in a particular storage object of the one or more storage objects, a token based at least in part on one or more of: (a) an identifier of the particular storage object or (b) a key value corresponding to the particular observation record, wherein the key value corresponding to the particular observation record differs from respective key values of one or more other observation records stored in the particular storage object;
assigning, based at least in part on a particular numeric value calculated using the token, the particular observation record to a first split subset indicated in the split specification; and
providing an indication, to a destination indicated in the split specification, of the assignment of the particular observation record to the first split subset, wherein the indication of assignment is used by the machine learning service to access the first split subset.
Dependent claims: 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
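One way to picture how a split specification could drive the method is sketched below in Python. The layout of the specification (a list of slot ranges, each with a destination), the destination names, the SHA-256 hash, and the use of record references as the "indication" of assignment are all hypothetical choices for illustration; the claim does not prescribe any of them.

```python
import hashlib
from collections import defaultdict

# Hypothetical split specification: each subset owns a half-open range of
# slots out of 100 and names the destination that is told about its records.
SPLIT_SPEC = [
    {"subset": "train", "low": 0, "high": 70, "destination": "dest-train"},
    {"subset": "validate", "low": 70, "high": 85, "destination": "dest-validate"},
    {"subset": "test", "low": 85, "high": 100, "destination": "dest-test"},
]


def slot_for(token: str, slots: int = 100) -> int:
    """Numeric value calculated from the token (assumed SHA-256 hash)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % slots


def split(records, spec):
    """Group record references by the destination of their subset.

    Each record is an (object_id, key) pair; the reference stands in for
    the indication of assignment, so no record data is copied.
    """
    indications = defaultdict(list)
    for object_id, key in records:
        slot = slot_for(f"{object_id}/{key}")
        for entry in spec:
            if entry["low"] <= slot < entry["high"]:
                indications[entry["destination"]].append((object_id, key))
                break
    return dict(indications)


records = [("obj-1", str(k)) for k in range(1000)]
result = split(records, SPLIT_SPEC)
for destination, refs in sorted(result.items()):
    print(destination, len(refs))
```

Since the ranges in the hypothetical specification cover every slot exactly once, each record lands in exactly one subset, and running the split again on the same records yields the same grouping.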
-
17. A non-transitory computer-accessible storage medium storing program instructions that when executed on one or more processors of a machine learning service:
generate, with respect to a particular observation record of a plurality of observation records of a data set, wherein the data set is stored in one or more storage objects, wherein the particular observation record is stored in a particular storage object of the one or more storage objects, a token based at least in part on one or more of: (a) an identifier of the particular storage object or (b) a key value corresponding to the particular observation record, wherein the key value corresponding to the particular observation record differs from respective key values of one or more other observation records stored in the particular storage object;
assign, based at least in part on a particular numeric value calculated using the token, the particular observation record to a first split subset indicated in a split specification associated with the data set; and
provide an indication, to a destination corresponding to the first split subset, of the assignment of the particular observation record to the first split subset, wherein the indication of assignment is used by the machine learning service to access the first split subset.
Dependent claims: 18, 19, 20, 21
-
Specification