Consistent randomized record-level splitting of machine learning data

US 10,366,053 B1
Filed: 11/24/2015
Issued: 07/30/2019
Est. Priority Date: 11/24/2015
Status: Active Grant

Abstract

A request to split a data set comprising observation records located in a group of storage objects is received. With respect to a particular observation record, a token is generated based on an identifier of the record's storage object and a key value of the record. A numeric value is calculated using the token, and the observation record is assigned to a split subset using the numeric value. An indication of the assignment is provided to a destination associated with the split subset.

48 Citations

21 Claims

1. A system, comprising:
- one or more computing devices of a machine learning service of a provider network, wherein the one or more computing devices are configured to:
  
  receive a request via a programmatic interface to generate, corresponding to a data set comprising a plurality of files collectively containing a plurality of observation records, one or more split subsets using a record-level splitting strategy;
  
  assign a respective ordinal number to individual ones of the plurality of files;
  
  generate, corresponding to a particular observation record of the plurality of observation records, wherein the particular observation record is stored in a particular file of the plurality of files, a pseudo-random value based at least in part on (a) the ordinal number assigned to the particular file, (b) an offset of the particular observation record within the file, and (c) a seed associated with the data set;
  
  map the pseudo-random value to a numeric value within a target range of numeric values associated with the request;
  
  assign, based at least in part on the numeric value, the particular observation record to a first split subset of the one or more split subsets; and
  
  transmit, to a destination associated with the first split subset, an indication of the assignment of the particular observation record to the first split subset, wherein the indication of assignment is used by the machine learning service to access the first split subset.
- Dependent claims (2, 3, 4, 5)
  - 2. The system as recited in claim 1, wherein to generate the pseudo-random value, the one or more computing devices are configured to:
    - apply one or more hash functions to a string obtained by concatenating the ordinal number, the offset and the seed.
  - 3. The system as recited in claim 1, wherein the particular file includes a set of observation records comprising the particular observation record, wherein the one or more computing devices are configured to:
    - assign, in a single sequential pass of analysis through the particular file, wherein the single sequential pass does not include a random access to an offset within the particular file, individual ones of the set of observation records to the one or more split subsets.
  - 4. The system as recited in claim 1, wherein at least one observation record is added to the data set after the particular observation record has been assigned to the first split subset.
  - 5. The system as recited in claim 1, wherein the one or more computing devices include a first execution platform of the machine learning service, a second execution platform of the machine learning service and a control-plane component of the machine learning service, wherein the control-plane component is configured to:
    - assign (a) a first task to split observation records of at least a portion of the particular file to the first execution platform and (b) a second task to split observation records of at least a portion of a different file of the plurality of files to the second execution platform.
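
The steps recited in claims 1 to 3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `record_value` and `split_records`, the `:` delimiter in the concatenated string, the choice of MD5, the target range of 0 to 99, and the 70/30 boundary are all assumptions made for illustration. File ordinals are taken from the sorted file names, and each file is read once, front to back, with the byte offset tracked as records are consumed.

```python
import hashlib


def record_value(file_ordinal, offset, seed, target_range=100):
    """Map (file ordinal, record offset, seed) to an integer in [0, target_range)."""
    # Assumed concatenation format; claim 2 only says the three parts are concatenated.
    token = f"{file_ordinal}:{offset}:{seed}"
    digest = hashlib.md5(token.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as the pseudo-random value.
    return int.from_bytes(digest[:8], "big") % target_range


def split_records(files, seed, train_below=70):
    """Assign every record to 'train' or 'test' in one sequential pass per file.

    `files` maps a file name to a list of raw records (bytes), in file order.
    Returns {(file_name, byte_offset): subset_name}.
    """
    assignments = {}
    # Assumption: ordinals follow the sorted order of the file names.
    for ordinal, name in enumerate(sorted(files)):
        offset = 0
        for record in files[name]:
            value = record_value(ordinal, offset, seed)
            subset = "train" if value < train_below else "test"
            assignments[(name, offset)] = subset
            offset += len(record)  # no seeking back; offsets accumulate
    return assignments


if __name__ == "__main__":
    demo = {"b.csv": [b"x,1\n", b"yy,2\n"], "a.csv": [b"z,3\n"]}
    print(split_records(demo, seed="demo-seed"))
```

Because the value depends only on the ordinal, the offset, and the seed, rerunning the split with the same seed reproduces the same assignments, and appending records to a file does not change the subsets of records already assigned.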

6. A method, comprising:
- performing, by one or more computing devices of a machine learning service:
  
  determining that a data set comprising a plurality of observation records is to be split using a record-level splitting algorithm in accordance with a split specification, wherein the plurality of observation records is collectively stored in one or more storage objects;
  
  generating, with respect to a particular observation record of the plurality of observation records, wherein the particular observation record is stored in a particular storage object of the one or more storage objects, a token based at least in part on one or more of:
  
  (a) an identifier of the particular storage object or (b) a key value corresponding to the particular observation record, wherein the key value corresponding to the particular observation record differs from respective key values of one or more other observation records stored in the particular storage object;
  
  assigning, based at least in part on a particular numeric value calculated using the token, the particular observation record to a first split subset indicated in the split specification; and
  
  providing an indication, to a destination indicated in the split specification, of the assignment of the particular observation record to the first split subset, wherein the indication of assignment is used by the machine learning service to access the first split subset.
- Dependent claims (7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
  - 7. The method as recited in claim 6, wherein the one or more storage objects include a plurality of files located within one or more directories, and wherein the particular storage object is a particular file of the plurality of files, further comprising performing, by one or more computing devices:
    - generating, based at least in part on a lexicographic ordering of respective names of the plurality of files, the identifier of the particular file.
  - 8. The method as recited in claim 6, wherein the key value corresponding to the particular observation record is based at least in part on an offset of the particular observation record within the particular storage object.
  - 9. The method as recited in claim 6, wherein the one or more storage objects include a database table.
  - 10. The method as recited in claim 6, wherein a size of the particular observation record differs from a size of another observation record of the plurality of observation records.
  - 11. The method as recited in claim 6, wherein said generating the token comprises:
    - determining an order in which a plurality of token-contributor elements are to be concatenated, wherein the plurality of token-contributor elements include (a) the identifier of the storage object and (b) the key value;
      
      concatenating the plurality of token-contributor elements; and
      
      applying one or more hash functions to a result of said concatenating.
  - 12. The method as recited in claim 11, wherein the one or more hash functions include one or more of:
    - (a) a Murmur hash function, (b) a Fowler-Noll-Vo (FNV) hash function, (c) a Jenkins hash function, (d) a CityHash function, (e) a function based on a version of the Secure Hash Algorithm (SHA), or (f) an MD5 (Message Digest 5) function.
  - 13. The method as recited in claim 6, further comprising performing, by one or more computing devices:
    - identifying a seed value based at least in part on one or more of (a) contents of the split specification, (b) an identity of a client on whose behalf the data set is being split, or (c) a timestamp;
      
      wherein said generating the token comprises utilizing the seed value.
  - 14. The method as recited in claim 6, wherein the first split subset comprises a test set to be used for a particular cross-validation run of a machine learning model corresponding to the data set, wherein the split specification indicates a starting boundary and an ending boundary for the first split subset, wherein the starting boundary is a first numeric value within a selected range of numeric values, wherein the ending boundary is a second numeric value within the selected range, and wherein assigning, based at least in part on the particular numeric value calculated using the token, the particular observation record to the first split subset comprises:
    - determining that the particular numeric value is (a) greater than or equal to the first numeric value and (b) less than the second numeric value.
  - 15. The method as recited in claim 14, wherein the split specification indicates a first setting for a complement element, further comprising performing, by the one or more computing devices:
    - determining, based at least in part on examining a second split specification, that a second split subset of the data set is to be generated, wherein the second split subset comprises the training set of the particular cross-validation run, and wherein the second split specification indicates (a) a second setting for the complement element, (b) the starting boundary, and (c) the ending boundary; and
      
      assigning a different observation record of the plurality of observation records to the second split subset, based on a determination that a numeric value calculated using a token derived from the different observation record is (a) less than the first numeric value or (b) greater than or equal to the second numeric value.
  - 16. The method as recited in claim 6, wherein the one or more storage objects include a second storage object, wherein the second storage object comprises a particular number of observation records, further comprising performing, by the one or more computing devices:
    - distributing, as part of a single sequential pass of analysis through the second storage object, wherein no observation record of the particular number of observation records is examined more than once in the single sequential pass, individual ones of the particular number of observation records among one or more split subsets indicated in the split specification.
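
The boundary and complement logic of claims 14 and 15 can be sketched as follows. This is an illustrative reading, not the claimed implementation: the names `token_value` and `belongs`, the `|` delimiter, the use of SHA-256, and the value range of 0 to 99 are assumptions. A test set is the half-open interval from the starting boundary up to, but excluding, the ending boundary; the matching training set uses the same boundaries with the complement setting switched on.

```python
import hashlib


def token_value(storage_id, key, seed="", value_range=100):
    """Hash a token built from storage-object id, record key, and seed into [0, value_range)."""
    # Assumed element order and delimiter; claim 11 leaves the order configurable.
    token = "|".join((str(storage_id), str(key), str(seed)))
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % value_range


def belongs(value, start, end, complement=False):
    """Return True if `value` falls in the subset described by the boundaries.

    Without complement: start <= value < end (for example, a test set).
    With complement: value < start or value >= end (the matching training set).
    """
    inside = start <= value < end
    return not inside if complement else inside


if __name__ == "__main__":
    # One cross-validation fold: test set is [20, 40), training set is everything else.
    for key in range(5):
        v = token_value("table-1", key, seed="fold-demo")
        subset = "test" if belongs(v, 20, 40) else "train"
        print(key, v, subset)
```

Since both specifications share the same boundaries and the same token derivation, every record lands in exactly one of the two subsets, which is what keeps the training and test sets of a fold disjoint.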

17. A non-transitory computer-accessible storage medium storing program instructions that when executed on one or more processors of a machine learning service:
- generate, with respect to a particular observation record of a plurality of observation records of a data set, wherein the data set is stored in one or more storage objects, wherein the particular observation record is stored in a particular storage object of the one or more storage objects, a token based at least in part on one or more of:
  
  (a) an identifier of the particular storage object or (b) a key value corresponding to the particular observation record, wherein the key value corresponding to the particular observation record differs from respective key values of one or more other observation records stored in the particular storage object;
  
  assign, based at least in part on a particular numeric value calculated using the token, the particular observation record to a first split subset indicated in a split specification associated with the data set; and
  
  provide an indication, to a destination corresponding to the first split subset, of the assignment of the particular observation record to the first split subset, wherein the indication of assignment is used by the machine learning service to access the first split subset.
- Dependent claims (18, 19, 20, 21)
  - 18. The non-transitory computer-accessible storage medium as recited in claim 17, wherein to provide the indication of the assignment of the particular observation record to the first split subset, the instructions when executed on the one or more processors:
    - transmit at least a portion of contents of the particular observation record to the destination.
  - 19. The non-transitory computer-accessible storage medium as recited in claim 17, wherein the instructions when executed on the one or more processors:
    - examine a split request corresponding to a second data set, wherein the split request indicates a particular split strategy to be implemented on the second data set on behalf of a particular client of the machine learning service, wherein the particular split strategy is selected from a set comprising (a) a sequential split strategy, (b) a chunk-level split strategy, or (c) a record-level strategy;
      
      transmit, based at least in part on contents of a knowledge base of the machine learning service, a recommendation to utilize a different split strategy from the set.
  - 20. The non-transitory computer-accessible storage medium as recited in claim 17, wherein to generate the token, the instructions when executed on the one or more processors:
    - apply one or more hash functions to an object formed by combining a plurality of token-contributor elements including (a) the identifier of the storage object and (b) the key value.
  - 21. The non-transitory computer-accessible storage medium as recited in claim 17, wherein the instructions when executed on the one or more processors:
    - increment, after the particular observation record has been assigned, a count of the number of observation records that have been assigned to the first split subset;
      
      determine, based at least in part on a comparison of the count with a total population of the data set, that additional observation records of the data set do not have to be examined; and
      
      provide an indication to the destination that assignment of observation records to the first split subset has been completed.
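
One possible reading of claim 21 is a counter that stops the scan early once a subset is full. The sketch below assumes, purely for illustration, that the subset is capped at a fraction of the total population; the function name `assign_until_full` and the parameters `max_fraction` and `end_boundary` are hypothetical and do not come from the patent text.

```python
def assign_until_full(record_values, total_population, max_fraction, end_boundary):
    """Assign records whose value is below `end_boundary` until the subset is full.

    `record_values` is an iterable of (record_id, numeric_value) pairs.
    Returns (assigned_ids, examined_count, completed_flag).
    """
    # Assumed stopping rule: the subset holds at most this many records.
    limit = max_fraction * total_population
    assigned = []
    examined = 0
    completed = False
    for record_id, value in record_values:
        examined += 1
        if value < end_boundary:
            assigned.append(record_id)  # incrementing the count for the subset
            if len(assigned) >= limit:
                # Comparing the count with the population shows no more
                # records need to be examined for this subset.
                completed = True
                break
    return assigned, examined, completed


if __name__ == "__main__":
    values = [(i, i % 10) for i in range(100)]
    ids, examined, done = assign_until_full(values, 100, 0.25, 5)
    print(len(ids), examined, done)
```

In this sketch the `completed` flag stands in for the indication sent to the destination that assignment to the subset has finished.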

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Zheng, Tianming; Correa, Nicolle M.; Dirac, Leo Parker; Jesensky, James Joseph; Steele, Robert Matthias
Primary Examiner(s): Pyo, Monica M
Application Number: US14/950,953
Time in Patent Office: 1,344 Days
Field of Search

707747
US Class Current
CPC Class Codes

G06F 16/137   Hash-based content-based in...

G06N 20/00   Machine learning

G06N 7/01   Probabilistic graphical mod...

G06Q 30/06   Buying, selling or leasing ...
