Consistent filtering of machine learning data

US 10,540,606 B2
Filed: 08/14/2014
Issued: 01/21/2020
Est. Priority Date: 06/30/2014
Status: Active Grant

Abstract

Consistency metadata, including a parameter for a pseudo-random number source, are determined for training-and-evaluation iterations of a machine learning model. Using the metadata, a first training set comprising records of at least a first chunk is identified from a plurality of chunks of a data set. The first training set is used to train a machine learning model during a first training-and-evaluation iteration. A first test set comprising records of at least a second chunk is identified using the metadata, and is used to evaluate the model during the first training-and-evaluation iteration.
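
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the dictionary used for the metadata, and the `test_fraction` parameter are all hypothetical. It assumes the consistency metadata reduces to a single seed, and that the training and test selections each re-create the pseudo-random number source from that seed so both see the same sequence of draws.

```python
import random

def make_consistency_metadata(seed):
    # Hypothetical metadata record: only the PRNG initialization value.
    return {"seed": seed}

def split_into_chunks(records, chunk_size):
    # Sub-divide the record sequence into contiguous chunks.
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def select_records(chunks, metadata, want_test, test_fraction=0.25):
    # Re-seeding from the shared metadata makes every call draw the same
    # number for the same record, so the two selections never overlap.
    rng = random.Random(metadata["seed"])
    selected = []
    for chunk in chunks:
        for record in chunk:
            is_test = rng.random() < test_fraction
            if is_test == want_test:
                selected.append(record)
    return selected

records = list(range(100))
metadata = make_consistency_metadata(42)
chunks = split_into_chunks(records, 10)
training_set = select_records(chunks, metadata, want_test=False)
test_set = select_records(chunks, metadata, want_test=True)
```

Because both calls consume the same pseudo-random sequence, each record lands in exactly one of the two sets, even if the selections run at different times or on different servers.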

20 Claims

1. A system, comprising:
- one or more computing devices configured to:
  
  generate consistency metadata to be used for one or more training-and-evaluation iterations of a machine learning model, wherein the consistency metadata comprises at least a particular initialization parameter value for a pseudo-random number source;
  
  sub-divide an address space of a particular data set of the machine learning model into a plurality of contiguous chunks, including a first chunk comprising a first plurality of observation records, and a second chunk comprising a second plurality of observation records;
  
  retrieve, from one or more persistent storage devices, observation records of the first chunk into a memory of a first server, and observation records of the second chunk into a memory of a second server;
  
  select, using a first set of pseudo-random numbers, a first training set from the plurality of contiguous chunks, wherein the first training set includes at least a portion of the first chunk, wherein observation records of the first training set are used to train the machine learning model during a first training-and-evaluation iteration of the one or more training-and-evaluation iterations, and wherein the first set of pseudo-random numbers is obtained using the consistency metadata; and
  
  select, using a second set of pseudo-random numbers, a first test set from the plurality of contiguous chunks, wherein the first test set includes at least a portion of the second chunk, wherein observation records of the first test set are used to evaluate the machine learning model during the first training-and-evaluation iteration, and wherein the second set of pseudo-random numbers is obtained using the consistency metadata;
  
  wherein the first and second sets of pseudo-random numbers cause individual observation records to be selected for exactly one of the first training set or the first test set.
- - 2. The system as recited in claim 1, wherein the one or more computing devices are further configured to:
    - insert a first job corresponding to the selection of the first training set in a collection of jobs to be scheduled at a machine learning service, and a second job corresponding to the selection of the first test set in the collection; and
      
      schedule the second job for execution asynchronously with respect to the first job.
  - 3. The system as recited in claim 1, wherein the one or more computing devices are configured to:
    - receive, from a client of a machine learning service, a request for the one or more training-and-evaluation iterations, wherein the request indicates at least a portion of the consistency metadata.
  - 4. The system as recited in claim 1, wherein the consistency metadata is based at least in part on an identifier of a data object in which one or more observation records of the particular data set are stored.
  - 5. The system as recited in claim 1, wherein the one or more computing devices are further configured to:
    - reorder observation records of the first chunk prior to presenting the observation records of the first training set as input to the machine learning model.
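
Claim 4 bases the consistency metadata on an identifier of the data object holding the records. One way this might look, as a sketch only: hash the identifier to obtain a seed. The function name, the choice of SHA-256, the 64-bit truncation, and the example identifier are assumptions made for illustration and do not come from the patent.

```python
import hashlib
import random

def seed_from_object_id(object_id):
    # Hypothetical derivation: hash the identifier and keep the first
    # 8 bytes as an unsigned integer seed. The same identifier always
    # yields the same seed, on any server.
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# Two independent workers derive the same PRNG state from the same identifier.
object_id = "example-bucket/observations.csv"  # hypothetical identifier
rng_a = random.Random(seed_from_object_id(object_id))
rng_b = random.Random(seed_from_object_id(object_id))
draws_a = [rng_a.random() for _ in range(5)]
draws_b = [rng_b.random() for _ in range(5)]
```

Deriving the seed from the identifier means no seed has to be stored or transmitted separately, at the cost of tying the split to the object's name.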

6. A method, comprising:
- performing, by one or more computing devices:
  
  determining consistency metadata to be used for one or more training-and-evaluation iterations of a machine learning model, wherein the consistency metadata comprises at least a particular parameter value for a pseudo-random number source;
  
  sub-dividing an address space of a particular data set of the machine learning model into a plurality of contiguous chunks, including a first chunk comprising a first plurality of observation records, and a second chunk comprising a second plurality of observation records;
  
  selecting, using a first set of pseudo-random numbers obtained via the consistency metadata, a first training set from the plurality of contiguous chunks, wherein the first training set includes at least a portion of the first chunk, and wherein observation records of the first training set are used to train the machine learning model during a first training-and-evaluation iteration of the one or more training-and-evaluation iterations; and
  
  selecting, using a second set of pseudo-random numbers obtained via the consistency metadata, a first test set from the plurality of contiguous chunks, wherein the first test set includes at least a portion of the second chunk, and wherein observation records of the first test set are used to evaluate the machine learning model during the first training-and-evaluation iteration;
  
  wherein the first and second sets of pseudo-random numbers cause individual observation records to be selected for exactly one of the first training set or the first test set.
- - 7. The method as recited in claim 6, further comprising performing, by the one or more computing devices:
    - retrieving, from a persistent storage device into a memory of a first server, at least the first chunk prior to training the machine learning model during the first training-and-evaluation iteration; and
      
      selecting, for a different training-and-evaluation iteration of the one or more training-and-evaluation iterations, (a) a different training set and (b) a different test set, without copying the first chunk from the memory of the first server to a different location.
  - 8. The method as recited in claim 6, further comprising performing, by the one or more computing devices:
    - receiving, from a client of a machine learning service, a request for the one or more training-and-evaluation iterations, wherein the request indicates at least a portion of the consistency metadata.
  - 9. The method as recited in claim 8, wherein the request is formatted in accordance with a particular programmatic interface implemented by a machine learning service of a provider network.
  - 10. The method as recited in claim 6, wherein the consistency metadata is based at least in part on an identifier of a data object in which one or more observation records of the particular data set are stored.
  - 11. The method as recited in claim 6, wherein the first training set comprises at least one observation record of a third chunk of the plurality of contiguous chunks, and wherein the first test set comprises at least one observation record of the third chunk.
  - 12. The method as recited in claim 6, further comprising performing, by the one or more computing devices:
    - shuffling observation records of the first chunk prior to presenting the observation records of the first training set as input to the machine learning model.
  - 13. The method as recited in claim 6, further comprising performing, by the one or more computing devices:
    - determining a number of contiguous chunks into which the address space is to be sub-divided based at least in part on one or more of: (a) a size of available memory at a particular server or (b) a client request.
  - 14. The method as recited in claim 6, wherein the particular data set is stored in a plurality of data objects, further comprising performing, by the one or more computing devices:
    - determining an order in which the plurality of data objects are to be combined prior to sub-dividing the address space.
  - 15. The method as recited in claim 6, wherein the one or more training-and-evaluation iterations are cross-validation iterations of the machine learning model.
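
Claims 7, 13, and 15 can be read together as: size the chunks to fit server memory, load them once, then pick a different training and test set for each cross-validation iteration without moving the chunks. The sketch below is one possible reading under stated assumptions. Both function names are hypothetical, the sizing policy (smallest chunk count such that each chunk fits in memory) is invented for illustration, and selection is done on whole chunks for simplicity, whereas claim 11 also allows a single chunk to contribute records to both sets.

```python
import math
import random

def chunk_count(dataset_bytes, available_memory_bytes):
    # Hypothetical policy: fewest equal-sized chunks that each fit in memory.
    return max(1, math.ceil(dataset_bytes / available_memory_bytes))

def cross_validation_folds(num_chunks, num_folds, seed):
    # Permute chunk indices once with the seeded PRNG, then rotate which
    # slice of the permutation serves as the test set. Only indices change
    # between iterations; chunk contents stay where they were loaded.
    order = list(range(num_chunks))
    random.Random(seed).shuffle(order)
    folds = []
    for k in range(num_folds):
        test_chunks = sorted(order[k::num_folds])
        held_out = set(test_chunks)
        train_chunks = sorted(i for i in order if i not in held_out)
        folds.append((train_chunks, test_chunks))
    return folds

folds = cross_validation_folds(num_chunks=10, num_folds=5, seed=7)
```

Each element of `folds` is a pair of chunk-index lists, so a server that already holds a chunk in memory only needs to know whether its index is in the training or test list for the current iteration.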

16. A non-transitory computer-accessible storage medium storing program instructions that when executed on one or more processors:
- determine consistency metadata to be used for one or more training-and-evaluation iterations of a machine learning model, wherein the consistency metadata comprises at least a particular parameter value for a pseudo-random number source;
  
  select, using a first set of pseudo-random numbers obtained via the consistency metadata, a first training set from a plurality of contiguous chunks of a particular data set, wherein individual ones of the plurality of chunks comprise one or more observation records, wherein the first training set includes at least a portion of a first chunk of the plurality of contiguous chunks, and wherein observation records of the first training set are used to train the machine learning model during a first training-and-evaluation iteration of the one or more training-and-evaluation iterations; and
  
  select, using a second set of pseudo-random numbers obtained via the consistency metadata, a first test set from the plurality of contiguous chunks, wherein the first test set includes at least a portion of a second chunk of the plurality of contiguous chunks, and wherein observation records of the first test set are used to evaluate the machine learning model during the first training-and-evaluation iteration;
  
  wherein the first and second sets of pseudo-random numbers cause individual observation records to be selected for exactly one of the first training set or the first test set.
- - 17. The non-transitory computer-accessible storage medium as recited in claim 16, wherein the instructions when executed on the one or more processors:
    - initiate a retrieval, from a persistent storage device into a memory of a first server, of at least the first chunk prior to training the machine learning model during the first training-and-evaluation iteration; and
      
      select, for a different training-and-evaluation iteration of the one or more training-and-evaluation iterations, (a) a different training set and (b) a different test set, without copying the first chunk from the memory of the first server to a different location.
  - 18. The non-transitory computer-accessible storage medium as recited in claim 16, wherein the instructions when executed on the one or more processors:
    - receive, from a client of a machine learning service, a request for the one or more training-and-evaluation iterations, wherein the request indicates at least a portion of the consistency metadata.
  - 19. The non-transitory computer-accessible storage medium as recited in claim 16, wherein the consistency metadata is based at least in part on an identifier of a data object in which one or more observation records of the particular data set are stored.
  - 20. The non-transitory computer-accessible storage medium as recited in claim 16, wherein the instructions when executed on the one or more processors:
    - shuffle observation records of the first chunk prior to presenting the observation records of the first training set as input to the machine learning model.
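
The chunk-level shuffle recited in claims 5, 12, and 20 might be sketched as follows. This is an illustration under assumptions, not the claimed implementation: the function name is hypothetical, and mixing the seed with the chunk index (so that each chunk gets its own reproducible ordering) is a design choice made here, not something the claims specify.

```python
import random

def shuffled_chunk(chunk, seed, chunk_index):
    # Give each chunk its own deterministic PRNG by combining the shared
    # seed with the chunk's position (hypothetical mixing scheme).
    rng = random.Random(seed * 1_000_003 + chunk_index)
    reordered = list(chunk)   # copy, so the loaded chunk is left untouched
    rng.shuffle(reordered)
    return reordered

chunk = ["rec-0", "rec-1", "rec-2", "rec-3", "rec-4"]
first_pass = shuffled_chunk(chunk, seed=42, chunk_index=0)
second_pass = shuffled_chunk(chunk, seed=42, chunk_index=0)
```

Shuffling inside a chunk changes only the order in which records are presented to the model; it does not change which records belong to the chunk, so the training and test selections remain consistent.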

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Dirac, Leo Parker; Li, Jin; Zheng, Tianming; Zhuo, Donghui
Primary Examiner(s): Chen, Alan
Application Number: US14/460,314
Publication Number: US 20150379425A1
Time in Patent Office: 1,986 Days
CPC Class Codes: G06N 20/00 Machine learning
