Consistent filtering of machine learning data
Abstract
Consistency metadata, including a parameter for a pseudo-random number source, are determined for training-and-evaluation iterations of a machine learning model. Using the metadata, a first training set comprising records of at least a first chunk is identified from a plurality of chunks of a data set. The first training set is used to train a machine learning model during a first training-and-evaluation iteration. A first test set comprising records of at least a second chunk is identified using the metadata, and is used to evaluate the model during the first training-and-evaluation iteration.
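As an illustration of the idea in the abstract, the following minimal Python sketch shows how a stored seed can make a training/test split repeatable. It is not the patented implementation: the `consistency_metadata` dictionary, the `split_records` function, and the `train_fraction` parameter are hypothetical names, and the sketch simplifies the scheme by drawing a single pseudo-random number per record rather than using two separate sets of numbers.

```python
import random


def split_records(record_ids, seed, train_fraction=0.8):
    """Assign every record to exactly one of (train, test) using a seeded PRNG.

    `seed` stands in for the initialization parameter carried by the
    consistency metadata (an assumption made for this sketch).
    """
    rng = random.Random(seed)  # private generator; global random state is untouched
    train, test = [], []
    for rid in record_ids:
        # One draw per record, so a record can never land in both sets.
        if rng.random() < train_fraction:
            train.append(rid)
        else:
            test.append(rid)
    return train, test


# Hypothetical metadata object; only the seed is used here.
consistency_metadata = {"seed": 12345}
records = list(range(20))

train_a, test_a = split_records(records, consistency_metadata["seed"])
train_b, test_b = split_records(records, consistency_metadata["seed"])

# Re-running with the same metadata reproduces the same split.
assert (train_a, test_a) == (train_b, test_b)
# The two sets are disjoint and together cover the whole data set.
assert set(train_a).isdisjoint(test_a)
assert sorted(train_a + test_a) == records
```

Because the generator is re-created from the stored seed on each call, two separate runs (or two separate servers) that share the metadata arrive at the same assignment without exchanging the record lists themselves.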
20 Claims
-
1. A system, comprising:
- one or more computing devices configured to:
generate consistency metadata to be used for one or more training-and-evaluation iterations of a machine learning model, wherein the consistency metadata comprises at least a particular initialization parameter value for a pseudo-random number source;
sub-divide an address space of a particular data set of the machine learning model into a plurality of contiguous chunks, including a first chunk comprising a first plurality of observation records, and a second chunk comprising a second plurality of observation records;
retrieve, from one or more persistent storage devices, observation records of the first chunk into a memory of a first server, and observation records of the second chunk into a memory of a second server;
select, using a first set of pseudo-random numbers, a first training set from the plurality of contiguous chunks, wherein the first training set includes at least a portion of the first chunk, wherein observation records of the first training set are used to train the machine learning model during a first training-and-evaluation iteration of the one or more training-and-evaluation iterations, and wherein the first set of pseudo-random numbers is obtained using the consistency metadata; and
select, using a second set of pseudo-random numbers, a first test set from the plurality of contiguous chunks, wherein the first test set includes at least a portion of the second chunk, wherein observation records of the first test set are used to evaluate the machine learning model during the first training-and-evaluation iteration, and wherein the second set of pseudo-random numbers is obtained using the consistency metadata;
wherein the first and second sets of pseudo-random numbers cause individual observation records to be selected for exactly one of the first training set or the first test set.
Dependent claims: 2, 3, 4, 5
-
6. A method, comprising:
- performing, by one or more computing devices:
determining consistency metadata to be used for one or more training-and-evaluation iterations of a machine learning model, wherein the consistency metadata comprises at least a particular parameter value for a pseudo-random number source;
sub-dividing an address space of a particular data set of the machine learning model into a plurality of contiguous chunks, including a first chunk comprising a first plurality of observation records, and a second chunk comprising a second plurality of observation records;
selecting, using a first set of pseudo-random numbers obtained via the consistency metadata, a first training set from the plurality of contiguous chunks, wherein the first training set includes at least a portion of the first chunk, and wherein observation records of the first training set are used to train the machine learning model during a first training-and-evaluation iteration of the one or more training-and-evaluation iterations; and
selecting, using a second set of pseudo-random numbers obtained via the consistency metadata, a first test set from the plurality of contiguous chunks, wherein the first test set includes at least a portion of the second chunk, and wherein observation records of the first test set are used to evaluate the machine learning model during the first training-and-evaluation iteration;
wherein the first and second sets of pseudo-random numbers cause individual observation records to be selected for exactly one of the first training set or the first test set.
Dependent claims: 7, 8, 9, 10, 11, 12, 13, 14, 15
-
16. A non-transitory computer-accessible storage medium storing program instructions that when executed on one or more processors:
determine consistency metadata to be used for one or more training-and-evaluation iterations of a machine learning model, wherein the consistency metadata comprises at least a particular parameter value for a pseudo-random number source;
select, using a first set of pseudo-random numbers obtained via the consistency metadata, a first training set from a plurality of contiguous chunks of a particular data set, wherein individual ones of the plurality of chunks comprise one or more observation records, wherein the first training set includes at least a portion of a first chunk of the plurality of contiguous chunks, and wherein observation records of the first training set are used to train the machine learning model during a first training-and-evaluation iteration of the one or more training-and-evaluation iterations; and
select, using a second set of pseudo-random numbers obtained via the consistency metadata, a first test set from the plurality of contiguous chunks, wherein the first test set includes at least a portion of a second chunk of the plurality of contiguous chunks, and wherein observation records of the first test set are used to evaluate the machine learning model during the first training-and-evaluation iteration;
wherein the first and second sets of pseudo-random numbers cause individual observation records to be selected for exactly one of the first training set or the first test set.
Dependent claims: 17, 18, 19, 20
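The independent claims above share a common pattern: an address space divided into contiguous chunks, and a seeded selection that places data in exactly one of the training or test sets. One way to picture this at chunk granularity is the Python sketch below. It rests on assumptions that are not in the claims: records are addressed by integer offsets, whole chunks (rather than portions of chunks) are assigned, and the names `make_chunks`, `chunk_level_split`, `chunk_size`, and `train_fraction` are hypothetical.

```python
import random


def make_chunks(num_records, chunk_size):
    """Divide the address space [0, num_records) into contiguous (start, end) ranges."""
    return [
        (start, min(start + chunk_size, num_records))
        for start in range(0, num_records, chunk_size)
    ]


def chunk_level_split(chunks, seed, train_fraction=0.75):
    """Shuffle chunk indices with a seeded PRNG, then cut the shuffled order in two.

    Every chunk index appears once in the shuffled order, so each chunk
    ends up in exactly one of the two returned lists.
    """
    order = list(range(len(chunks)))
    random.Random(seed).shuffle(order)  # same seed gives the same permutation
    cut = int(len(order) * train_fraction)
    train = [chunks[i] for i in order[:cut]]
    test = [chunks[i] for i in order[cut:]]
    return train, test


chunks = make_chunks(num_records=1000, chunk_size=100)
train_chunks, test_chunks = chunk_level_split(chunks, seed=42)

# No chunk is shared, and no chunk is dropped.
assert set(train_chunks).isdisjoint(test_chunks)
assert sorted(train_chunks + test_chunks) == chunks
```

Working at chunk granularity means each server only needs the seed and the chunk boundaries to decide which address ranges to load for training or for evaluation; selecting individual records within a chunk would require a further seeded step along the same lines.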
-
Specification