INPUT PROCESSING FOR MACHINE LEARNING
Abstract
A record extraction request for a data set is received at a machine learning service. A plan to perform one or more chunk-level operations (such as sampling, shuffling, splitting or partitioning for parallel computation) on chunks of the data set is generated. A set of data transfers that results in a particular chunk being stored in a particular server's memory is initiated to implement the first chunk-level operation of the sequence. A second operation such as another filtering operation or a feature processing operation is performed on a result set of the first chunk-level operation.
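The chunk-level operations summarized in the abstract can be sketched as follows, under the purely illustrative assumption of fixed-size contiguous chunks addressed by byte offset (the names `map_to_chunks`, `chunk_level_split`, and the 64 MiB chunk size are hypothetical, not taken from the patent): the data set is first mapped to chunk descriptors, and shuffling and splitting then manipulate the descriptor list at chunk granularity rather than touching individual observation records.

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024  # hypothetical 64 MiB chunk size

def map_to_chunks(total_size, chunk_size=CHUNK_SIZE):
    """Map a data set to contiguous chunks described as (offset, length) pairs."""
    chunks = []
    offset = 0
    while offset < total_size:
        length = min(chunk_size, total_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks

def chunk_level_split(chunks, train_fraction=0.8, seed=0):
    """Shuffle at chunk granularity, then split into training and test subsets."""
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Because only (offset, length) descriptors are shuffled and split, the cost of these operations is independent of the number of records in the data set.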
22 Claims
1. A system, comprising:
one or more computing devices configured to:
receive, via a programmatic interface of a machine learning service of a provider network, a request to extract observation records of a particular data set from one or more file sources, wherein a size of the particular data set exceeds a size of a first memory portion available for the particular data set at a first server of the machine learning service;
map the particular data set to a plurality of contiguous chunks, including a particular contiguous chunk whose size does not exceed the first memory portion;
generate, based at least in part on a filtering descriptor indicated in the request, a filtering plan to perform a sequence of chunk-level filtering operations on the plurality of contiguous chunks, wherein an operation type of individual ones of the sequence of filtering operations comprises one or more of: (a) sampling, (b) shuffling, (c) splitting, or (d) partitioning for parallel computation, and wherein the filtering plan includes a first chunk-level filtering operation followed by a second chunk-level filtering operation;
execute, to implement the first chunk-level filtering operation, at least a set of reads directed to one or more persistent storage devices at which at least a subset of the plurality of contiguous chunks are stored, wherein, subsequent to the set of reads, the first memory portion comprises at least the particular contiguous chunk;
implement the second chunk-level filtering operation on an in-memory result set of the first chunk-level filtering operation, without re-reading from the one or more persistent storage devices, and without copying the particular contiguous chunk; and
extract a plurality of observation records from an output of the sequence of chunk-level filtering operations.
- View Dependent Claims (2, 3, 4, 5)
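One way to picture claim 1's "without re-reading ... and without copying" limitation is a toy sketch (all names hypothetical) in which the second chunk-level operation selects references to chunk buffers that are already resident from the first operation's single pass of reads, so no further I/O or byte copying occurs:

```python
import io

def read_chunks(fileobj, chunk_size):
    """First chunk-level operation: a single pass of reads against storage,
    after which the chunks are resident in the server's memory portion."""
    chunks = []
    while True:
        buf = fileobj.read(chunk_size)
        if not buf:
            break
        chunks.append(memoryview(buf))  # keep views so later steps avoid copies
    return chunks

def sample_chunks_in_memory(chunks, keep_every=2):
    """Second chunk-level operation, applied to the in-memory result set of the
    first: it issues no reads against the storage device and copies no chunk
    bytes, since only references (memoryviews) are selected."""
    return [c for i, c in enumerate(chunks) if i % keep_every == 0]
```

For example, `sample_chunks_in_memory(read_chunks(io.BytesIO(b"abcdefgh"), chunk_size=4))` keeps one of the two resident 4-byte chunks without rereading the stream.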
6. A method, comprising:
performing, on one or more computing devices:
receiving, at a machine learning service, a request to extract observation records of a particular data set from one or more data sources;
mapping the particular data set to a plurality of chunks including a particular chunk;
generating a filtering plan to perform a sequence of chunk-level filtering operations on the plurality of chunks, wherein an operation type of individual ones of the sequence of filtering operations comprises one or more of: (a) sampling, (b) shuffling, (c) splitting, or (d) partitioning for parallel computation, and wherein the filtering plan includes a first chunk-level filtering operation followed by a second chunk-level filtering operation;
initiating, to implement the first chunk-level filtering operation, a set of data transfers directed to one or more persistent storage devices at which at least a subset of the plurality of chunks is stored, wherein, subsequent to the set of data transfers, the first memory portion comprises at least the particular chunk;
implementing the second chunk-level filtering operation on an in-memory result set of the first chunk-level filtering operation; and
extracting a plurality of observation records from an output of the sequence of chunk-level filtering operations.
- View Dependent Claims (7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
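Of the four operation types recited in claim 6, partitioning for parallel computation is perhaps the simplest to sketch. This hypothetical round-robin assignment (the function name and scheme are illustrative, not from the patent) gives each parallel worker a disjoint subset of the plurality of chunks:

```python
def partition_for_parallelism(chunks, num_workers):
    """Chunk-level partitioning: assign chunks round-robin so that each
    parallel worker operates on a disjoint subset of the chunk list."""
    partitions = [[] for _ in range(num_workers)]
    for i, chunk in enumerate(chunks):
        partitions[i % num_workers].append(chunk)
    return partitions
```

Because the assignment is a function of chunk index alone, each worker can compute its own partition independently, with no coordination beyond agreeing on `num_workers`.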
17. A non-transitory computer-accessible storage medium storing program instructions that when executed on one or more processors:
generate, in response to receiving a request to extract observation records of a particular data set from one or more data sources at a machine learning service, a plan to perform one or more chunk-level operations including a first chunk-level operation on a plurality of chunks of the particular data set, wherein an operation type of the first chunk-level operation comprises one or more of: (a) sampling, (b) shuffling, (c) splitting, or (d) partitioning for parallel computation;
initiate, to implement the first chunk-level operation, a set of data transfers directed to one or more persistent storage devices at which at least a subset of the plurality of chunks is stored, wherein, subsequent to the set of data transfers, a first memory portion of a particular server of the machine learning service comprises at least a particular chunk of the plurality of chunks; and
implement a second operation on a result set of the first chunk-level operation, wherein the second operation comprises one or more of: (a) another filtering operation, (b) a feature processing operation or (c) an aggregation operation.
- View Dependent Claims (18, 19, 20, 21, 22)
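Claim 17 broadens the second operation beyond filtering to feature processing or aggregation. A minimal aggregation sketch, under the purely illustrative assumption of newline-delimited records (the helper names are hypothetical), might compute a summary statistic over the records extracted from the first operation's in-memory result set:

```python
def extract_records(chunk_bytes):
    """Extract newline-delimited observation records from one chunk's bytes."""
    return [line for line in chunk_bytes.split(b"\n") if line]

def aggregate_record_lengths(chunks):
    """A second operation of the aggregation type: summarize the observation
    records extracted from the first chunk-level operation's result set."""
    records = [r for c in chunks for r in extract_records(c)]
    return {"count": len(records),
            "mean_length": sum(len(r) for r in records) / len(records)}
```

A feature processing operation would have the same shape, but would emit a transformed value per record rather than a single summary over all of them.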
Specification