INPUT PROCESSING FOR MACHINE LEARNING
Abstract
A record extraction request for a data set is received at a machine learning service. A plan to perform one or more chunk-level operations (such as sampling, shuffling, splitting or partitioning for parallel computation) on chunks of the data set is generated. A set of data transfers that results in a particular chunk being stored in a particular server's memory is initiated to implement the first chunk-level operation of the sequence. A second operation such as another filtering operation or a feature processing operation is performed on a result set of the first chunk-level operation.
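The chunk-level operations summarized in the abstract can be sketched as follows, under the purely illustrative assumption of fixed-size contiguous chunks addressed by byte offset (the names `map_to_chunks`, `chunk_level_split`, and the 64 MiB chunk size are hypothetical, not taken from the patent): the data set is first mapped to chunk descriptors, and shuffling and splitting then manipulate the descriptor list at chunk granularity rather than touching individual observation records.

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024  # hypothetical 64 MiB chunk size

def map_to_chunks(total_size, chunk_size=CHUNK_SIZE):
    """Map a data set to contiguous chunks described as (offset, length) pairs."""
    chunks = []
    offset = 0
    while offset < total_size:
        length = min(chunk_size, total_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks

def chunk_level_split(chunks, train_fraction=0.8, seed=0):
    """Shuffle at chunk granularity, then split into training and test subsets."""
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Because only (offset, length) descriptors are shuffled and split, the cost of these operations is independent of the number of records in the data set.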
22 Claims
1. A system, comprising:
one or more computing devices configured to:
receive, via a programmatic interface of a machine learning service of a provider network, a request to extract observation records of a particular data set from one or more file sources, wherein a size of the particular data set exceeds a size of a first memory portion available for the particular data set at a first server of the machine learning service;
map the particular data set to a plurality of contiguous chunks, including a particular contiguous chunk whose size does not exceed the first memory portion;
generate, based at least in part on a filtering descriptor indicated in the request, a filtering plan to perform a sequence of chunk-level filtering operations on the plurality of contiguous chunks, wherein an operation type of individual ones of the sequence of filtering operations comprises one or more of: (a) sampling, (b) shuffling, (c) splitting, or (d) partitioning for parallel computation, and wherein the filtering plan includes a first chunk-level filtering operation followed by a second chunk-level filtering operation;
execute, to implement the first chunk-level filtering operation, at least a set of reads directed to one or more persistent storage devices at which at least a subset of the plurality of contiguous chunks are stored, wherein, subsequent to the set of reads, the first memory portion comprises at least the particular contiguous chunk;
implement the second chunk-level filtering operation on an in-memory result set of the first chunk-level filtering operation, without re-reading from the one or more persistent storage devices, and without copying the particular contiguous chunk; and
extract a plurality of observation records from an output of the sequence of chunk-level filtering operations.
- View Dependent Claims (2, 3, 4, 5)
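One way to picture claim 1's "without re-reading ... and without copying" limitation is a toy sketch (all names hypothetical) in which the second chunk-level operation selects references to chunk buffers that are already resident from the first operation's single pass of reads, so no further I/O or byte copying occurs:

```python
import io

def read_chunks(fileobj, chunk_size):
    """First chunk-level operation: a single pass of reads against storage,
    after which the chunks are resident in the server's memory portion."""
    chunks = []
    while True:
        buf = fileobj.read(chunk_size)
        if not buf:
            break
        chunks.append(memoryview(buf))  # keep views so later steps avoid copies
    return chunks

def sample_chunks_in_memory(chunks, keep_every=2):
    """Second chunk-level operation, applied to the in-memory result set of the
    first: it issues no reads against the storage device and copies no chunk
    bytes, since only references (memoryviews) are selected."""
    return [c for i, c in enumerate(chunks) if i % keep_every == 0]
```

For example, `sample_chunks_in_memory(read_chunks(io.BytesIO(b"abcdefgh"), chunk_size=4))` keeps one of the two resident 4-byte chunks without rereading the stream.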
6. A method, comprising:
performing, on one or more computing devices:
receiving, at a machine learning service, a request to extract observation records of a particular data set from one or more data sources;
mapping the particular data set to a plurality of chunks including a particular chunk;
generating a filtering plan to perform a sequence of chunk-level filtering operations on the plurality of chunks, wherein an operation type of individual ones of the sequence of filtering operations comprises one or more of: (a) sampling, (b) shuffling, (c) splitting, or (d) partitioning for parallel computation, and wherein the filtering plan includes a first chunk-level filtering operation followed by a second chunk-level filtering operation;
initiating, to implement the first chunk-level filtering operation, a set of data transfers directed to one or more persistent storage devices at which at least a subset of the plurality of chunks is stored, wherein, subsequent to the set of data transfers, the first memory portion comprises at least the particular chunk;
implementing the second chunk-level filtering operation on an in-memory result set of the first chunk-level filtering operation; and
extracting a plurality of observation records from an output of the sequence of chunk-level filtering operations.
- View Dependent Claims (7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
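Of the four operation types recited in claim 6, partitioning for parallel computation is perhaps the simplest to sketch. This hypothetical round-robin assignment (the function name and scheme are illustrative, not from the patent) gives each parallel worker a disjoint subset of the plurality of chunks:

```python
def partition_for_parallelism(chunks, num_workers):
    """Chunk-level partitioning: assign chunks round-robin so that each
    parallel worker operates on a disjoint subset of the chunk list."""
    partitions = [[] for _ in range(num_workers)]
    for i, chunk in enumerate(chunks):
        partitions[i % num_workers].append(chunk)
    return partitions
```

Because the assignment is a function of chunk index alone, each worker can compute its own partition independently, with no coordination beyond agreeing on `num_workers`.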
17. A non-transitory computer-accessible storage medium storing program instructions that when executed on one or more processors:
generate, in response to receiving a request to extract observation records of a particular data set from one or more data sources at a machine learning service, a plan to perform one or more chunk-level operations including a first chunk-level operation on a plurality of chunks of the particular data set, wherein an operation type of the first chunk-level operation comprises one or more of: (a) sampling, (b) shuffling, (c) splitting, or (d) partitioning for parallel computation;
initiate, to implement the first chunk-level operation, a set of data transfers directed to one or more persistent storage devices at which at least a subset of the plurality of chunks is stored, wherein, subsequent to the set of data transfers, a first memory portion of a particular server of the machine learning service comprises at least a particular chunk of the plurality of chunks; and
implement a second operation on a result set of the first chunk-level operation, wherein the second operation comprises one or more of: (a) another filtering operation, (b) a feature processing operation or (c) an aggregation operation.
- View Dependent Claims (18, 19, 20, 21, 22)
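Claim 17 broadens the second operation beyond filtering to feature processing or aggregation. A minimal aggregation sketch, under the purely illustrative assumption of newline-delimited records (the helper names are hypothetical), might compute a summary statistic over the records extracted from the first operation's in-memory result set:

```python
def extract_records(chunk_bytes):
    """Extract newline-delimited observation records from one chunk's bytes."""
    return [line for line in chunk_bytes.split(b"\n") if line]

def aggregate_record_lengths(chunks):
    """A second operation of the aggregation type: summarize the observation
    records extracted from the first chunk-level operation's result set."""
    records = [r for c in chunks for r in extract_records(c)]
    return {"count": len(records),
            "mean_length": sum(len(r) for r in records) / len(records)}
```

A feature processing operation would have the same shape, but would emit a transformed value per record rather than a single summary over all of them.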
Specification