Leverage offload programming model for local checkpoints
1 Assignment
0 Petitions
Abstract
Methods, apparatus, and systems for leveraging an offload programming model for local checkpoints. Compute entities in a computing environment are implemented as one or more sources and a larger number of sinks. A job dispatcher dispatches jobs comprising executable code to the source(s), and the execution of the job code is managed by the source(s). Code sections in the job code designated for offload are offloaded to the sinks by creating offload context information. In conjunction with each offload, an offload object is generated and written to storage. The offloaded code sections are executed by the sinks, which return result data to the source, e.g., via a direct write to a memory buffer specified in the offload context information. The health of the sinks is monitored to detect failures, and upon a failure the source retrieves the offload object corresponding to the code section offloaded to the failed sink, regenerates the offload context information for the code section and sends this to another sink for execution.
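The failure-recovery flow the abstract describes can be sketched in a few lines of Python. This is a hedged illustration only: the function and file names are invented stand-ins, and a raised exception plus a JSON file stand in for the patent's sink health monitoring and offload-object storage.

```python
import json
import os

def run_section(section_id, payload):
    """Stand-in for a sink executing an offloaded code section."""
    if payload.get("fail"):
        raise RuntimeError("sink failed while running " + section_id)
    return payload["x"] * 2

def offload_with_checkpoint(sections, sinks, store_dir):
    """Offload each section, checkpointing its offload context to storage;
    on a sink failure, retrieve the context and re-offload to another sink."""
    results = {}
    for sec_id, payload in sections.items():
        # Offload object: identifies the code section and the target sink.
        ctx = {"section": sec_id, "sink": sinks[0], "payload": payload}
        path = os.path.join(store_dir, sec_id + ".json")
        with open(path, "w") as f:
            json.dump(ctx, f)  # local checkpoint of the offload
        try:
            results[sec_id] = run_section(sec_id, payload)
        except RuntimeError:
            # Failure detected: retrieve the stored offload object,
            # retarget it at another sink, and re-execute.
            with open(path) as f:
                ctx = json.load(f)
            ctx["sink"] = sinks[1]
            retry_payload = dict(ctx["payload"], fail=False)  # simulate a healthy sink
            results[sec_id] = run_section(sec_id, retry_payload)
    return results
```

Here the retried section simply succeeds; in the patent's scheme the re-offload is driven by the stored offload context rather than by clearing a flag.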
17 Citations
25 Claims
1. A method implemented in a computing environment including a compute entity comprising a source communicatively coupled to a plurality of compute entities comprising sinks, the method comprising:
managing, using the source, execution of a job comprising executable code;
employing the source to execute the job;
detecting, during execution of the job, sections of the executable code to be offloaded to sinks, each comprising a respective code section including one or more functions to be offloaded to a sink;
constructing, for each code section to be offloaded to a sink, offload context information identifying one of the code section or indicia identifying the one or more functions, and information identifying the sink;
offloading the code sections to the plurality of sinks;
storing, for each code section that is offloaded to a sink, the offload context information constructed for that code section;
receiving, for offloaded code sections, results generated by the sinks to which the code sections were offloaded;
detecting that a sink has failed to successfully execute a code section that was offloaded to the sink, and in response thereto, retrieving the offload context information corresponding to the code section offloaded to the sink; and
offloading the code section to another sink for execution.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
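Claim 1's "offload context information" identifies either the code section itself or indicia of its offloaded functions, plus the target sink, and maps naturally onto a small record type. The sketch below uses invented names, and an in-memory dict stands in for real checkpoint storage:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class OffloadContext:
    section_id: str
    function_names: list  # indicia identifying the offloaded functions
    sink_id: str          # information identifying the sink

class ContextStore:
    """In-memory stand-in for the per-section checkpoint storage."""
    def __init__(self):
        self._store = {}

    def save(self, ctx: OffloadContext):
        # Serialize so the stored context survives independently of the source.
        self._store[ctx.section_id] = json.dumps(asdict(ctx))

    def load(self, section_id: str) -> OffloadContext:
        return OffloadContext(**json.loads(self._store[section_id]))

    def reassign(self, section_id: str, new_sink: str) -> OffloadContext:
        # On sink failure: retrieve the stored context and retarget it.
        ctx = self.load(section_id)
        ctx.sink_id = new_sink
        self.save(ctx)
        return ctx
```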
12. A server platform comprising:
a host processor coupled to host memory;
a plurality of expansion slots communicatively coupled to the host processor;
one or more many integrated core (MIC) devices installed in respective expansion slots, each MIC device including a plurality of processor cores and on-board memory; and
a network adaptor, installed in an expansion slot or implemented as a component communicatively coupled to the host processor;
wherein the server platform further includes software instructions configured to be executed on the host processor and a plurality of the processor cores in the MIC device to enable the server platform to:
configure the host processor as a source and at least a portion of the plurality of processor cores in the MIC device as sinks;
configure memory mappings between the on-board MIC memory and the host memory;
manage execution of a job comprising executable code on the host processor;
employ the source to execute the job;
detect, during execution of the job, sections of the executable code to be offloaded to sinks, each comprising a respective code section including one or more functions to be offloaded to a sink;
construct, for each code section to be offloaded to a sink, offload context information identifying one of the code section or indicia identifying the one or more functions, and information identifying the sink;
offload the code sections to the plurality of sinks;
transmit for storage on a non-volatile storage device accessible via a network coupled to the network adaptor, for each code section that is offloaded to a sink, offload context information identifying the code section that is offloaded and the sink it is offloaded to;
execute the offloaded code sections on the sinks to generate result data;
store the result data in memory buffers accessible to the host processor;
detect that a sink has failed to successfully execute a code section that was offloaded to the sink, and in response thereto, retrieve the previously stored offload context information corresponding to the code section offloaded to the sink; and
offload the code section to another sink for execution.
- View Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20, 21)
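Claim 12's element in which sinks "store the result data in memory buffers accessible to the host processor" can be illustrated with worker threads writing directly into a shared anonymous mmap. The threads and the mmap are stand-ins for MIC cores and the MIC-to-host memory mapping, not the patent's actual mechanism:

```python
import mmap
import struct
import threading

def sink_worker(buf, offset, value):
    # Direct write into the host-visible buffer at this sink's slot.
    struct.pack_into("q", buf, offset, value * value)

def gather_results(values):
    # Anonymous mmap stands in for the host-mapped MIC on-board memory;
    # each sink gets an 8-byte slot to deposit its result into.
    buf = mmap.mmap(-1, 8 * len(values))
    workers = [threading.Thread(target=sink_worker, args=(buf, 8 * i, v))
               for i, v in enumerate(values)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    # The source reads results straight out of the shared buffer.
    return [struct.unpack_from("q", buf, 8 * i)[0] for i in range(len(values))]
```

Because each sink writes only to its own slot, the source needs no per-result message passing; it reads the buffer after the sinks complete.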
22. At least one tangible non-transitory machine-readable medium having instructions stored thereon configured to be executed by compute entities in a server platform including:
a host processor comprising a first compute entity;
host memory coupled to the host processor;
a plurality of expansion slots communicatively coupled to the host processor;
one or more many integrated core (MIC) devices installed in respective expansion slots, each MIC device including a plurality of processor cores comprising compute entities and on-board memory; and
a network adaptor, installed in an expansion slot or implemented as a component communicatively coupled to the host processor;
wherein execution of the instructions by the host processor and processor cores in the one or more MIC devices enables the server platform to:
configure the host processor as a source and at least a portion of the plurality of processor cores in the one or more MIC devices as sinks;
configure, for each MIC device, memory mappings between the on-board MIC memory of the MIC device and the host memory;
manage execution of a job comprising executable code on the host processor;
employ the source to execute the job;
detect, during execution of the job, sections of the executable code to be offloaded to sinks, each comprising a respective code section including one or more functions to be offloaded to a sink;
construct, for each code section to be offloaded to a sink, offload context information identifying one of the code section or indicia identifying the one or more functions, and information identifying the sink;
offload the code sections to the plurality of sinks;
transmit for storage on a non-volatile storage device accessible via a network coupled to the network adaptor, for each code section that is offloaded to a sink, offload context information identifying the code section that is offloaded and the sink it is offloaded to;
execute the offloaded code sections on the sinks to generate result data;
store the result data in memory buffers accessible to the host processor;
detect that a sink has failed to successfully execute a code section that was offloaded to the sink, and in response thereto, retrieve the previously stored offload context information corresponding to the code section offloaded to the sink; and
offload the code section to another sink for execution.
- View Dependent Claims (23, 24, 25)
Specification