LEVERAGE OFFLOAD PROGRAMMING MODEL FOR LOCAL CHECKPOINTS
First Claim
1. A method implemented in a computing environment including a compute entity comprising a source communicatively coupled to a plurality of compute entities comprising sinks, the method comprising:
managing execution of a job comprising executable code using the source;
offloading sections of the executable code to the plurality of sinks;
storing, for each section of code that is offloaded to a sink, offload context information identifying the section of code that is offloaded and the sink it is offloaded to;
receiving, for offloaded sections of code, results generated by the sinks to which the sections of code were offloaded;
detecting that a sink has failed to successfully execute a section of code that was offloaded to the sink, and in response thereto, retrieving the offload context information corresponding to the section of code offloaded to the sink; and
offloading the section of code to another sink for execution.
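The method of claim 1 can be sketched in code. The following is an illustrative Python sketch, not an implementation from the patent; all names (`OffloadContext`, `Source`, and the use of callables as stand-ins for sink compute entities) are assumptions made for the example. It shows the claimed steps: storing offload context per offloaded section, detecting a sink failure, retrieving the stored context, and offloading the same section to another sink.

```python
# Illustrative sketch of the claimed method; names are hypothetical.
from dataclasses import dataclass


@dataclass
class OffloadContext:
    section_id: str   # identifies the offloaded section of code
    sink_id: int      # identifies the sink it was offloaded to
    code: object      # the section itself, kept so it can be re-offloaded


class Source:
    def __init__(self, sinks):
        self.sinks = sinks      # callables standing in for sink compute entities
        self.contexts = {}      # section_id -> OffloadContext
        self.results = {}       # section_id -> result returned by a sink

    def offload(self, section_id, code, sink_id):
        # Store offload context before dispatching, so the section can be
        # recovered and re-offloaded if the chosen sink fails.
        self.contexts[section_id] = OffloadContext(section_id, sink_id, code)
        try:
            self.results[section_id] = self.sinks[sink_id](code)
        except Exception:
            # Failure detected: retrieve the stored offload context and
            # offload the same section to another sink for execution.
            ctx = self.contexts[section_id]
            retry = next(i for i in range(len(self.sinks)) if i != ctx.sink_id)
            self.offload(section_id, ctx.code, retry)
```

In this sketch a synchronous call and an exception stand in for dispatch and failure detection; in the claimed environment the sinks run asynchronously and failure is detected by monitoring, but the context-store/retrieve/re-offload flow is the same.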
1 Assignment
0 Petitions
Abstract
Methods, apparatus, and systems for leveraging an offload programming model for local checkpoints. Compute entities in a computing environment are implemented as one or more sources and a larger number of sinks. A job dispatcher dispatches jobs comprising executable code to the source(s), and the execution of the job code is managed by the source(s). Code sections in the job code designated for offload are offloaded to the sinks by creating offload context information. In conjunction with each offload, an offload object is generated and written to storage. The offloaded code sections are executed by the sinks, which return result data to the source, e.g., via a direct write to a memory buffer specified in the offload context information. The health of the sinks is monitored to detect failures, and upon a failure the source retrieves the offload object corresponding to the code section offloaded to the failed sink, regenerates the offload context information for the code section and sends this to another sink for execution.
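The abstract describes writing an offload object to storage in conjunction with each offload, then reading it back after a sink failure to regenerate the offload context. A minimal sketch of that checkpoint step, assuming a one-file-per-offload JSON record (the field names and file layout are assumptions for illustration, not specified by the patent):

```python
# Hypothetical checkpoint record for each offload; layout is illustrative.
import json
import os


def write_offload_object(store_dir, section_id, sink_id, result_buffer):
    # One small file per offload serves as the local checkpoint record.
    obj = {"section": section_id, "sink": sink_id, "result_buffer": result_buffer}
    path = os.path.join(store_dir, f"{section_id}.json")
    with open(path, "w") as f:
        json.dump(obj, f)
    return path


def recover_offload_context(store_dir, section_id):
    # After a sink failure, read the stored offload object back so the
    # source can regenerate the offload context for another sink.
    with open(os.path.join(store_dir, f"{section_id}.json")) as f:
        return json.load(f)
```

Because the record is written before (or alongside) each offload, recovery needs no global checkpoint of job state, only the per-section offload objects.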
35 Citations
25 Claims
1. A method implemented in a computing environment including a compute entity comprising a source communicatively coupled to a plurality of compute entities comprising sinks, the method comprising:
managing execution of a job comprising executable code using the source;
offloading sections of the executable code to the plurality of sinks;
storing, for each section of code that is offloaded to a sink, offload context information identifying the section of code that is offloaded and the sink it is offloaded to;
receiving, for offloaded sections of code, results generated by the sinks to which the sections of code were offloaded;
detecting that a sink has failed to successfully execute a section of code that was offloaded to the sink, and in response thereto, retrieving the offload context information corresponding to the section of code offloaded to the sink; and
offloading the section of code to another sink for execution.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
12. A server platform comprising:
a host processor coupled to host memory;
a plurality of expansion slots, communicatively coupled to the host processor;
one or more many integrated core (MIC) devices installed in respective expansion slots, each MIC device including a plurality of processor cores and on-board memory; and
a network adaptor, either installed in an expansion slot or implemented as a component that is communicatively coupled to the host processor;
wherein the server platform further includes software instructions configured to be executed on the host processor and a plurality of the processor cores in the MIC device to enable the server platform to:
configure the host processor as a source and at least a portion of the plurality of processor cores in the MIC device as sinks;
configure memory mappings between the on-board MIC memory and the host memory;
manage execution of a job comprising executable code on the host processor;
offload sections of the executable code to the plurality of sinks;
transmit for storage on a non-volatile storage device accessible via a network coupled to the network adaptor, for each section of code that is offloaded to a sink, offload context information identifying the section of code that is offloaded and the sink it is offloaded to;
execute the offloaded code sections on the sinks to generate result data;
store the result data in memory buffers accessible to the host processor;
detect that a sink has failed to successfully execute a section of code that was offloaded to the sink, and in response thereto, retrieve the previously stored offload context information corresponding to the section of code offloaded to the sink; and
offload the section of code to another sink for execution.
- View Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20, 21)
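Both the abstract and claim 12 require detecting that a sink has failed. The claims do not specify the detection mechanism; one common approach, assumed here purely for illustration, is heartbeat monitoring, where each sink reports periodically and a sink whose last heartbeat is older than a timeout is treated as failed:

```python
# Hypothetical heartbeat-based sink health monitor; the detection
# mechanism is an assumption, not specified by the claims.
import time


class SinkMonitor:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}            # sink_id -> timestamp of last heartbeat

    def heartbeat(self, sink_id, now=None):
        # Sinks call this periodically while executing offloaded code.
        self.last_seen[sink_id] = time.monotonic() if now is None else now

    def failed_sinks(self, now=None):
        # Any sink silent for longer than the timeout is considered failed,
        # triggering retrieval of its offload context and re-offload.
        now = time.monotonic() if now is None else now
        return [s for s, t in self.last_seen.items() if now - t > self.timeout_s]
```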
22. At least one tangible non-transitory machine-readable medium having instructions stored thereon configured to be executed by compute entities in a server platform including,
a host processor comprising a first compute entity;
host memory coupled to the host processor;
a plurality of expansion slots, communicatively coupled to the host processor;
one or more many integrated core (MIC) devices installed in respective expansion slots, each MIC device including a plurality of processor cores comprising compute entities and on-board memory; and
a network adaptor, either installed in an expansion slot or implemented as a component that is communicatively coupled to the host processor;
wherein execution of the instructions by the host processor and processor cores in the one or more MIC devices enables the server platform to:
configure the host processor as a source and at least a portion of the plurality of processor cores in the one or more MIC devices as sinks;
configure, for each MIC device, memory mappings between the on-board MIC memory of the MIC device and the host memory;
manage execution of a job comprising executable code on the host processor;
offload sections of the executable code to the plurality of sinks;
transmit for storage on a non-volatile storage device accessible via a network coupled to the network adaptor, for each section of code that is offloaded to a sink, offload context information identifying the section of code that is offloaded and the sink it is offloaded to;
execute the offloaded code sections on the sinks to generate result data;
store the result data in memory buffers accessible to the host processor;
detect that a sink has failed to successfully execute a section of code that was offloaded to the sink, and in response thereto, retrieve the previously stored offload context information corresponding to the section of code offloaded to the sink; and
offload the section of code to another sink for execution.
- View Dependent Claims (23, 24, 25)
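Claims 12 and 22 both configure memory mappings so that sinks can store result data in memory buffers accessible to the host processor; per the abstract, a sink returns results via a direct write to a buffer specified in the offload context information. A minimal stand-in for that direct-write pattern, using a shared `bytearray` in place of a real host-visible mapped region (actual MIC offload would use device/host memory mapping, not Python objects):

```python
# Illustrative stand-in for a host-visible mapped result buffer.
import struct


def make_result_buffer(size):
    # Stands in for a region of MIC on-board memory mapped into host memory.
    return bytearray(size)


def sink_write_result(buffer, offset, value):
    # The sink writes its result directly at the offset named in the
    # offload context information, as a little-endian signed 64-bit value.
    struct.pack_into("<q", buffer, offset, value)


def source_read_result(buffer, offset):
    # The source reads the result from the same buffer without a copy step.
    return struct.unpack_from("<q", buffer, offset)[0]
```

The design point the claims capture is that the buffer location travels with the offload context, so a replacement sink can write its result to the same place after a failure.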
Specification