Resilient programming frameworks for handling failures in parallel programs
First Claim
1. An information processing system capable of supporting resilient execution of applications written in a programming language with exception handling, the information processing system comprising:
memory;
persistent memory for storing data and computer instructions;
a resilient store, communicatively coupled with the memory and the persistent memory, wherein application state information stored in the resilient store can be accessed in response to detection of a failure of an application executing in the information processing system;
a resilient executor, communicatively coupled with the memory and the persistent memory, for executing computations of applications while detecting failures in the execution of the computations; and
a processor, communicatively coupled with the resilient executor, the resilient store, the memory, the persistent memory, and wherein the processor, responsive to executing computer instructions, performs operations comprising:
periodically checkpointing an application state in the resilient store;
executing, with the resilient executor, computations of the application while detecting failures in the execution of the computations, wherein the resilient executor includes computer code which is part of the application;
restoring, based on the resilient executor detecting a failure in the execution of a computation of the application by catching with said computer code at least one exception, application state information for the application from a checkpoint in the resilient store; and
resuming, with the resilient executor, execution of the computation of the application with the restored application state information.
Abstract
An information processing system, computer readable storage medium, and method for supporting resilient execution of computer programs. A method provides a resilient store wherein information in the resilient store can be accessed in the event of a failure. The method periodically checkpoints application state in the resilient store. A resilient executor comprises software which executes applications by catching failures. The method uses the resilient executor to execute at least one application. In response to the resilient executor detecting a failure, the method restores application state information to the at least one application from a checkpoint stored in the resilient store, and the resilient executor resumes execution of the at least one application with the restored application state information.
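The checkpoint, catch, restore, and resume cycle described in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation: the names `ResilientStore`, `ResilientExecutor`, `make_flaky_step`, `checkpoint_every`, and `max_restores` are hypothetical, the store is a plain in-memory object standing in for storage that would survive a real failure, and a `RuntimeError` stands in for whatever exception the application's language raises on failure.

```python
import copy


class ResilientStore:
    """Hypothetical stand-in for a store that remains readable after a failure."""

    def __init__(self):
        self._checkpoint = None

    def save(self, state):
        # Deep-copy so later changes to the live state cannot alter the checkpoint.
        self._checkpoint = copy.deepcopy(state)

    def load(self):
        return copy.deepcopy(self._checkpoint)


class ResilientExecutor:
    """Runs a step function, checkpointing periodically and recovering on exceptions."""

    def __init__(self, store, checkpoint_every=2, max_restores=5):
        self.store = store
        self.checkpoint_every = checkpoint_every
        self.max_restores = max_restores
        self.restores = 0

    def run(self, state, step, total_steps):
        self.store.save(state)  # initial checkpoint before any work
        while state["i"] < total_steps:
            try:
                step(state)  # may raise to signal a failed computation
                state["i"] += 1
                if state["i"] % self.checkpoint_every == 0:
                    self.store.save(state)  # periodic checkpoint
            except RuntimeError:
                # Failure detected by catching an exception: restore and resume.
                self.restores += 1
                if self.restores > self.max_restores:
                    raise
                state = self.store.load()
        return state


def make_flaky_step(fail_at):
    """Build a step that fails once at each listed step index (assumed failure model)."""
    pending = set(fail_at)

    def step(state):
        if state["i"] in pending:
            pending.discard(state["i"])
            raise RuntimeError(f"simulated failure at step {state['i']}")
        state["total"] += state["i"]

    return step


if __name__ == "__main__":
    executor = ResilientExecutor(ResilientStore())
    final = executor.run({"i": 0, "total": 0}, make_flaky_step({3}), total_steps=5)
    print(final, executor.restores)
```

In this sketch the failure at step 3 rolls the computation back to the checkpoint taken after step 2, so one step is re-executed before the run completes. A shorter checkpoint interval reduces re-executed work at the cost of more frequent saves.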
6 Claims
3. A computer readable storage medium, comprising computer instructions which, responsive to being executed by a processor, cause the processor to perform operations for supporting resilient execution of applications written in a programming language with exception handling, the operations comprising:
providing a resilient store wherein information in the resilient store can be accessed in the event of a failure;
periodically checkpointing an application state in the resilient store;
providing a resilient executor which comprises software which executes applications while detecting failures of the executing applications;
using the resilient executor to execute at least one application, wherein the resilient executor includes computer code which is part of the at least one application; and
in response to the resilient executor detecting a failure by catching with said computer code at least one exception, restoring application state information to the at least one application from a checkpoint stored in the resilient store; and
the resilient executor resuming execution of the at least one application with the restored application state information.
Specification