Progressive retry method and apparatus having reusable software modules for software failure recovery in multi-process message-passing applications
First Claim
1. An apparatus for bypassing faults in an application process, said fault bypass apparatus comprising:
at least one processor for executing a plurality of concurrent application processes;
a watchdog which includes an error detection monitor for monitoring one or more of said application processes and a restart subsystem for executing a progressive retry recovery algorithm for bypassing a fault detected by said monitor in one of said application processes; and
a memory device for storing a plurality of fault tolerant library functions which may be invoked from one or more of said application processes to make said application fault tolerant, said fault tolerant library including;
a checkpoint function for periodically performing a checkpoint of data associated with an application process;
a recover function for restoring the checkpointed data from the nonvolatile memory during a recovery mode;
a fault tolerant write function for logging each output message generated by said application processes in a sender log file before said message is transmitted by the application process, wherein said fault tolerant write function includes a mechanism for suppressing one or more of said outputs during a recovery mode; and
a fault tolerant read function for logging each input message received by said application processes in a receiver log file before they are processed by the receiving application process, wherein said fault tolerant read function will read data from a communication channel and log the received message in the receiver log file in a normal mode, and wherein the input data will be read from said receiver log file during a recovery mode.
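The fault tolerant write and read functions recited above can be illustrated with a minimal sketch. It is not the patented implementation: the class name `FTProcess`, the method names `ft_write`, `ft_read` and `begin_recovery`, the in-memory lists standing in for log files, and the rule of suppressing outputs by position in the sender log are all assumptions made for illustration, and the sketch presumes the process regenerates its outputs deterministically during replay.

```python
from collections import deque


class FTProcess:
    """Hypothetical stand-in for a process using logged reads and writes."""

    def __init__(self, inbox, outbox):
        self.inbox = inbox          # deque acting as the incoming channel
        self.outbox = outbox        # deque acting as the outgoing channel
        self.sender_log = []        # stands in for the sender log file
        self.receiver_log = []      # stands in for the receiver log file
        self.recovering = False
        self._replay_in = 0         # next receiver-log entry to replay
        self._replay_out = 0        # next sender-log entry to suppress

    def begin_recovery(self):
        """Enter recovery mode and restart replay from the start of the logs."""
        self.recovering = True
        self._replay_in = 0
        self._replay_out = 0

    def ft_write(self, message):
        """Log, then transmit. Returns False when the output is suppressed."""
        if self.recovering and self._replay_out < len(self.sender_log):
            # Assumption: this output was already logged and sent before
            # the fault, so it is suppressed rather than sent twice.
            self._replay_out += 1
            return False
        self.sender_log.append(message)   # log before transmitting
        self.outbox.append(message)
        return True

    def ft_read(self):
        """Return the next input, from the log while it has unreplayed entries."""
        if self.recovering and self._replay_in < len(self.receiver_log):
            message = self.receiver_log[self._replay_in]
            self._replay_in += 1
            return message
        message = self.inbox.popleft()     # normal mode: read the channel
        self.receiver_log.append(message)  # log before processing
        return message


inbox, outbox = deque(["a", "b"]), deque()
proc = FTProcess(inbox, outbox)
proc.ft_write(proc.ft_read().upper())     # normal mode: sends "A"
proc.begin_recovery()
replayed = proc.ft_read()                 # read back from the receiver log
suppressed = not proc.ft_write(replayed.upper())
```

Once both logs are exhausted, the same calls fall through to the normal-mode path, so the application code does not need separate branches for the two modes.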
Abstract
A progressive retry recovery system based on checkpointing, message logging, rollback, message replaying and message reordering is disclosed. The disclosed progressive retry system minimizes the number of involved processes as well as the total rollback distance. The progressive retry recovery system includes a fault tolerant software library which provides a number of functions which may be invoked by application processes to implement fault tolerance. Fault tolerant functions are provided for allowing an application process to generate a heartbeat message at specified intervals indicating that the application process is still active. In addition, fault tolerance implementation functions are provided for specifying critical memory, for executing checkpoints to store backup copies of critical data, and for restoring critical data during a recovery. In addition, functions are provided which process messages that are sent or received by an application process and maintain logs of the sent and received messages. The progressive retry recovery method consists of a number of retry steps which gradually increase the scope of the rollback when a previous retry step fails.
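The heartbeat mechanism mentioned in the abstract can be sketched as follows. The abstract does not say how a missing heartbeat is judged, so the class `HeartbeatMonitor`, its `tolerance` multiplier, and the explicit `now` timestamps are hypothetical choices made to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical watchdog-side record of the latest heartbeat per process."""

    def __init__(self, interval, tolerance=2.0):
        self.interval = interval      # expected seconds between heartbeats
        self.tolerance = tolerance    # assumed slack factor before declaring a fault
        self.last_seen = {}

    def heartbeat(self, pid, now):
        """Record that process `pid` reported in at time `now`."""
        self.last_seen[pid] = now

    def failed(self, now):
        """Return the processes whose last heartbeat is older than the limit."""
        limit = self.interval * self.tolerance
        return sorted(p for p, t in self.last_seen.items() if now - t > limit)


monitor = HeartbeatMonitor(interval=1.0)
monitor.heartbeat("proc_a", now=0.0)
monitor.heartbeat("proc_b", now=0.0)
monitor.heartbeat("proc_a", now=2.5)   # proc_b stays silent
suspects = monitor.failed(now=3.0)
```

In a running system the timestamps would likely come from a monotonic clock, and a process appearing in `failed` would be handed to the restart subsystem.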
178 Citations
16 Claims
1. An apparatus for bypassing faults in an application process, said fault bypass apparatus comprising:
at least one processor for executing a plurality of concurrent application processes;
a watchdog which includes an error detection monitor for monitoring one or more of said application processes and a restart subsystem for executing a progressive retry recovery algorithm for bypassing a fault detected by said monitor in one of said application processes; and
a memory device for storing a plurality of fault tolerant library functions which may be invoked from one or more of said application processes to make said application fault tolerant, said fault tolerant library including;
a checkpoint function for periodically performing a checkpoint of data associated with an application process;
a recover function for restoring the checkpointed data from the nonvolatile memory during a recovery mode;
a fault tolerant write function for logging each output message generated by said application processes in a sender log file before said message is transmitted by the application process, wherein said fault tolerant write function includes a mechanism for suppressing one or more of said outputs during a recovery mode; and
a fault tolerant read function for logging each input message received by said application processes in a receiver log file before they are processed by the receiving application process, wherein said fault tolerant read function will read data from a communication channel and log the received message in the receiver log file in a normal mode, and wherein the input data will be read from said receiver log file during a recovery mode.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
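The checkpoint and recover functions of claim 1 might look roughly like the sketch below. The claim does not specify a storage format or write strategy, so the class `CriticalState`, the JSON file, and the write-then-rename step are assumptions chosen for illustration.

```python
import json
import os
import tempfile


class CriticalState:
    """Hypothetical holder for the data an application marks as critical."""

    def __init__(self, path):
        self.path = path   # checkpoint file on nonvolatile storage
        self.data = {}

    def checkpoint(self):
        """Save a backup copy of the critical data."""
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())       # push the bytes to the device
        # Assumption: replacing the file in one step keeps the previous
        # checkpoint intact if the process dies while writing.
        os.replace(tmp, self.path)

    def recover(self):
        """Restore the critical data from the last checkpoint."""
        with open(self.path) as f:
            self.data = json.load(f)


workdir = tempfile.mkdtemp()
state = CriticalState(os.path.join(workdir, "ckpt.json"))
state.data["count"] = 3
state.checkpoint()
state.data["count"] = 99   # work done after the checkpoint
state.recover()            # rollback discards it
```

The periodic aspect of the claim would be handled by whatever calls `checkpoint`, for example a timer or a count of processed messages.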
9. A method for making an application process fault tolerant, said fault tolerant method comprising the steps of:
invoking functions in said application function from a fault tolerant library to make said application fault tolerant, said fault tolerant library including;
a checkpoint function for periodically performing a checkpoint of data associated with an application process;
a recover function for restoring the checkpointed data from the nonvolatile memory during a recovery mode;
a fault tolerant write function for logging each output message generated by said application processes in a sender log file before said message is transmitted by the application process, wherein said fault tolerant write function includes a mechanism for suppressing one or more of said outputs during a recovery mode; and
a fault tolerant read function for logging each input message received by said application processes in a receiver log file before they are processed by the receiving application process, wherein said fault tolerant read function will read data from a communication channel and log the received message in the receiver log file in a normal mode, and wherein the input data will be read from said receiver log file during a recovery mode;
monitoring said application process for software faults; and
executing a progressive retry recovery algorithm upon detection of a fault during said monitoring step to recover said faulty process, said progressive retry algorithm including a plurality of retry steps which gradually increase the scope of the recovery roll back when a previous retry step fails to bypass said detected fault.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
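The escalation recited in claim 9 can be sketched as a loop over retry steps ordered from the narrowest rollback scope to the widest. The claim text here does not enumerate the steps, so the function `progressive_retry`, the step names, and the `fault_bypassed` check are hypothetical placeholders.

```python
def progressive_retry(steps, fault_bypassed):
    """Run retry steps in order of increasing scope until one succeeds.

    `steps` is a list of (name, callable) pairs; `fault_bypassed` is a
    callable returning True once the fault no longer recurs. Returns the
    name of the step that worked (or None) and the names attempted.
    """
    attempted = []
    for name, step in steps:
        attempted.append(name)
        step()                      # perform this level of rollback and retry
        if fault_bypassed():
            return name, attempted
    return None, attempted          # every step failed; escalate elsewhere


# Toy usage: each placeholder step only records how wide the rollback was.
scope = {"level": 0}

def make_step(level):
    def step():
        scope["level"] = level
    return step

steps = [
    ("replay_logged_messages", make_step(1)),    # assumed narrowest step
    ("reorder_logged_messages", make_step(2)),
    ("roll_back_more_processes", make_step(3)),  # assumed widest step
]
# Pretend the fault disappears once the rollback reaches level 2.
result = progressive_retry(steps, lambda: scope["level"] >= 2)
```

Ordering the steps from cheapest to most disruptive reflects the abstract's stated goal of minimizing both the number of involved processes and the total rollback distance.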
Specification