Progressive retry method and apparatus for software failure recovery in multi-process message-passing applications
First Claim
1. A method for bypassing software faults in a system executing a plurality of concurrent processes, said processes passing messages between one another, said fault tolerant method comprising the steps of:
- monitoring one or more of said processes for software faults;
periodically performing a checkpoint of critical data associated with each of said monitored processes;
logging said messages that are received by each monitored process in a message log associated with each monitored process; and
performing a progressive retry algorithm upon detection of a fault during said monitoring step to recover said faulty process, said progressive retry algorithm including a plurality of retry steps which gradually increase the scope of the recovery roll back when a previous retry step fails to bypass said detected fault.
10 Assignments
0 Petitions
Accused Products
Abstract
A progressive retry recovery system based on checkpointing, message logging, rollback, message replaying and message reordering is disclosed. The disclosed progressive retry system minimizes the number of involved processes as well as the total rollback distance. The progressive retry method consists of a number of retry steps which gradually increase the scope of the rollback when a previous retry step fails. Step one attempts to bypass a software fault by having the faulty process replay the messages in its message log. Step two will attempt to bypass the software fault by having the faulty process reorder and then replay the messages in its message log. Step three will attempt to bypass the software fault by having the processes which have sent messages to the faulty process resend those messages to the faulty process. Step four will attempt to bypass the software fault by having the processes which have sent messages to the faulty process reorder and then resend their in-transit messages. Step five will attempt to bypass the software fault by implementing a large scope roll back of all monitored processes to the latest consistent global checkpoint. A mechanism is included for verifying the piecewise deterministic assumption.
-
Citations
20 Claims
-
1. A method for bypassing software faults in a system executing a plurality of concurrent processes, said processes passing messages between one another, said fault tolerant method comprising the steps of:
-
monitoring one or more of said processes for software faults; periodically performing a checkpoint of critical data associated with each of said monitored processes; logging said messages that are received by each monitored process in a message log associated with each monitored process; and performing a progressive retry algorithm upon detection of a fault during said monitoring step to recover said faulty process, said progressive retry algorithm including a plurality of retry steps which gradually increase the scope of the recovery roll back when a previous retry step fails to bypass said detected fault. - View Dependent Claims (2, 3, 4, 5)
-
-
6. A method for bypassing software faults in a system executing a plurality of concurrent processes, said processes passing messages between one another, said fault tolerant method comprising the steps of:
-
monitoring one or more of said processes for software faults; periodically performing a checkpoint of critical data associated with each of said monitored processes; logging said messages that are received by each monitored process in a message log associated with each monitored process; and performing a progressive retry algorithm upon detection of a fault during said monitoring step to recovery said faulty process, said progressive retry algorithm comprising the steps of; performing a receiver replay step which will restore said faulty process to said latest checkpoint associated with said process and then replay said messages in said message log associated with said faulty process that were received by said faulty process since said latest checkpoint up to the process state of the faulty process at the point of said detected fault; performing a receiver reorder step if said receiver replay step does not bypass said software fault, wherein at least two of said received messages in said message log of said faulty process will be reordered before being replayed; performing a sender replay step if said receiver reorder step does not bypass said software fault, wherein one or more of said messages in said message log associated with said faulty process that were received by said faulty process after its latest checkpoint will be discarded and wherein each of said sending processes that sent said discarded messages to said faulty process will resend messages to said faulty process during said sender replay step; performing a sender reorder step if said sender replay step does not bypass said software fault, wherein each of said sending processes that sent messages to said faulty process which were received by said faulty process since its latest checkpoint will reorder at least two of said messages in its message log before replaying said messages; and performing a large scope roll back step if said sender reorder step does not bypass said software fault, said large scope roll back step rolling back each of said monitored processes to a latest consistent global checkpoint. - View Dependent Claims (7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
-
-
19. A method for bypassing software faults in a system executing a plurality of concurrent processes, said processes communicating by means of a message passing mechanism, said fault tolerant method comprising the steps of:
-
monitoring one or more of said processes for software faults; periodically performing a checkpoint of critical data associated with each of said monitored processes; logging said messages that are received by each of said monitored processes in a message log associated with each monitored process, said log maintaining the processing order in which each of said messages were processed by said associated process, the receipt and logging of each message by said associated process forms a logical checkpoint; and performing a progressive retry algorithm upon detection of a fault during said monitoring step, said progressive retry algorithm employing a roll back propagation algorithm to compute a recovery line at the latest available actual or logical checkpoint for each of said monitored processes which has not been discarded, said roll back propagation algorithm enforcing the roll back propagation rule, said messages in said message logs that were received and processed by said associated processes between said latest actual checkpoint of said associated process and said computed recovery line being deterministic messages, said messages in said message logs that were sent before said computed recovery line and received after said recovery line being in-transit messages, said progressive retry algorithm comprising the steps of; performing a receiver replay step to restore said faulty process its latest checkpoint and to replay said messages in said message log associated with said faulty process that were received by said faulty process since said latest checkpoint up to the point of the detected fault; performing a receiver reorder step if said receiver replay step does not bypass said software fault, said receiver reorder step further comprising the steps of; discarding said processing order information for said messages in said message log associated with said faulty process that were received after its latest checkpoint; computing said recovery line; reordering said in-transit messages in said message log of said faulty process; and replaying said deterministic messages and replaying or resubmitting said in-transit messages in said message logs of each of said monitored processes whose current process state is not at said recovery line; performing a sender replay step if said receiver reorder step does not bypass said software fault, said sender replay step further comprising the steps of; discarding said messages in said message log associated with said faulty process that were received by said faulty process after the last checkpoint performed for said faulty process; and replaying said deterministic messages and resubmitting said in-transit messages in said message logs of each of said monitored processes whose current process state is not at said recovery line, wherein each of said sending processes that sent said discarded messages to said faulty process during initial processing will resend messages to said faulty process during said sender replay step; performing a sender reorder step if said sender replay step does not bypass said software fault, said sender reorder step further comprising the steps of; discarding said processing order information for each of those messages in said message logs associated with said sending process that were received by said sending process after the logical checkpoint which is before the first message sent by said sending process to said faulty process since the latest checkpoint associated with said faulty process; recomputing said recovery line; reordering said in-transit messages in said message logs associated with said sending processes; and replaying said deterministic messages and replaying or resubmitting said in-transit messages in said message logs of each of said monitored processes whose current process state is not at said recovery line; performing a large scope roll back step if said sender reorder step does not bypass said software fault, said large scope roll back step rolling back each of said monitored processes to the latest consistent global checkpoint. - View Dependent Claims (20)
-
Specification