Progressive retry method and apparatus for software failure recovery in multi-process message-passing applications

US 5,590,277 A
Filed: 06/22/1994
Issued: 12/31/1996
Est. Priority Date: 06/22/1994
Status: Expired due to Term

Abstract

A progressive retry recovery system based on checkpointing, message logging, rollback, message replaying and message reordering is disclosed. The disclosed progressive retry system minimizes the number of involved processes as well as the total rollback distance. The progressive retry method consists of a number of retry steps which gradually increase the scope of the rollback when a previous retry step fails. Step one attempts to bypass a software fault by having the faulty process replay the messages in its message log. Step two will attempt to bypass the software fault by having the faulty process reorder and then replay the messages in its message log. Step three will attempt to bypass the software fault by having the processes which have sent messages to the faulty process resend those messages to the faulty process. Step four will attempt to bypass the software fault by having the processes which have sent messages to the faulty process reorder and then resend their in-transit messages. Step five will attempt to bypass the software fault by implementing a large scope roll back of all monitored processes to the latest consistent global checkpoint. A mechanism is included for verifying the piecewise deterministic assumption.
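
The escalation described in the abstract can be sketched as a loop that tries each retry step in order of increasing rollback scope and stops at the first one that bypasses the fault. This is a minimal illustration only, not the patented implementation: the step names are taken from the abstract, while `progressive_retry`, `make_steps`, and the simulated `bypassed_at` outcome are hypothetical names introduced here.

```python
from typing import Callable, List, Optional, Tuple

# A retry step is a name plus a callable that returns True when the fault
# was bypassed. The callables are hypothetical stand-ins for real recovery.
RetryStep = Tuple[str, Callable[[], bool]]


def progressive_retry(steps: List[RetryStep]) -> Optional[str]:
    """Run the steps in order and return the name of the first that succeeds.

    Returns None when every step, including the widest one, fails.
    """
    for name, attempt in steps:
        if attempt():
            return name
    return None


def make_steps(bypassed_at: Optional[str], trace: List[str]) -> List[RetryStep]:
    """Build the five steps; `bypassed_at` simulates which step succeeds."""
    names = [
        "receiver_replay",       # step one: replay the faulty process's log
        "receiver_reorder",      # step two: reorder, then replay that log
        "sender_replay",         # step three: senders resend their messages
        "sender_reorder",        # step four: senders reorder, then resend
        "large_scope_rollback",  # step five: roll back all monitored processes
    ]

    def make(name: str) -> Callable[[], bool]:
        def attempt() -> bool:
            trace.append(name)  # record that this step was tried
            return name == bypassed_at
        return attempt

    return [(name, make(name)) for name in names]
```

Calling `progressive_retry(make_steps("sender_replay", trace))` with an empty `trace` list tries the first three steps only, so later and wider steps are never run once an earlier one succeeds.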

20 Claims

1. A method for bypassing software faults in a system executing a plurality of concurrent processes, said processes passing messages between one another, said fault tolerant method comprising the steps of:
- monitoring one or more of said processes for software faults;
  
  periodically performing a checkpoint of critical data associated with each of said monitored processes;
  
  logging said messages that are received by each monitored process in a message log associated with each monitored process; and
  
  performing a progressive retry algorithm upon detection of a fault during said monitoring step to recover said faulty process, said progressive retry algorithm including a plurality of retry steps which gradually increase the scope of the recovery roll back when a previous retry step fails to bypass said detected fault.
  - 2. The fault tolerant method of claim 1, wherein said progressive retry algorithm comprises the steps of:
    - performing a receiver replay step to restore said faulty process to said latest checkpoint associated with said process and to replay said messages in said message log associated with said faulty process that were received by said faulty process since said latest checkpoint up to said point of said detected fault;
      
      performing a receiver reorder step if said receiver replay step does not bypass said software fault, wherein at least two of said received messages in said message log of said faulty process will be reordered before being replayed;
      
      performing a sender replay step if said receiver reorder step does not bypass said software fault, wherein one or more of said messages in said message log associated with said faulty process that were received by said faulty process after its latest checkpoint will be discarded and wherein each of said sending processes that sent said discarded messages to said faulty process will resend messages to said faulty process during said sender replay step;
      
      performing a sender reorder step if said sender replay step does not bypass said software fault, wherein each of said sending processes that sent messages to said faulty process which were received by the faulty process since its latest checkpoint will reorder at least two of said messages in its message log before replaying said messages; and
      
      performing a large scope roll back step if said sender reorder step does not bypass said software fault, said large scope roll back step rolling back each of said monitored processes to a latest consistent global checkpoint.
  - 3. The fault tolerant method of claim 1, wherein said step of logging said received messages logs the message content and an indication of the processing order of said associated message.
  - 4. The fault tolerant method of claim 2, wherein said step of logging said messages further includes the step of logging messages that are sent by each monitored process in a sender message log associated with each monitored process and wherein said messages that are regenerated during said recovery are compared to said messages stored in said sender log during initial processing to verify the piecewise deterministic assumption.
  - 5. The fault tolerant method of claim 2, wherein said progressive retry algorithm minimizes the scope of said roll back recovery including the number of said processes involved in said roll back and the total roll back distance.
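
Claims 3 and 4 describe logging each received message with its processing order, and comparing regenerated outgoing messages against a sender log to verify the piecewise deterministic assumption. A rough sketch of that check follows. The `MessageLog` class, the `replay_is_deterministic` function, and the toy `summing_handler` are hypothetical names chosen for illustration, and the process is modeled, as an assumption, as a pure function from (state, message) to (new state, outgoing messages).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class MessageLog:
    """Receiver log: message content plus the order in which it was processed."""
    entries: List[Tuple[int, str]] = field(default_factory=list)

    def record(self, content: str) -> None:
        # The sequence number is the processing-order indication.
        self.entries.append((len(self.entries), content))

    def in_processing_order(self) -> List[str]:
        return [content for _, content in sorted(self.entries)]


def replay_is_deterministic(
    handler: Callable[[int, str], Tuple[int, List[str]]],
    checkpoint_state: int,
    receiver_log: MessageLog,
    sender_log: List[str],
) -> bool:
    """Replay logged messages from the checkpoint and compare the regenerated
    outgoing messages with those recorded in the sender log."""
    state = checkpoint_state
    regenerated: List[str] = []
    for content in receiver_log.in_processing_order():
        state, outgoing = handler(state, content)
        regenerated.extend(outgoing)
    return regenerated == sender_log


def summing_handler(state: int, content: str) -> Tuple[int, List[str]]:
    """Toy process: keeps a running total and reports it after each message."""
    new_state = state + int(content)
    return new_state, [f"total={new_state}"]
```

If the regenerated messages differ from the sender log, the replay did not follow the original execution, so the piecewise deterministic assumption does not hold for that interval.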

6. A method for bypassing software faults in a system executing a plurality of concurrent processes, said processes passing messages between one another, said fault tolerant method comprising the steps of:
- monitoring one or more of said processes for software faults;
  
  periodically performing a checkpoint of critical data associated with each of said monitored processes;
  
  logging said messages that are received by each monitored process in a message log associated with each monitored process; and
  
  performing a progressive retry algorithm upon detection of a fault during said monitoring step to recover said faulty process, said progressive retry algorithm comprising the steps of:
  
  performing a receiver replay step which will restore said faulty process to said latest checkpoint associated with said process and then replay said messages in said message log associated with said faulty process that were received by said faulty process since said latest checkpoint up to the process state of the faulty process at the point of said detected fault;
  
  performing a receiver reorder step if said receiver replay step does not bypass said software fault, wherein at least two of said received messages in said message log of said faulty process will be reordered before being replayed;
  
  performing a sender replay step if said receiver reorder step does not bypass said software fault, wherein one or more of said messages in said message log associated with said faulty process that were received by said faulty process after its latest checkpoint will be discarded and wherein each of said sending processes that sent said discarded messages to said faulty process will resend messages to said faulty process during said sender replay step;
  
  performing a sender reorder step if said sender replay step does not bypass said software fault, wherein each of said sending processes that sent messages to said faulty process which were received by said faulty process since its latest checkpoint will reorder at least two of said messages in its message log before replaying said messages; and
  
  performing a large scope roll back step if said sender reorder step does not bypass said software fault, said large scope roll back step rolling back each of said monitored processes to a latest consistent global checkpoint.
  - 7. The fault tolerant method of claim 6, wherein said step of logging said messages that are received by each of said monitored processes, further includes the step of maintaining the processing order in which each of said messages were processed by said associated process and wherein the receipt and logging of each message by said associated process forms a logical checkpoint.
  - 8. The fault tolerant method of claim 7, wherein said progressive retry algorithm employs a roll back propagation algorithm to compute a recovery line at the latest available actual or logical checkpoint for each of said monitored processes which has not been discarded, said roll back propagation algorithm enforcing a roll back propagation rule, said messages in said message logs that were received and processed by said associated processes between said latest actual checkpoint of said associated process and said computed recovery line being deterministic messages, said messages in said message logs that were sent before said computed recovery line and received after said recovery line being in-transit messages.
  - 9. The fault tolerant method of claim 8, wherein said receiver reorder step further comprises the steps of:
    - discarding said processing order information for said messages in said message log associated with said faulty process that were received after its latest checkpoint;
      
      computing said recovery line;
      
      reordering said in-transit messages in said message log of said faulty process; and
      
      replaying said deterministic messages and replaying or resubmitting said in-transit messages in said message logs of each of said monitored processes whose current process state is not at said computed recovery line.
  - 10. The fault tolerant method of claim 8, wherein said sender replay step further comprises the steps of:
    - discarding said messages in said message log associated with said faulty process that were received by said faulty process after the last checkpoint performed for said faulty process; and
      
      replaying said deterministic messages and resubmitting said in-transit messages in said message logs of each of said monitored processes whose current process state is not at said recovery line, wherein each of said sending processes that sent said discarded messages to said faulty process during initial processing will resend messages to said faulty process during said sender replay step.
  - 11. The fault tolerant method of claim 8, wherein said sender reorder step further comprises the steps of:
    - discarding said processing order information for each of those messages in said message logs associated with said sending process that were received by said sending process after the logical checkpoint which is before the first message sent by said sending process to said faulty process since the latest checkpoint associated with said faulty process;
      
      recomputing said recovery line;
      
      reordering said in-transit messages in said message logs associated with said sending processes; and
      
      replaying said deterministic messages and replaying or resubmitting said in-transit messages in said message logs of each of said monitored processes whose current process state is not at said recovery line.
  - 12. The fault tolerant method of claim 7, wherein said step of logging said messages further includes the step of logging messages that are sent by each monitored process in a sender message log associated with each monitored process.
  - 13. The fault tolerant method of claim 12, wherein said messages that are regenerated during said recovery are compared to said messages stored in said sender log during initial processing to verify the piecewise deterministic assumption.
  - 14. The fault tolerant method of claim 6, wherein said progressive retry algorithm minimizes the scope of said roll back recovery including the number of said processes involved in said recovery and the total roll back distance.
  - 15. The fault tolerant method of claim 8, wherein said roll back propagation rule requires that if a process which sends a message rolls back to its latest checkpoint and unsends a message, the process which receives that message must also roll back to unreceive the message.
  - 16. The fault tolerant method of claim 8, wherein said processes are executing on a plurality of processing nodes and wherein if one of said processing nodes fails, said processes executing on said failed node may be restarted on another node.
  - 17. The fault tolerant method of claim 16, wherein said checkpoint data and message logs are stored on backup nodes.
  - 18. The fault tolerant method of claim 8, wherein steps of maintaining a consistent global checkpoint and computing said recovery line are performed by a central recovery coordinator.
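
The roll back propagation rule of claim 15 and the message categories of claim 8 can be illustrated with a simplified model. In the sketch below, which is an assumption-laden illustration rather than the patent's algorithm, each process's history is a sequence of numbered events, a recovery line maps each process to the number of events it keeps, and `Msg`, `compute_recovery_line`, and `classify` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Msg:
    sender: str
    send_pos: int   # sender's event index at which the message was sent
    receiver: str
    recv_pos: int   # receiver's event index at which it was processed


def compute_recovery_line(initial: Dict[str, int], msgs: List[Msg]) -> Dict[str, int]:
    """Lower the line until no kept receive depends on an undone send."""
    line = dict(initial)
    changed = True
    while changed:
        changed = False
        for m in msgs:
            unsent = m.send_pos >= line[m.sender]
            received = m.recv_pos < line[m.receiver]
            if unsent and received:
                # Propagation rule: the receiver must "unreceive" the message.
                line[m.receiver] = m.recv_pos
                changed = True
    return line


def classify(m: Msg, line: Dict[str, int], checkpoint: Dict[str, int]) -> str:
    """Label a message relative to the recovery line and actual checkpoints."""
    sent = m.send_pos < line[m.sender]
    received = m.recv_pos < line[m.receiver]
    if sent and not received:
        return "in-transit"      # sent before the line, received after it
    if sent and received and m.recv_pos >= checkpoint[m.receiver]:
        return "deterministic"   # processed between checkpoint and line
    return "other"               # covered by a checkpoint, or fully undone
```

For example, if process A is rolled back so that it keeps only two events, a message A sent at event 3 is unsent, so its receiver B must roll back to just before it processed that message, and the rollback may propagate further from B.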

19. A method for bypassing software faults in a system executing a plurality of concurrent processes, said processes communicating by means of a message passing mechanism, said fault tolerant method comprising the steps of:
- monitoring one or more of said processes for software faults;
  
  periodically performing a checkpoint of critical data associated with each of said monitored processes;
  
  logging said messages that are received by each of said monitored processes in a message log associated with each monitored process, said log maintaining the processing order in which each of said messages were processed by said associated process, the receipt and logging of each message by said associated process forms a logical checkpoint; and
  
  performing a progressive retry algorithm upon detection of a fault during said monitoring step, said progressive retry algorithm employing a roll back propagation algorithm to compute a recovery line at the latest available actual or logical checkpoint for each of said monitored processes which has not been discarded, said roll back propagation algorithm enforcing the roll back propagation rule, said messages in said message logs that were received and processed by said associated processes between said latest actual checkpoint of said associated process and said computed recovery line being deterministic messages, said messages in said message logs that were sent before said computed recovery line and received after said recovery line being in-transit messages, said progressive retry algorithm comprising the steps of:
  
  performing a receiver replay step to restore said faulty process to its latest checkpoint and to replay said messages in said message log associated with said faulty process that were received by said faulty process since said latest checkpoint up to the point of the detected fault;
  
  performing a receiver reorder step if said receiver replay step does not bypass said software fault, said receiver reorder step further comprising the steps of:
  
  discarding said processing order information for said messages in said message log associated with said faulty process that were received after its latest checkpoint;
  
  computing said recovery line;
  
  reordering said in-transit messages in said message log of said faulty process; and
  
  replaying said deterministic messages and replaying or resubmitting said in-transit messages in said message logs of each of said monitored processes whose current process state is not at said recovery line;
  
  performing a sender replay step if said receiver reorder step does not bypass said software fault, said sender replay step further comprising the steps of:
  
  discarding said messages in said message log associated with said faulty process that were received by said faulty process after the last checkpoint performed for said faulty process; and
  
  replaying said deterministic messages and resubmitting said in-transit messages in said message logs of each of said monitored processes whose current process state is not at said recovery line, wherein each of said sending processes that sent said discarded messages to said faulty process during initial processing will resend messages to said faulty process during said sender replay step;
  
  performing a sender reorder step if said sender replay step does not bypass said software fault, said sender reorder step further comprising the steps of:
  
  discarding said processing order information for each of those messages in said message logs associated with said sending process that were received by said sending process after the logical checkpoint which is before the first message sent by said sending process to said faulty process since the latest checkpoint associated with said faulty process;
  
  recomputing said recovery line;
  
  reordering said in-transit messages in said message logs associated with said sending processes; and
  
  replaying said deterministic messages and replaying or resubmitting said in-transit messages in said message logs of each of said monitored processes whose current process state is not at said recovery line;
  
  performing a large scope roll back step if said sender reorder step does not bypass said software fault, said large scope roll back step rolling back each of said monitored processes to the latest consistent global checkpoint.
  - 20. The fault tolerant method according to claim 19, wherein said step of logging messages further includes the step of logging messages that are sent by each monitored process in a sender message log associated with each monitored process and wherein said messages that are regenerated during said recovery are compared to said messages stored in said sender log during initial processing to verify the piecewise deterministic assumption.
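
The receiver reorder step recited in claim 19 replays the deterministic messages and presents the in-transit messages in a different order. One way to picture this, as a sketch under assumptions and not the claimed procedure, is a generator that keeps the deterministic messages in their logged order and yields alternative orderings of the in-transit messages. The function name `reordered_replays` is hypothetical, and exhaustive permutation is a simplification chosen here for clarity.

```python
from itertools import permutations
from typing import Iterator, List, Sequence


def reordered_replays(
    deterministic: Sequence[str], in_transit: Sequence[str]
) -> Iterator[List[str]]:
    """Yield candidate replay sequences for a receiver reorder attempt.

    Deterministic messages keep their logged order; in-transit messages are
    tried in every order except the original one, which already failed.
    """
    original = tuple(in_transit)
    for order in permutations(in_transit):
        if order == original:
            continue  # skip the ordering that led to the fault
        yield list(deterministic) + list(order)
```

With deterministic messages `["a", "b"]` and in-transit messages `["x", "y"]`, the only candidate is `["a", "b", "y", "x"]`. The number of candidates grows factorially with the number of in-transit messages, so a practical system would presumably bound how many orderings it tries.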

Current Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Original Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Inventors: Fuchs, Wesley K.; Wang, Yi-Min; Huang, Yennun
Primary Examiner(s): Beausoliel, Jr., Robert W.
Assistant Examiner(s): Decady, Albert
Application Number: US08/263,978
Time in Patent Office: 923 Days
Field of Search: 395/181, 395/182.13, 395/182.14, 395/182.15, 395/182.11, 395/182.18, 395/183.01, 395/183.11, 395/183.14, 364/266, 364/281.8, 364/282.2, 364/285.2
US Class Current: 714/38.13
CPC Class Codes: G06F 11/1438 Restarting or rejuvenating
