Method and apparatus for providing failure detection and recovery with predetermined degree of replication for distributed applications in a network

US 6,195,760 B1
Filed: 07/20/1998
Issued: 02/27/2001
Est. Priority Date: 07/20/1998
Status: Expired due to Term

11 Assignments

0 Petitions

Abstract

An application module (A) running on a host computer in a computer network is failure-protected with one or more backup copies that are operative on other host computers in the network. In order to effect fault protection, the application module registers itself with a ReplicaManager daemon process (112) by sending a registration message, which message, in addition to identifying the registering application module and the host computer on which it is running, includes the particular replication strategy (cold backup, warm backup, or hot backup) and the degree of replication associated with that application module. The backup copies are then maintained in a fail-over state according to the registered replication strategy. A WatchDog daemon (113), running on the same host computer as the registered application, periodically monitors the registered application to detect failures. When a failure, such as a crash or hangup of the application module, is detected, the failure is reported to the ReplicaManager, which effects the requested fail-over actions. An additional backup copy is then made operative in accordance with the registered replication style and the registered degree of replication. A SuperWatchDog daemon process (115-1), running on the same host computer as the ReplicaManager, monitors each host computer in the computer network. When a host failure is detected, each application module running on that host computer is individually failure-protected in accordance with its registered replication style and degree of replication.
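
The registration and recovery flow described in the abstract can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the class and field names (`Registration`, `ReplicaManager`, `report_failure`, `idle_hosts`) are hypothetical, hosts are plain strings, and "executing an idle backup copy" is modeled as moving a host name from an idle list into a running set.

```python
from dataclasses import dataclass, field
from enum import Enum


class Style(Enum):
    COLD = "cold"
    WARM = "warm"
    HOT = "hot"


@dataclass(frozen=True)
class Registration:
    """Hypothetical registration message; field names are illustrative."""
    module: str
    host: str
    style: Style
    degree: int  # number of running copies to maintain


@dataclass
class ReplicaManager:
    """Toy in-memory stand-in for a ReplicaManager-like daemon."""
    registrations: dict = field(default_factory=dict)
    running: dict = field(default_factory=dict)  # module -> set of hosts
    idle: dict = field(default_factory=dict)     # module -> list of hosts

    def register(self, msg: Registration, idle_hosts: list) -> None:
        self.registrations[msg.module] = msg
        self.running.setdefault(msg.module, set()).add(msg.host)
        self.idle[msg.module] = list(idle_hosts)
        self._top_up(msg.module)

    def report_failure(self, module: str, host: str) -> None:
        """Called by a WatchDog-like monitor when a running copy fails."""
        self.running[module].discard(host)
        self._top_up(module)

    def _top_up(self, module: str) -> None:
        # Start idle backups until the registered degree is met again,
        # or until no idle backup is left.
        want = self.registrations[module].degree
        while len(self.running[module]) < want and self.idle[module]:
            self.running[module].add(self.idle[module].pop(0))
```

With a registered degree of 2 and idle backups on two other hosts, registering starts one backup immediately, and a later failure report starts the remaining one so that two copies keep running.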

185 Citations

22 Claims

1. A computer system for fault tolerant computing comprising:
- a plurality of host computers interconnected on a network;
  
  one or more copies of an application module each running on a different one of said plurality of host computers;
  
  one or more idle backup copies of the application module each stored on a different one of said host computers;
  
  a manager daemon process running on one of said plurality of host computers, the manager daemon process receiving an indication upon a failure of one of said running copies of the application module and initiating failure recovery; and
  
  means for providing a registration message to said manager daemon process, said registration message specifying said application module and a degree of replication of said application module, said degree of replication indicating the number of running copies of the application module to be maintained in the system;
  
  wherein the number of running copies of the application module is maintained at the registered degree of replication by executing at least one of said idle backup copies upon detecting one or more failures, respectively, of any of the running copies of said application module.
2. The computer system of claim 1 further comprising:
3. The computer system of claim 2 further comprising:
- a checkpoint server connected to the network, said checkpoint server periodically storing the states of each of said running copies of said application module and said manager daemon process.
4. The computer system of claim 3 wherein upon detection of the failure of one of said running copies of said application module, said manager daemon process signals one of said at least one idle backup copies to execute and to assume the processing functions of the failed copy, said one backup copy retrieving from said checkpoint server the last stored state of the failed copy of the application module.
5. The computer system of claim 3 further comprising:
- a second failure-detection daemon process running on the same host computer as the manager daemon process, said second failure-detection process monitoring a host computer on which one of the copies of the application module is running for a failure.
6. The computer system of claim 5 wherein upon detection of a failure of the monitored host computer, said manager daemon process signals one of said idle backup copies to execute and to assume the processing functions of the copy of the application module running on the failed host computer, the executed backup copy retrieving from said checkpoint server the last stored state of the copy of the application module running on the failed host computer.
7. The computer system of claim 5 further comprising:
- a backup copy of said second failure-detection daemon process running on one of said plurality of host computers other than the host computer on which the second failure-detection daemon process is running, said copy of said second failure-detection process monitoring the host computer on which the second failure-detection daemon process is running for a failure.
8. The computer system of claim 7 wherein upon detection of a failure of the host computer on which the second failure-detection daemon process is running, said backup copy of said second failure-detection daemon process assumes the processing functions of said second failure-detection daemon process and initiates running of a copy of said manager daemon process on its own host computer, said copy of said manager daemon process retrieving from said checkpoint server the last stored state of said manager daemon process while it was running on said failed host computer.
9. The computer system of claim 1 wherein the registration message for the application module further specifies a style of replication that indicates whether the replication style for the application module is to be cold, warm or hot.
10. The computer system of claim 4 wherein the registration message for the application module further specifies a fail-over strategy, the fail-over strategy indicating whether one of said idle backup copies should assume the processing functions of a failed one of said running copies each time a failure of that one running copy is detected by said failure-detection process, or whether said one of said idle backup copies should assume the processing functions of said one failed running copy only after the number of failures of that one copy of said application module reaches a predetermined threshold.
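
The checkpoint-based takeover recited in claims 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the claimed system: `CheckpointServer`, `Copy`, `store`, `last_state`, and `take_over` are hypothetical names, state is a plain dictionary, and the server is an in-process object rather than a networked service.

```python
class CheckpointServer:
    """Toy in-memory store keyed by (module, copy_id); hypothetical API."""

    def __init__(self):
        self._states = {}

    def store(self, module, copy_id, state):
        # Keep a snapshot so later changes to `state` do not leak in.
        self._states[(module, copy_id)] = dict(state)

    def last_state(self, module, copy_id):
        return dict(self._states[(module, copy_id)])


class Copy:
    """One copy of an application module, running or idle."""

    def __init__(self, module, copy_id):
        self.module = module
        self.copy_id = copy_id
        self.state = {}
        self.running = False

    def take_over(self, failed, server):
        """Start running from the failed copy's last stored checkpoint."""
        self.state = server.last_state(failed.module, failed.copy_id)
        self.running = True
```

Any work the failed copy did after its last checkpoint is not visible to the backup, which is why the checkpoint interval bounds how much progress a fail-over can lose.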

11. A fault-managing computer apparatus on a host computer in a computer system, said apparatus comprising:
- a manager daemon process for receiving an indication of a failure of a copy of an application module running on at least one of a plurality of host computers in the computer system and for initiating failure recovery with at least one idle backup copy of the application module; and
  
  means for receiving a registration message specifying the application module and a degree of replication for the application module, said degree of replication indicating the number of running copies of the application module to be maintained in the system;
  
  wherein the number of running copies of the application module in the system is maintained at the registered degree of replication by executing one of the idle backup copies upon detecting a failure of one of the running copies of the application module.
12. The apparatus of claim 11 wherein upon receiving an indication of a failure of one of the running copies of the application module said manager daemon process signals one of the idle backup copies to assume the processing functions of the failed copy.
13. The apparatus of claim 11 further comprising a failure-detection daemon process for monitoring each host computer in the system for a failure.
14. The apparatus of claim 13 wherein upon said failure-detection daemon process detecting a failure of one of the host computers on which a copy of the application module is running, said manager daemon process signals one of said at least one idle backup copies to assume the processing functions of the copy of the application module on the failed host computer.
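
Host-level failure detection as in claims 13 and 14 can be sketched with a heartbeat timeout. The heartbeat mechanism is an assumption made for illustration; the claims state only that each host is monitored for a failure. The function names and the `placement` mapping (host to the modules running on it) are hypothetical.

```python
def failed_hosts(last_heartbeat, now, timeout):
    """Return hosts whose last heartbeat is older than `timeout` seconds."""
    return sorted(h for h, t in last_heartbeat.items() if now - t > timeout)


def modules_to_recover(failed, placement):
    """Map each failed host to the modules that were running on it.

    Each listed module would then be recovered individually, according to
    its own registered replication style and degree.
    """
    return {h: sorted(placement.get(h, [])) for h in failed}
```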

15. A fault-tolerant computing apparatus for use in a computer system, said apparatus comprising:
- a failure-detection daemon process running on said apparatus, said failure-detection daemon process monitoring the ability of a running copy of an application module to continue to run on said apparatus; and
  
  means for sending a registration message to a manager daemon process specifying the application module and a degree of replication to be maintained by the manager daemon process for the application module with respect to the number of running copies of the application module to be maintained in the system;
  
  wherein the number of running copies of the application module in the system is maintained at the registered degree of replication by executing an idle backup copy of the application module on a different computing apparatus upon detecting a failure of the running copy of the application module.
16. The apparatus of claim 15 wherein upon detecting a failure of the running copy of the application module on the apparatus, the idle backup copy of the application module is executed and assumes the processing functions of the failed copy.
17. The apparatus of claim 15 wherein the registration message further specifies a style of replication that indicates that the application module is to be replicated in the computer system with a cold, warm or hot backup style.
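
The local monitoring role of claim 15 can be sketched as a single health check that separates the two failure kinds named in the abstract, a crash and a hangup. The inputs are assumptions made for illustration: `alive` stands for whatever liveness probe the monitor uses, and `last_progress` for the time the application last reported progress. Neither is specified in the claims.

```python
from enum import Enum


class Health(Enum):
    OK = "ok"
    CRASHED = "crashed"
    HUNG = "hung"


def check(alive, last_progress, now, hang_timeout):
    """Classify a monitored copy from one polling sample."""
    if not alive:
        return Health.CRASHED  # process is gone
    if now - last_progress > hang_timeout:
        return Health.HUNG     # process exists but made no progress
    return Health.OK
```

A monitor built on this check would report any result other than `Health.OK` to the manager daemon process.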

18. A method for operating a fault-tolerant computer system, said system comprising a plurality of host computers interconnected on a network, one or more copies of an application module each one running on a different one of said plurality of host computers, and one or more idle backup copies of the application module each stored on a different one of said host computers;
- said method comprising the steps of:
  
  receiving a registration message specifying the application module and a degree of replication to be maintained for the application module, said degree of replication indicating the number of running copies of the application module to be maintained in the system; and
  
  executing at least one of the idle backup copies upon detecting a failure of one of the running copies of the application module to maintain the total number of running copies of the application module in the system at the registered degree of replication.
19. The method of claim 18 further comprising the steps of:
20. The method of claim 18 further comprising the steps of:
- monitoring one of the host computers on which a copy of the application module is running; and
  
  upon detecting a failure of that host computer, initiating failure recovery for the copy of the application module on that host computer with one of the idle backup copies.
21. The method of claim 18 wherein the registration message for the application module further specifies a style of replication that indicates whether the replication style for the application module is to be cold, warm or hot.
22. The method of claim 19 wherein the registration message for the application module further specifies a fail-over strategy, the fail-over strategy indicating whether one of the idle backup copies should assume the processing functions of a failed one of the running copies each time a failure of that one running copy is detected, or whether one of the idle backup copies should assume the processing functions of that one failed running copy only after the number of failures of that one copy reaches a predetermined threshold.
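
The fail-over strategy of claims 10 and 22 amounts to a per-copy failure counter compared against a threshold. The sketch below is a hypothetical illustration: `FailoverPolicy` and the action strings are invented names, the "restart_local" action for failures below the threshold is an assumption (the claims say only that the backup does not yet take over), and resetting the counter after a fail-over is likewise an assumption.

```python
from collections import Counter


class FailoverPolicy:
    """Decide between fail-over and a local restart for each failure."""

    def __init__(self, threshold=1):
        # threshold=1 means: fail over on every detected failure.
        self.threshold = threshold
        self.failures = Counter()

    def on_failure(self, copy_id):
        self.failures[copy_id] += 1
        if self.failures[copy_id] >= self.threshold:
            self.failures[copy_id] = 0  # assumed: start counting afresh
            return "failover"
        return "restart_local"          # assumed action below threshold
```

With `threshold=3`, the first two failures of a copy yield a local restart and the third triggers a fail-over to an idle backup copy.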

Current Assignee: Academia Sinica (Government of The Republic of China); Nokia of America Corporation (Nokia Corporation); Sound View Innovations, LLC (Sound View Innovation Holdings, LLC)
Original Assignee: Academia Sinica (Government of The Republic of China); Lucent Technologies, Inc. (Nokia Corporation)
Inventors: Chung, Pi-Yu; Huang, Yennun; Liang, Deron; Shih, Chia-Yen; Yajnik, Shalini
Primary Examiner(s): Beausoliel, Jr., Robert W.
Assistant Examiner(s): Baderman, Scott T.
Application Number: US09/119,140
Time in Patent Office: 953 Days
Field of Search: 714/4, 714/6, 714/7, 714/11, 714/13, 714/16, 714/47, 714/57, 709/223-224, 707/202, 707/204
US Class Current: 714/4.1
CPC Class Codes:
G06F 11/0757 by exceeding a time limit, ...
G06F 11/1438 Restarting or rejuvenating
