Method and apparatus for providing failure detection and recovery with predetermined replication style for distributed applications in a network

US 6,266,781 B1
Filed: 07/20/1998
Issued: 07/24/2001
Est. Priority Date: 07/20/1998
Status: Expired due to Term

Abstract

An application module (A) running on a host computer in a computer network is failure-protected with one or more backup copies that are operative on other host computers in the network. In order to effect fault protection, the application module registers itself with a ReplicaManager daemon process (112) by sending a registration message, which message, in addition to identifying the registering application module and the host computer on which it is running, includes the particular replication strategy (cold backup, warm backup, or hot backup) and the degree of replication associated with that application module. The backup copies are then maintained in a fail-over state according to the registered replication strategy. A WatchDog daemon (113), running on the same host computer as the registered application, periodically monitors the registered application to detect failures. When a failure, such as a crash or hangup of the application module, is detected, the failure is reported to the ReplicaManager, which effects the requested fail-over actions. An additional backup copy is then made operative in accordance with the registered replication style and the registered degree of replication. A SuperWatchDog daemon process (115-1), running on the same host computer as the ReplicaManager, monitors each host computer in the computer network. When a host failure is detected, each application module running on that host computer is individually failure-protected in accordance with its registered replication style and degree of replication.
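
The registration step described in the abstract can be sketched as a small data structure. This is a minimal illustration in Python, not the patent's implementation: the names `ReplicationStyle`, `Registration`, and `ReplicaManager.register` are hypothetical, and the registration message is modeled as an in-process object rather than a message sent over the network.

```python
from dataclasses import dataclass
from enum import Enum


class ReplicationStyle(Enum):
    """The three replication strategies named in the abstract."""
    COLD = "cold backup"
    WARM = "warm backup"
    HOT = "hot backup"


@dataclass(frozen=True)
class Registration:
    """Hypothetical registration message sent by an application module."""
    module: str               # identifies the registering application module
    host: str                 # host computer on which the module is running
    style: ReplicationStyle   # requested replication strategy
    degree: int               # requested degree of replication


class ReplicaManager:
    """Toy stand-in for the ReplicaManager daemon; it only records registrations."""

    def __init__(self):
        self.registry = {}    # module name -> Registration

    def register(self, msg):
        if msg.degree < 1:
            raise ValueError("degree of replication must be at least 1")
        self.registry[msg.module] = msg


manager = ReplicaManager()
manager.register(Registration("A", "H1", ReplicationStyle.WARM, degree=2))
print(manager.registry["A"].style.value)  # prints "warm backup"
```

Keeping the style and degree in the message itself means the manager needs no per-application configuration; everything it must maintain arrives with the registration.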

29 Claims

1. A computer system for fault tolerant computing comprising:
- a plurality of host computers interconnected on a network;
  
  a first copy of an application module running on a first of said host computers;
  
  a second copy of the application module operative on a second of said host computers;
  
  a manager daemon process running on one of said plurality of host computers, the manager daemon process receiving an indication upon a failure of the first copy of the application module and initiating failure recovery with said second copy of the application module; and
  
  means for providing a registration message to said manager daemon process, said registration message specifying said application module and a style of replication to be maintained by said manager daemon process for said application module from among a plurality of different replication styles;
  
  wherein said second copy is maintained in an operative state for fail-over protection upon a failure of the first copy of the application module in accordance with the registered replication style.
2. The computer system of claim 1 wherein said different replication styles indicate whether or not the second copy of the application module is to run on said second host computer simultaneously while said first copy of the application module runs on said first host computer, and if said second copy is to simultaneously run, whether said second copy can receive and respond to a client request.
3. The computer system of claim 2 wherein the different replication styles are cold backup, warm backup and hot backup, wherein in accordance with the cold backup style, said second copy does not run while said first copy of the application module runs;
- in accordance with the warm backup style, said second copy runs while said first copy of the application module runs but cannot receive and respond to a client request; and
  
  in accordance with the hot backup style, said second copy runs while said first copy of the application module runs and can receive and respond to a client request.
4. The computer system of claim 1 further comprising:
5. The computer system of claim 4 further comprising:
- a checkpoint server connected to the network, said checkpoint server periodically storing the states of said first copy of the application module and said manager daemon process.
6. The computer system of claim 5 wherein upon detection of the failure of said first copy of the application module, said second host computer is signaled for the second copy to assume the processing functions of said first copy, said second copy retrieving from said checkpoint server the last stored state of said first copy.
7. The computer system of claim 5 further comprising:
- a second failure-detection daemon process running on the same host computer as the manager daemon process, said second failure-detection process monitoring said first host computer for a failure.
8. The computer system of claim 7 wherein upon detection of a failure of said first host computer, said second copy of the application module is signaled to assume the processing functions of said first copy, said second copy retrieving from said checkpoint server the last stored state of said first copy of the application module.
9. The computer system of claim 7 further comprising:
- a backup copy of said second failure-detection daemon process running on another one of said plurality of host computers different than the host computer on which the second failure-detection daemon process is running, said backup copy of said second failure-detection process monitoring said second host computer for a failure.
10. The computer system of claim 9 wherein upon detection of a failure of said second host computer, said backup copy of said second failure-detection daemon process assumes the processing functions of said second failure-detection daemon process and initiates running of a copy of said manager daemon process on said same another one of the host computers, said copy of said manager daemon process retrieving from said checkpoint server the stored state of said manager daemon process when it was running on its host computer.
11. The computer system of claim 3 wherein the registration message for the application module further specifies a degree of replication that indicates for a hot or warm backup replication style the number of copies of the application module to be maintained running on said plurality of host computers in the network.
12. The computer system of claim 6 wherein the registration message for the application module further specifies a fail-over strategy, the fail-over strategy indicating whether said second copy should assume the processing functions of said first copy of the application module each time a failure of said first copy is detected by said first failure-detection process, or whether said second copy should assume the processing functions of said copy only after the number of failures of said first copy on said first host computer reaches a predetermined threshold.
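
Claims 2, 3, and 12 can be read together as a small decision table plus a failure counter. The sketch below is one hypothetical encoding in Python. The names `StyleBehavior`, `BEHAVIOR`, and `FailoverPolicy` are illustrative, and two details are assumptions that the claims do not state: below the threshold the first copy is restarted on its own host, and the counter is reset after a fail-over.

```python
from dataclasses import dataclass
from typing import NamedTuple


class StyleBehavior(NamedTuple):
    backup_runs: bool      # does the second copy run alongside the first?
    serves_clients: bool   # can the second copy receive and respond to client requests?


# Encoding of the three styles as described in claim 3.
BEHAVIOR = {
    "cold": StyleBehavior(backup_runs=False, serves_clients=False),
    "warm": StyleBehavior(backup_runs=True, serves_clients=False),
    "hot": StyleBehavior(backup_runs=True, serves_clients=True),
}


@dataclass
class FailoverPolicy:
    """Hypothetical fail-over strategy; threshold=1 means fail over on every failure."""
    threshold: int = 1
    failures: int = 0

    def on_failure(self):
        """Record one failure of the first copy and return the action to take."""
        self.failures += 1
        if self.failures >= self.threshold:
            self.failures = 0          # assumption: counting starts over after a fail-over
            return "failover"          # second copy assumes the processing functions
        return "restart-local"         # assumption: retry on the same host below the threshold


policy = FailoverPolicy(threshold=3)
actions = [policy.on_failure() for _ in range(3)]
print(actions)  # ['restart-local', 'restart-local', 'failover']
```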

13. A fault-managing computer apparatus on a host computer in a computer system, said apparatus comprising:
- a manager daemon process for receiving an indication of a failure of a first copy of an application module running on a first host computer in the computer system and for initiating failure recovery with a second copy of the application module on a second host computer; and
  
  means for receiving a registration message from the first copy of the application module specifying said application module and a style of replication to be maintained for said application module from among a plurality of different replication styles;
  
  wherein the second copy is maintained in an operative state for fail-over protection upon a failure of the first copy of the application module in accordance with the registered replication style.
14. The apparatus of claim 13 wherein the different replication styles are cold backup, warm backup and hot backup.
15. The apparatus of claim 13 wherein upon receiving an indication of a failure of the first copy of the application module, said manager daemon process signals the second host computer for the second copy to assume the processing functions of the first copy of the application module.
16. The apparatus of claim 13 further comprising a failure-detection daemon process for monitoring the first host computer for a failure.
17. The apparatus of claim 16 wherein upon said failure-detection daemon process detecting a failure of the first host computer, said manager daemon process signals the second host computer for the second copy to assume the processing functions of the first copy of the application module.
18. The apparatus of claim 14 wherein the registration message further specifies a degree of replication that indicates the number of copies of the application module to be maintained running in the computer system for a hot or warm backup replication style.
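
As a rough illustration of claims 15 and 18, the manager's reaction to a reported failure can be modeled as promoting a surviving copy and then starting further copies until the registered degree of replication is met again. The function `handle_failure`, its parameters, and the idea of a pool of spare hosts are assumptions made for this Python sketch; the claims do not prescribe how replacement hosts are chosen.

```python
def handle_failure(running, failed_host, spare_hosts, degree):
    """Return (new_primary, running_hosts) after the copy on failed_host fails.

    running     -- hosts with a running copy; running[0] is the first (primary) copy
    spare_hosts -- hosts on which a new copy could be started (hypothetical pool)
    degree      -- registered number of copies to keep running
    """
    survivors = [h for h in running if h != failed_host]
    spares = [h for h in spare_hosts if h not in survivors and h != failed_host]
    # Start additional copies until the registered degree is met again.
    while len(survivors) < degree and spares:
        survivors.append(spares.pop(0))
    if not survivors:
        raise RuntimeError("no host available to run the application module")
    # The first surviving copy assumes the processing functions of the failed copy.
    return survivors[0], survivors


primary, hosts = handle_failure(["H1", "H2"], "H1", ["H3", "H4"], degree=2)
print(primary, hosts)  # H2 ['H2', 'H3']
```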

19. A fault-tolerant computing apparatus for use in a computer system, said apparatus comprising:
- a failure-detection daemon process running on said apparatus, said failure-detection daemon process monitoring the ability of a first copy of an application module to continue to run on said apparatus; and
  
  means for sending a registration message to a manager daemon process specifying the application module and a style of replication from among a plurality of different replication styles to be maintained by the manager daemon process for the application module with respect to a second copy of the application module that is operative on another computer apparatus in the computer system;
  
  wherein the second copy is maintained in an operative state for fail-over protection upon a failure of the first application module in accordance with the registered replication style.
20. The apparatus of claim 19 wherein the different replication styles are cold backup, warm backup and hot backup.
21. The apparatus of claim 19 wherein the second copy of the application module in the computer system assumes the processing functions of the first copy of the application module upon detecting a failure of the first copy of the application module.
22. The apparatus of claim 19 wherein the registration message further specifies a degree of replication that indicates the number of copies of the application module to be maintained running in the computer system for a hot or warm backup replication style.
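
The failure-detection daemon of claim 19 monitors whether the first copy can continue to run. One way to picture a single monitoring pass is shown below in Python. The heartbeat mechanism, the `hang_timeout` parameter, and the function names `check_module` and `watchdog_poll` are assumptions for illustration; the text only says that the application is monitored and that crashes and hangs count as failures.

```python
def check_module(is_alive, last_heartbeat, now, hang_timeout):
    """Classify a monitored module as 'ok', 'crashed', or 'hung'.

    is_alive       -- whether the process still exists
    last_heartbeat -- time of the module's most recent heartbeat, in seconds
    now            -- current time, in seconds
    hang_timeout   -- silence longer than this is treated as a hang (assumed parameter)
    """
    if not is_alive:
        return "crashed"
    if now - last_heartbeat > hang_timeout:
        return "hung"
    return "ok"


def watchdog_poll(modules, now, hang_timeout, report):
    """Run one polling pass and report each failed module to the manager."""
    for name, (is_alive, last_heartbeat) in modules.items():
        status = check_module(is_alive, last_heartbeat, now, hang_timeout)
        if status != "ok":
            report(name, status)


reports = []
watchdog_poll(
    {"A": (True, 100.0), "B": (False, 104.0), "C": (True, 90.0)},
    now=105.0,
    hang_timeout=10.0,
    report=lambda name, status: reports.append((name, status)),
)
print(reports)  # [('B', 'crashed'), ('C', 'hung')]
```

Passing `now` in explicitly keeps the check deterministic; a real daemon would read the clock and repeat the pass on a timer.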

23. A method for operating a fault-tolerant computer system, said system comprising a plurality of host computers interconnected on a network, a first copy of an application module running on a first of the plurality of the host computers and a second copy of the first application module on a second of the plurality of host computers, said method comprising the steps of:
- receiving a registration message specifying the application module and a style of replication to be maintained for the application module from among a plurality of different replication styles; and
  
  maintaining said second copy in an operative state for fail-over protection upon a failure of the first application module in accordance with the registered replication style.
24. The method of claim 23 further comprising the steps of:
25. The method of claim 23 wherein the different replication styles indicate whether or not the second copy is to run simultaneously while the first copy of the application module runs on the first host computer, and if the second copy is to simultaneously run, whether the second copy can receive and respond to a client request.
26. The method of claim 23 wherein the different replication styles are cold backup, warm backup and hot backup.
27. The method of claim 23 further comprising the steps of:
- monitoring the first host computer for a failure; and
  
  upon detecting a failure of the first host computer, initiating failure recovery for the first copy of the application module with the second copy on the second host computer.
28. The method of claim 26 wherein the registration message for the first application module further specifies a degree of replication that indicates the number of copies of the application module to be maintained running on said plurality of host computers for a hot or warm backup replication style.
29. The method of claim 24 wherein the registration message for the application module further specifies a fail-over strategy, the fail-over strategy indicating whether the second copy assumes the processing functions of the first copy of the application module each time a failure of the first copy is detected, or whether the second copy assumes the processing functions of the first application module only after the number of failures of the first copy of the application module reaches a predetermined number.
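
The host-level recovery of claim 27, combined with the per-module styles of claim 26, can be sketched as a planning step that walks every module whose first copy ran on the failed host. In this Python sketch the function `recover_host`, the `placements` mapping, and the action strings are hypothetical; the claims state only that recovery is initiated with the second copy in accordance with the registered style.

```python
def recover_host(failed_host, placements, registrations):
    """Plan recovery for every module whose first copy ran on failed_host.

    placements    -- module name -> (first_copy_host, second_copy_host)
    registrations -- module name -> registered style ("cold", "warm", or "hot")
    Returns a list of (module, second_copy_host, action) tuples.
    """
    # Hypothetical actions chosen to match the three styles.
    action_for = {
        "cold": "start backup copy",
        "warm": "activate running backup",
        "hot": "continue on running backup",
    }
    plan = []
    for module, (primary, backup) in sorted(placements.items()):
        if primary == failed_host:
            plan.append((module, backup, action_for[registrations[module]]))
    return plan


plan = recover_host(
    "H1",
    {"A": ("H1", "H2"), "B": ("H3", "H1"), "C": ("H1", "H3")},
    {"A": "cold", "B": "warm", "C": "hot"},
)
for step in plan:
    print(step)
# ('A', 'H2', 'start backup copy')
# ('C', 'H3', 'continue on running backup')
```

Module B is skipped because only its second copy ran on the failed host; restoring that lost backup would be a separate step governed by the degree of replication.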

Current Assignee: Nokia of America Corporation (Nokia Corporation)
Original Assignee: Academia Sinica (Government of The Republic of China); Lucent Technologies, Inc. (Nokia Corporation)
Inventors: Liang, Deron; Shih, Chia-Yen; Yajnik, Shalini; Huang, Yennun; Chung, Pi-Yu
Primary Examiner(s): Iqbal, Nadeem
Application Number: US09/119,139
Time in Patent Office: 1,100 Days
Field of Search: 714/4, 714/6, 714/7, 714/11, 714/12, 714/13, 714/25, 714/31, 714/39, 709/300, 709/400
US Class Current: 714/4.1
CPC Class Codes:
- G06F 11/0757: by exceeding a time limit, ...
- G06F 11/1438: Restarting or rejuvenating
- G06F 11/2023: Failover techniques
- G06F 11/2038: with a single idle spare pr...
- G06F 11/2097: maintaining the standby con...
