Method and apparatus for providing failure detection and recovery with predetermined replication style for distributed applications in a network
Abstract
An application module (A) running on a host computer in a computer network is failure-protected with one or more backup copies that are operative on other host computers in the network. In order to effect fault protection, the application module registers itself with a ReplicaManager daemon process (112) by sending a registration message, which message, in addition to identifying the registering application module and the host computer on which it is running, includes the particular replication strategy (cold backup, warm backup, or hot backup) and the degree of replication associated with that application module. The backup copies are then maintained in a fail-over state according to the registered replication strategy. A WatchDog daemon (113), running on the same host computer as the registered application, periodically monitors the registered application to detect failures. When a failure, such as a crash or hangup of the application module, is detected, the failure is reported to the ReplicaManager, which effects the requested fail-over actions. An additional backup copy is then made operative in accordance with the registered replication style and the registered degree of replication. A SuperWatchDog daemon process (115-1), running on the same host computer as the ReplicaManager, monitors each host computer in the computer network. When a host failure is detected, each application module running on that host computer is individually failure-protected in accordance with its registered replication style and degree of replication.
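The registration and fail-over flow summarized in the abstract can be illustrated with a minimal sketch. The names `RegistrationMessage`, `ReplicaManager`, `register`, and `report_failure` are hypothetical, and the sketch assumes that the degree of replication counts all copies including the primary; none of this is taken from the patent's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class ReplicationStyle(Enum):
    COLD = "cold"
    WARM = "warm"
    HOT = "hot"


@dataclass(frozen=True)
class RegistrationMessage:
    module: str              # the registering application module
    host: str                # host computer running the primary copy
    style: ReplicationStyle  # registered replication style
    degree: int              # copies to keep, primary included (assumption)


class ReplicaManager:
    """Hypothetical in-memory stand-in for the ReplicaManager daemon."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self.registry = {}  # module -> RegistrationMessage
        self.primary = {}   # module -> host of the primary copy
        self.backups = {}   # module -> hosts holding backup copies

    def register(self, msg):
        self.registry[msg.module] = msg
        self.primary[msg.module] = msg.host
        self.backups[msg.module] = []
        self._top_up(msg.module)

    def _top_up(self, module, exclude=()):
        # Add backup hosts until the registered degree is reached again.
        wanted = self.registry[module].degree - 1
        used = {self.primary[module], *self.backups[module], *exclude}
        for host in self.hosts:
            if len(self.backups[module]) >= wanted:
                break
            if host not in used:
                self.backups[module].append(host)
                used.add(host)

    def report_failure(self, module, failed_host):
        # Called when a WatchDog reports a failure: promote a backup,
        # then make an additional backup operative on another host.
        if self.primary[module] == failed_host and self.backups[module]:
            self.primary[module] = self.backups[module].pop(0)
        self._top_up(module, exclude={failed_host})


manager = ReplicaManager(["H1", "H2", "H3", "H4"])
manager.register(RegistrationMessage("A", "H1", ReplicationStyle.WARM, 2))
manager.report_failure("A", "H1")
print(manager.primary["A"], manager.backups["A"])
```

In this sketch the failed host is skipped only for the immediate top-up; whether a real manager would reuse it later is left open.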
29 Claims
-
1. A computer system for fault tolerant computing comprising:
-
a plurality of host computers interconnected on a network;
a first copy of an application module running on a first of said host computers;
a second copy of the application module operative on a second of said host computers;
a manager daemon process running on one of said plurality of host computers, the manager daemon process receiving an indication upon a failure of the first copy of the application module and initiating failure recovery with said second copy of the application module; and
means for providing a registration message to said manager daemon process, said registration message specifying said application module and a style of replication to be maintained by said manager daemon process for said application module from among a plurality of different replication styles;
wherein said second copy is maintained in an operative state for fail-over protection upon a failure of the first copy of the application module in accordance with the registered replication style.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
a first failure-detection daemon process running on said first host computer, said first failure-detection daemon process monitoring the ability of said first copy of the application module to continue to run, said first failure-detection daemon process sending to said manager daemon process a message indicating a failure of said first copy upon detecting a failure.
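As a rough illustration of a failure-detection daemon that monitors a local application copy and reports to the manager, the sketch below distinguishes a crash from a hang. The heartbeat mechanism, the `timeout` parameter, and the `notify` callback are assumptions made for the example; the claim does not say how the monitoring is performed.

```python
class WatchDog:
    """Hypothetical failure detector for application copies on one host."""

    def __init__(self, notify, timeout=3.0):
        self.notify = notify    # stands in for the message to the manager
        self.timeout = timeout  # seconds without a heartbeat before "hang"
        self.last_beat = {}     # module -> time of the last heartbeat

    def heartbeat(self, module, now):
        self.last_beat[module] = now

    def poll(self, module, process_alive, now):
        # Returns True when the module looks healthy, False after reporting.
        if not process_alive:
            self.notify(module, "crash")
            return False
        if now - self.last_beat.get(module, now) > self.timeout:
            self.notify(module, "hang")
            return False
        return True


reports = []
dog = WatchDog(lambda module, reason: reports.append((module, reason)))
dog.heartbeat("A", now=0.0)
healthy = dog.poll("A", process_alive=True, now=1.0)  # recent heartbeat
hung = dog.poll("A", process_alive=True, now=10.0)    # heartbeat too old
dead = dog.poll("A", process_alive=False, now=11.0)   # process is gone
```

Passing the clock value in as `now` keeps the sketch deterministic; a real daemon would read a monotonic clock itself.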
-
-
5. The computer system of claim 4 further comprising:
a checkpoint server connected to the network, said checkpoint server periodically storing the states of said first copy of the application module and said manager daemon process.
-
6. The computer system of claim 5 wherein upon detection of the failure of said first copy of the application module, said second host computer is signaled for the second copy to assume the processing functions of said first copy, said second copy retrieving from said checkpoint server the last stored state of said first copy.
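The checkpoint-based takeover in claims 5 and 6 can be pictured with the following sketch. `CheckpointServer`, `ModuleCopy`, and their methods are hypothetical names, and the in-memory dictionary merely stands in for a networked server.

```python
import copy


class CheckpointServer:
    """Hypothetical in-memory stand-in for the checkpoint server."""

    def __init__(self):
        self._states = {}

    def store(self, module, state):
        # Deep-copy so later changes in the running copy do not leak in.
        self._states[module] = copy.deepcopy(state)

    def last_state(self, module):
        return copy.deepcopy(self._states.get(module, {}))


class ModuleCopy:
    def __init__(self, module, host):
        self.module = module
        self.host = host
        self.state = {}
        self.active = False

    def take_over(self, server):
        # Resume from the last checkpoint; work done after it is lost.
        self.state = server.last_state(self.module)
        self.active = True


server = CheckpointServer()
first = ModuleCopy("A", "H1")
first.active = True
first.state["orders"] = 41
server.store("A", first.state)  # periodic checkpoint
first.state["orders"] = 42      # progress made after the checkpoint
second = ModuleCopy("A", "H2")
second.take_over(server)        # failure of the first copy was detected
```

The second copy resumes with the checkpointed value, not the later one, which shows why the checkpoint interval bounds how much work a fail-over can lose.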
-
7. The computer system of claim 5 further comprising:
a second failure-detection daemon process running on the same host computer as the manager daemon process, said second failure-detection process monitoring said first host computer for a failure.
-
8. The computer system of claim 7 wherein upon detection of a failure of said first host computer, said second copy of the application module is signaled to assume the processing functions of said first copy, said second copy retrieving from said checkpoint server the last stored state of said first copy of the application module.
-
9. The computer system of claim 7 further comprising:
a backup copy of said second failure-detection daemon process running on another one of said plurality of host computers different than the host computer on which the second failure-detection daemon process is running, said backup copy of said second failure-detection process monitoring said second host computer for a failure.
-
10. The computer system of claim 9 wherein upon detection of a failure of said second host computer, said backup copy of said second failure-detection daemon process assumes the processing functions of said second failure-detection daemon process and initiates running of a copy of said manager daemon process on said same another one of the host computers, said copy of said manager daemon process retrieving from said checkpoint server the stored state of said manager daemon process when it was running on its host computer.
-
11. The computer system of claim 3 wherein the registration message for the application module further specifies a degree of replication that indicates for a hot or warm backup replication style the number of copies of the application module to be maintained running on said plurality of host computers in the network.
-
12. The computer system of claim 6 wherein the registration message for the application module further specifies a fail-over strategy, the fail-over strategy indicating whether said second copy should assume the processing functions of said first copy of the application module each time a failure of said first copy is detected by said first failure-detection process, or whether said second copy should assume the processing functions of said first copy only after the number of failures of said first copy on said first host computer reaches a predetermined threshold.
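The two fail-over strategies in claim 12 reduce to a failure counter compared against a threshold. The sketch below is one possible reading: a threshold of 1 means the second copy takes over on every failure, and a larger threshold defers the takeover. The `FailoverPolicy` name, the local restart below the threshold, and the counter reset after a takeover are assumptions, not details stated in the claim.

```python
from dataclasses import dataclass


@dataclass
class FailoverPolicy:
    threshold: int = 1  # 1 = fail over each time a failure is detected
    failures: int = 0   # failures seen on the first host so far

    def on_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.failures = 0  # assumed: counting restarts after takeover
            return "failover"
        return "restart_locally"  # assumed action below the threshold


always = FailoverPolicy(threshold=1)
deferred = FailoverPolicy(threshold=3)
always_actions = [always.on_failure() for _ in range(2)]
deferred_actions = [deferred.on_failure() for _ in range(3)]
```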
-
13. A fault-managing computer apparatus on a host computer in a computer system, said apparatus comprising:
-
a manager daemon process for receiving an indication of a failure of a first copy of an application module running on a first host computer in the computer system and for initiating failure recovery with a second copy of the application module on a second host computer; and
means for receiving a registration message from the first copy of the application module specifying said application module and a style of replication to be maintained for said application module from among a plurality of different replication styles;
wherein the second copy is maintained in an operative state for fail-over protection upon a failure of the first copy of the application module in accordance with the registered replication style.
Dependent claims: 14, 15, 16, 17, 18
-
-
19. A fault-tolerant computing apparatus for use in a computer system, said apparatus comprising:
-
a failure-detection daemon process running on said apparatus, said failure-detection daemon process monitoring the ability of a first copy of an application module to continue to run on said apparatus; and
means for sending a registration message to a manager daemon process specifying the application module and a style of replication from among a plurality of different replication styles to be maintained by the manager daemon process for the application module with respect to a second copy of the application module that is operative on another computer apparatus in the computer system;
wherein the second copy is maintained in an operative state for fail-over protection upon a failure of the first application module in accordance with the registered replication style.
Dependent claims: 20, 21, 22
-
-
23. A method for operating a fault-tolerant computer system, said system comprising a plurality of host computers interconnected on a network, a first copy of an application module running on a first of the plurality of the host computers and a second copy of the first application module on a second of the plurality of host computers, said method comprising the steps of:
-
receiving a registration message specifying the application module and a style of replication to be maintained for the application module from among a plurality of different replication styles; and
maintaining said second copy in an operative state for fail-over protection upon a failure of the first application module in accordance with the registered replication style.
Dependent claims: 24, 25, 26, 27, 28, 29
receiving an indication upon a failure of the first copy of the application module; and
initiating failure recovery for the failed first copy with the second copy on the second host computer.
-
-
25. The method of claim 23 wherein the different replication styles indicate whether or not the second copy is to run simultaneously while the first copy of the application module runs on the first host computer, and if the second copy is to simultaneously run, whether the second copy can receive and respond to a client request.
-
26. The method of claim 23 wherein the different replication styles are cold backup, warm backup and hot backup.
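Claims 25 and 26 together suggest a two-flag description of each replication style: whether the second copy runs alongside the first, and whether it can receive and respond to client requests. The sketch below assumes the conventional mapping (cold: not running; warm: running but not serving; hot: running and serving). That mapping and the action strings are illustrative assumptions, not wording from the claims.

```python
from enum import Enum


class ReplicationStyle(Enum):
    COLD = "cold"
    WARM = "warm"
    HOT = "hot"


# style -> (backup runs alongside the primary, backup answers clients)
BACKUP_BEHAVIOR = {
    ReplicationStyle.COLD: (False, False),
    ReplicationStyle.WARM: (True, False),
    ReplicationStyle.HOT: (True, True),
}


def failover_actions(style):
    """List what still has to happen when the first copy fails."""
    runs, serves = BACKUP_BEHAVIOR[style]
    actions = []
    if not runs:
        actions.append("start backup copy")
    if not serves:
        actions.append("direct client requests to backup copy")
    return actions


for style in ReplicationStyle:
    print(style.value, failover_actions(style))
```

Under these assumptions, a hotter style leaves less work for fail-over time at the cost of running more copies during normal operation.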
-
27. The method of claim 23 further comprising the steps of:
-
monitoring the first host computer for a failure; and
upon detecting a failure of the first host computer, initiating failure recovery for the first copy of the application module with the second copy on the second host computer.
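Host-level recovery, as opposed to recovery of a single application copy, amounts to finding every module whose first copy ran on the failed host and recovering each one individually. The sketch below assumes a simple module-to-host mapping and a `fail_over` callback; both are hypothetical.

```python
def recover_host(primary_hosts, failed_host, fail_over):
    """Fail over every module whose primary copy ran on failed_host."""
    recovered = []
    for module in sorted(primary_hosts):
        if primary_hosts[module] == failed_host:
            fail_over(module)  # per-module recovery, e.g. promote a backup
            recovered.append(module)
    return recovered


placements = {"billing": "H1", "search": "H2", "auth": "H1"}
promoted = []
recovered = recover_host(placements, "H1", promoted.append)
```

Because each module is handled on its own, the callback can apply that module's registered replication style rather than one policy for the whole host.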
-
-
28. The method of claim 26 wherein the registration message for the first application module further specifies a degree of replication that indicates the number of copies of the application module to be maintained running on said plurality of host computers for a hot or warm backup replication style.
-
29. The method of claim 24 wherein the registration message for the application module further specifies a fail-over strategy, the fail-over strategy indicating whether the second copy assumes the processing functions of the first copy of the application module each time a failure of the first copy is detected, or whether the second copy assumes the processing functions of the first application module only after the number of failures of the first copy of the application module reaches a predetermined number.
Specification