Method for monitoring and recovery of subsystems in a distributed/clustered system
Abstract
A system and method are disclosed for a general and extensible infrastructure that provides monitoring and recovery of interdependent systems in a distributed/clustered system. Subsystems built without provision for high availability are incorporated into the infrastructure without modification to core subsystem function. The infrastructure comprises one or more computing nodes connected by one or more interconnection networks and running one or more distributed subsystems. The infrastructure monitors the computing nodes using one or more heartbeat and membership protocols, and monitors said distributed subsystems using subsystem-specific monitors. Events detected by the monitors are sent to event handlers, which filter them through activities such as event correlation, removal of duplicates, and rollup. Filtered events are passed by the Event Managers to Recovery Drivers, which determine the recovery program corresponding to each event and execute that recovery program, or set of recovery actions, by coordination among the recovery managers. If event handlers or recovery managers fail, the infrastructure performs the additional steps of: coordinating among the remaining event handlers and recovery managers to handle completion or termination of ongoing recovery actions; discovering the current state of the system by resetting said monitors; and handling any new failure events that may have occurred in the interim.
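The filtering stage described in the abstract can be illustrated with a minimal sketch. The abstract names duplicate removal and rollup but does not define them, so the `Event` fields, the `node_down` event kind, and the rollup rule (drop subsystem events from a node that is itself reported down) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the patent does not specify these fields.
    node: str       # node on which the monitor fired
    subsystem: str  # subsystem the monitor watches
    kind: str       # e.g. "process_down" or "node_down"


def remove_duplicates(events):
    """Keep the first occurrence of each identical event, preserving order."""
    seen = set()
    unique = []
    for ev in events:
        if ev not in seen:
            seen.add(ev)
            unique.append(ev)
    return unique


def rollup(events, node_failure_kind="node_down"):
    """Assumed rollup rule: a node failure subsumes subsystem events on that node."""
    down_nodes = {ev.node for ev in events if ev.kind == node_failure_kind}
    return [
        ev for ev in events
        if ev.kind == node_failure_kind or ev.node not in down_nodes
    ]


def filter_events(events):
    """Apply duplicate removal, then rollup, as an event handler might."""
    return rollup(remove_duplicates(events))
```

With two identical `process_down` reports from one node and a `node_down` plus a `process_down` from another, `filter_events` would pass on one subsystem event and one node event.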
12 Claims
1. A method for monitoring and recovery of subsystems in a distributed computer system comprising the steps of:
(a) executing a distributed software subsystem on the distributed system, said software subsystem not being self-recoverable from failure events;
(b) providing user-defined monitors for the software subsystem, each of the user-defined monitors including a set of user defined events to be detected; and
(c) responsive to an occurrence of one of the events, performing recovery actions coordinated among the nodes of the distributed computer system as controlled by a user specified recovery program.
Dependent claims: 2
3. A method for operating a distributed system comprising the steps of:
executing a set of interdependent software subsystems run on nodes of the distributed system, each of said software subsystems not being self-recoverable from failure events;
providing a user-defined set of monitors that probe the health of each subsystem and report failure events;
providing a user-defined recovery program for each of a plurality of the failure events; and
using the user-defined recovery program, coordinating and synchronizing the recovery of the interdependent software subsystems.
Dependent claims: 4, 5, 6
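One way to picture coordinated recovery of interdependent subsystems is to restart a failed subsystem and everything that depends on it, in dependency order. The claim does not prescribe any ordering scheme, so the sketch below is an assumption: `depends_on` is a hypothetical map from each subsystem to the subsystems it requires, and the ordering uses Python's standard `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter


def recovery_order(depends_on, failed):
    """Return the failed subsystems and their transitive dependents,
    ordered so that each subsystem comes after everything it depends on."""
    # Grow the affected set until no further dependents are found.
    affected = set(failed)
    changed = True
    while changed:
        changed = False
        for subsystem, deps in depends_on.items():
            if subsystem not in affected and affected & set(deps):
                affected.add(subsystem)
                changed = True
    # static_order() yields dependencies before the subsystems that need them.
    ordered = TopologicalSorter(depends_on).static_order()
    return [s for s in ordered if s in affected]
```

For example, with a hypothetical web tier depending on an application tier that depends on a database, a database failure would lead to recovering the database first, then the application tier, then the web tier, while an unrelated cache is left alone.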
7. A method of providing error recovery in a distributed system, comprising the steps of:
monitoring computing nodes of the distributed system using at least one heartbeat and membership protocol,
monitoring for subsystems running on the computing nodes using user defined monitors, at least one of said subsystems not being self-recoverable from failure events;
reporting events detected by the user-defined monitors to at least one event handler;
filtering the events in the event handler so as to provide filtered events;
applying a set of rules to the filtered events to select a user-defined recovery program from a set of user-defined recovery programs; and
coordinating among the nodes in the distributed system to execute a selected recovery program.
Dependent claims: 8, 9, 10
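The last two steps of this claim, rule-based selection of a recovery program and coordinated execution across nodes, can be sketched as follows. The representation is hypothetical: rules are modeled as (predicate, program) pairs, a recovery program as a list of step names, and coordination as a simple barrier in which every node finishes a step before any node starts the next. The claim itself does not specify any of these details.

```python
def select_program(event, rules):
    """Return the recovery program of the first rule whose predicate matches."""
    for matches, program in rules:
        if matches(event):
            return program
    return None  # no rule applies to this event


def run_coordinated(program, nodes, execute):
    """Run each step on every node before moving on to the next step.

    `execute(node, step)` is a caller-supplied callable (an assumption here);
    the returned log records the order in which steps were carried out.
    """
    log = []
    for step in program:
        for node in nodes:
            execute(node, step)
            log.append((node, step))
    return log


# Hypothetical rule base: step and event names are illustrative only.
RULES = [
    (lambda e: e["kind"] == "process_down" and e["subsystem"] == "db",
     ["stop_clients", "restart_db", "start_clients"]),
    (lambda e: e["kind"] == "node_down",
     ["fence_node", "failover"]),
]
```

A real system would replace the sequential loop with messages between recovery managers; the sequential form is used here only to make the step-by-step ordering visible.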
11. A system for providing error recovery in a distributed system, comprising:
a plurality of monitors in computing nodes of the distributed system using at least one heartbeat and membership protocol, a plurality of user-defined monitors for subsystems running on the computing nodes, the monitors including means for detecting events and sending reports of said events to event handlers, at least one of said subsystems not being self-recoverable from failure events;
means for processing events, in the event handlers, by filtering the events by way of activities such as event correlation, removal of duplicate events, and rollup;
means providing filtered events to recovery drivers, which have a rule base which specify user-defined recovery programs corresponding to events; and
means for coordinating among the nodes in the distributed system to execute the recovery program.
Dependent claims: 12
Specification