System and method for comprehensive availability management in a high-availability computer system
Abstract
A system and method for availability management coordinates operational states of components to implement a desired redundancy model within a high-availability computing system. Within the availability management system, an availability manager monitors various reports on the status of components and nodes within the system. The availability manager uses these reports to direct components to change states if necessary, in order to maintain the desired system redundancy model. The availability management system includes a health monitor for performing component status audits upon individual components and reporting component status changes. The system also includes a watch-dog timer, which monitors the health monitor and reboots the entire node containing the health monitor if it becomes non-responsive. Each node within the system also includes a cluster membership monitor, which monitors nodes becoming non-responsive and reports node non-responsive errors.
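As an illustration only, the state coordination described in the abstract can be sketched as follows. The sketch assumes a redundancy model of one active and one standby component backed by a pool of spares; the class name, method names, and component names (`db-a`, `db-b`, `db-c`) are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field

ACTIVE, STANDBY, SPARE, OFFLINE = "active", "standby", "spare", "off-line"


@dataclass
class AvailabilityManager:
    # Maps component name -> operational state (hypothetical data model).
    states: dict = field(default_factory=dict)

    def register(self, name, state):
        self.states[name] = state

    def _first_in(self, state):
        # Return the first registered component in the given state, or None.
        return next((n for n, s in self.states.items() if s == state), None)

    def handle_error_report(self, failed):
        """Take the failed component off-line, then restore the assumed
        redundancy model of one active plus one standby component."""
        was_active = self.states[failed] == ACTIVE
        self.states[failed] = OFFLINE
        if was_active:
            standby = self._first_in(STANDBY)
            if standby is not None:
                self.states[standby] = ACTIVE
        # Refill the standby role from the spare pool if it is now empty.
        if self._first_in(STANDBY) is None:
            spare = self._first_in(SPARE)
            if spare is not None:
                self.states[spare] = STANDBY
        return dict(self.states)


am = AvailabilityManager()
am.register("db-a", ACTIVE)
am.register("db-b", STANDBY)
am.register("db-c", SPARE)
result = am.handle_error_report("db-a")
```

In this sketch a failure of the active component promotes the standby and draws a replacement standby from the spares. A real redundancy model could differ, for example by using several active components.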
36 Claims
-
1. In a high availability computer system including one or more nodes, each node including a plurality of components, wherein each component has an operational state, an availability management system for managing the operational states of the components, comprising:
-
a health monitor for performing a component status audit upon a component and reporting component status changes;
a timer for monitoring the health monitor and rebooting the node including the health monitor if the health monitor becomes non-responsive;
a multi-component error correlator for receiving the component status changes and applying pre-specified rules to determine whether a sequence of component status changes matches a known pattern, wherein the multi-component error correlator reports component status change pattern matches as component error reports; and
an availability manager to receive the component error reports and assign operational states to the components in accordance with the received component error reports.
-
2. The availability management system of claim 1, further including:
a cluster membership monitor for monitoring node non-responsive errors and reporting node non-responsive errors, wherein the availability manager receives the component error reports and node non-responsive errors, and assigns operational states to the components in accordance with the received component error reports and node non-responsive errors.
-
-
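The multi-component error correlator recited in claim 1 might be sketched as below. This is a minimal sketch under stated assumptions: a rule is modelled as an ordered tuple of (component, status) pairs that must appear, in order, within a sliding window of recent status changes. The window size, the rule format, and all names are hypothetical; the claims do not specify how patterns are represented.

```python
from collections import deque


def _is_subsequence(pattern, events):
    # True if every step of `pattern` appears in `events` in the same order
    # (other events may be interleaved between the steps).
    it = iter(events)
    return all(step in it for step in pattern)


class ErrorCorrelator:
    def __init__(self, rules, window=8):
        # rules: list of (pattern, error_report) pairs (assumed format).
        self.rules = rules
        self.recent = deque(maxlen=window)

    def on_status_change(self, component, status):
        """Record a status change and return the error reports of all
        rules whose pattern is matched by the recent history."""
        self.recent.append((component, status))
        reports = [report for pattern, report in self.rules
                   if _is_subsequence(pattern, self.recent)]
        if reports:
            # Clear the history so one occurrence is reported only once.
            self.recent.clear()
        return reports


rules = [
    ((("fan-1", "failure"), ("cpu-1", "loss of capacity")),
     "cpu-1 degraded following fan-1 failure"),
]
correlator = ErrorCorrelator(rules)
first = correlator.on_status_change("fan-1", "failure")
second = correlator.on_status_change("disk-0", "new component available")
third = correlator.on_status_change("cpu-1", "loss of capacity")
```

Only the third status change completes the pattern, so only that call returns a component error report for the availability manager.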
3. The availability management system of claim 1, further including:
an in-line error detector signal for reporting component status changes.
-
4. The availability management system of claim 1, wherein the availability manager publishes component operational states to other nodes within the highly available computer system.
-
5. The availability management system of claim 1, wherein an operational state of a component is active.
-
6. The availability management system of claim 1, wherein an operational state of a component is standby.
-
7. The availability management system of claim 1, wherein an operational state of a component is spare.
-
8. The availability management system of claim 1, wherein an operational state of a component is off-line.
-
9. The availability management system of claim 1, wherein a component status change is a component failure.
-
10. The availability management system of claim 1, wherein a component status change is a component loss of capacity.
-
11. The availability management system of claim 1, wherein a component status change is a new component available.
-
12. The availability management system of claim 1, wherein a component status change is a request to take a component off-line.
-
13. The availability management system of claim 1, wherein the step of performing a component status audit further includes:
-
initiating an audit upon a component;
reporting a component error to the multi-component error correlator if the audit fails to complete within a specified time; and
initiating a component error to the multi-component error correlator if the audit detects a component failure.
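The audit steps of claim 13 can be illustrated with a short sketch. It assumes the audit is a callable that returns `True` for a healthy component, and it uses a worker thread with a timeout to detect an audit that fails to complete in time. The function name `run_audit`, the `report` callback, and the component names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as AuditTimeout


def run_audit(component, audit_fn, timeout_s, report):
    """Initiate an audit; report an error if it times out or detects a failure."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(audit_fn, component)
    try:
        healthy = future.result(timeout=timeout_s)
    except AuditTimeout:
        # The audit did not complete within the specified time.
        report(component, "audit timed out")
    else:
        if not healthy:
            # The audit completed and detected a component failure.
            report(component, "component failure")
    finally:
        # Do not block on a hung audit thread.
        pool.shutdown(wait=False)


reports = []


def report(component, reason):
    # Stand-in for delivery to the multi-component error correlator.
    reports.append((component, reason))


run_audit("nic-0", lambda c: True, 0.5, report)
run_audit("nic-1", lambda c: False, 0.5, report)
run_audit("nic-2", lambda c: time.sleep(0.3) or True, 0.05, report)
```

A healthy audit produces no report, while a failed audit and a slow audit each produce one. An exception raised inside the audit itself would propagate to the caller in this sketch.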
-
-
14. The availability management system of claim 1, further including:
-
a first node including the availability manager; and
a second node including a proxy availability manager, wherein the proxy availability manager relays messages to the availability manager.
-
-
15. The availability management system of claim 1, further including:
-
a first node including the availability manager; and
a second node including a back-up availability manager, wherein the back-up availability manager assumes the functions of the availability manager if the availability manager fails.
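The proxy and back-up arrangements of claims 14 and 15 could be combined roughly as follows. This is a sketch under assumptions: failure of the availability manager is represented by a simple `alive` flag, and the proxy is the party that redirects messages to the back-up. The claims do not say how failure is detected or who performs the switch, and all names are hypothetical.

```python
class AvailabilityManager:
    def __init__(self, node):
        self.node = node
        self.alive = True       # assumed failure indicator
        self.received = []

    def handle(self, message):
        self.received.append(message)
        return self.node        # report which node served the message


class ProxyAvailabilityManager:
    """Relays messages to the current availability manager."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup

    def relay(self, message):
        if not self.primary.alive:
            if self.backup is None:
                raise RuntimeError("no availability manager is reachable")
            # The back-up assumes the functions of the failed manager.
            self.primary, self.backup = self.backup, None
        return self.primary.handle(message)


primary = AvailabilityManager("node-1")
backup = AvailabilityManager("node-2")
proxy = ProxyAvailabilityManager(primary, backup)

served_by_first = proxy.relay("fan-1 failure")
primary.alive = False
served_by_second = proxy.relay("cpu-1 loss of capacity")
```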
-
-
16. In a high availability computer system including one or more nodes, each node including a plurality of components, wherein each component has an operational state, a method for managing the operational states of the components, comprising:
-
receiving a plurality of event reports;
receiving a plurality of component status reports for at least one of the components from a health monitor residing on one of the nodes of the computer system;
monitoring the health monitor;
when the monitoring indicates the health monitor is non-responsive, rebooting the node including the health monitor;
applying pre-specified rules to the plurality of event reports and plurality of component status reports, wherein the event reports and component status reports are compared to known event patterns, and wherein an event pattern match generates a component error report;
receiving a plurality of component error reports; and
dynamically readjusting the operational states of at least one of the components based upon the component error reports.
-
17. The method of claim 16, further including:
receiving a plurality of node non-responsive reports; and
dynamically readjusting the operational states of the components based upon the component error reports and the node non-responsive reports.
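The steps of monitoring the health monitor and rebooting its node when it becomes non-responsive (claim 16) resemble a watchdog timer, which might be sketched as below. The sketch assumes the health monitor periodically "kicks" the timer and that a reboot is requested through a callback. The class name, the injected clock, and the 5-second timeout are hypothetical choices for illustration.

```python
import time


class WatchdogTimer:
    def __init__(self, timeout_s, reboot, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.reboot = reboot        # callback that reboots the node
        self.clock = clock          # injectable for testing
        self.last_kick = clock()

    def kick(self):
        # Called by the health monitor to show it is still responsive.
        self.last_kick = self.clock()

    def check(self):
        """Call periodically. Requests a reboot and returns True if the
        health monitor has not kicked the timer within the timeout."""
        if self.clock() - self.last_kick > self.timeout_s:
            self.reboot()
            return True
        return False


# Deterministic demonstration with a fake clock.
now = [0.0]
rebooted = []
watchdog = WatchdogTimer(5.0, reboot=lambda: rebooted.append("node-1"),
                         clock=lambda: now[0])

now[0] = 3.0
watchdog.kick()                 # health monitor is responsive at t = 3 s
now[0] = 7.0
still_ok = watchdog.check()     # 4 s since the last kick: within the timeout
now[0] = 9.0
expired = watchdog.check()      # 6 s since the last kick: timeout exceeded
```

In a real system the timer would typically be implemented in hardware or in the kernel so that it keeps running even if user-level software hangs.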
-
-
18. The method of claim 16, wherein an event report is received through a publish/subscribe event notification system.
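A publish/subscribe event notification system of the kind named in claim 18 can be illustrated with a minimal in-process sketch. The topic names, the `EventBus` class, and the event dictionary layout are assumptions made for the example; the claim does not define an interface.

```python
from collections import defaultdict


class EventBus:
    def __init__(self):
        # Maps topic name -> list of subscriber callbacks.
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        """Deliver the event to every subscriber of the topic and
        return the number of subscribers notified."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(event)
        return len(callbacks)


bus = EventBus()
received = []
# The receiver of event reports subscribes to the topic it cares about.
bus.subscribe("component.status", received.append)

delivered = bus.publish("component.status",
                        {"component": "nic-0", "status": "failure"})
ignored = bus.publish("node.membership", {"node": "node-2"})
```

Publishers need no knowledge of the receivers, so new sources of event reports can be added without changing the code that consumes them.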
-
19. The method of claim 16, wherein a component status report is generated by a component performing an internal self-audit.
-
20. In a high availability computer system including a plurality of components, wherein each component has an operational state, a method for managing the operational states of the components, comprising:
-
registering the plurality of components with an availability manager;
registering each of the plurality of component's associated states with an availability manager;
accepting a plurality of reports regarding the status of components; and
dynamically adjusting component state assignments based upon the reports, wherein the state assignments are selected from the group consisting of standby, spare, and off-line and wherein the reports indicate that a sequence of changes in the status of components matches a known pattern based on a set of pre-specified rules.
-
-
21. A computer program product for managing the operational states of the components in a high availability computer system including one or more nodes, each node including a plurality of components, wherein each component has an operational state, the computer program product comprising:
-
program code configured to receive a plurality of event reports;
program code configured to receive a plurality of component status reports;
program code configured to apply pre-specified rules to the plurality of event reports and plurality of component status reports, wherein the event reports and component status reports are compared to known event patterns, and wherein an event pattern match generates a component error report;
program code configured to receive a plurality of component error reports; and
program code configured to dynamically readjust the operational states of the components based upon the component error reports, wherein the operational states are selected from the group of states consisting of standby, spare, and off-line.
-
22. The computer program product of claim 21, further including:
program code configured to receive a plurality of node non-responsive reports; and
program code configured to dynamically readjust the operational states of the components based upon the component error reports and the node non-responsive reports.
-
-
23. In a high availability computer system including one or more nodes, each node including a plurality of components, wherein each component has an operational state, an availability management system for managing the operational states of the components, comprising:
-
a health monitor for performing a component status audit upon a component and reporting component status changes;
a multi-component error correlator for receiving the component status changes and applying pre-specified rules to determine whether a sequence of component status changes matches a known pattern, wherein the multi-component error correlator reports component status change pattern matches as component error reports; and
an availability manager to receive the component error reports and assign operational states to the components in accordance with the received component error reports, wherein the operational states of a component are selected from the group consisting of standby, spare, and off-line.
-
24. The availability management system of claim 23, further including:
a cluster membership monitor for monitoring node non-responsive errors and reporting node non-responsive errors, wherein the availability manager receives the component error reports and node non-responsive errors, and assigns operational states to the components in accordance with the received component error reports and node non-responsive errors.
-
-
25. The availability management system of claim 23, further including:
an in-line error detector signal for reporting component status changes.
-
26. The availability management system of claim 23, wherein the availability manager publishes component operational states to other nodes within the highly available computer system.
-
27. The availability management system of claim 23, wherein the component status changes are selected from the group consisting of a component failure, a component loss of capacity, a new component available, and a request to take a component off-line.
-
28. The availability management system of claim 23, wherein the step of performing a component status audit further includes:
-
initiating an audit upon a component;
reporting a component error to the multi-component error correlator if the audit fails to complete within a specified time; and
initiating a component error to the multi-component error correlator if the audit detects a component failure.
-
-
29. The availability management system of claim 23, further including:
-
a first node including the availability manager; and
a second node including a proxy availability manager, wherein the proxy availability manager relays messages to the availability manager.
-
-
30. An availability management system for managing the operational states of the components in a high availability computer system including one or more nodes, each node including a plurality of components, the components each having an operational state, comprising:
-
a health monitor for performing a component status audit upon a component and reporting component status changes;
a multi-component error correlator for receiving the component status changes and applying pre-specified rules to determine whether a sequence of component status changes matches a known pattern, wherein the multi-component error correlator reports component status change pattern matches as component error reports and wherein the component status changes comprise a new component available or a request to take a component off-line; and
an availability manager to receive the component error reports and assign operational states to the components in accordance with the received component error reports.
-
31. The availability management system of claim 30, further including:
a cluster membership monitor for monitoring node non-responsive errors and reporting node non-responsive errors, wherein the availability manager receives the component error reports and node non-responsive errors, and assigns operational states to the components in accordance with the received component error reports and node non-responsive errors.
-
-
32. The availability management system of claim 30, further including:
an in-line error detector signal for reporting component status changes.
-
33. The availability management system of claim 30, wherein the availability manager publishes component operational states to other nodes within the highly available computer system.
-
34. The availability management system of claim 30, wherein the step of performing a component status audit further includes:
-
initiating an audit upon a component;
reporting a component error to the multi-component error correlator if the audit fails to complete within a specified time; and
initiating a component error to the multi-component error correlator if the audit detects a component failure.
-
-
35. The availability management system of claim 30, further including:
-
a first node including the availability manager; and
a second node including a proxy availability manager, wherein the proxy availability manager relays messages to the availability manager.
-
-
36. An availability management system for managing the operational states of the components in a high availability computer system including one or more nodes, each node including a plurality of components, wherein each component has an operational state, comprising:
-
a health monitor for performing a component status audit upon a component and reporting component status changes;
a multi-component error correlator for receiving the component status changes and applying pre-specified rules to determine whether a sequence of component status changes matches a known pattern, wherein the multi-component error correlator reports component status change pattern matches as component error reports;
an availability manager to receive the component error reports and assign operational states to the components in accordance with the received component error reports;
a first node including the availability manager; and
a second node including a back-up availability manager, wherein the back-up availability manager assumes the functions of the availability manager if the availability manager fails.
-
Specification