System and method for comprehensive availability management in a high-availability computer system

US 6,691,244 B1
Filed: 03/14/2000
Issued: 02/10/2004
Est. Priority Date: 03/14/2000
Status: Expired due to Term

Abstract

A system and method for availability management coordinates operational states of components to implement a desired redundancy model within a high-availability computing system. Within the availability management system, an availability manager monitors various reports on the status of components and nodes within the system. The availability manager uses these reports to direct components to change states if necessary, in order to maintain the desired system redundancy model. The availability management system includes a health monitor for performing component status audits upon individual components and reporting component status changes. The system also includes a watch-dog timer, which monitors the health monitor and reboots the entire node containing the health monitor if it becomes non-responsive. Each node within the system also includes a cluster membership monitor, which monitors nodes becoming non-responsive and reports node non-responsive errors.
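
The watch-dog arrangement described in the abstract can be illustrated with a minimal sketch. It assumes the health monitor periodically "checks in" with the timer and that missing a deadline triggers a node reboot. The class name `WatchdogTimer`, the `reboot_node` hook, and the injected `clock` are hypothetical and do not come from the patent.

```python
from typing import Callable


class WatchdogTimer:
    """Reboots a node if the health monitor stops checking in (hypothetical names)."""

    def __init__(self, timeout_s: float, reboot_node: Callable[[], None],
                 clock: Callable[[], float]) -> None:
        self.timeout_s = timeout_s
        self.reboot_node = reboot_node  # assumed hook; a real system would reset the node
        self.clock = clock              # injected so the sketch runs without sleeping
        self.last_heartbeat = clock()

    def heartbeat(self) -> None:
        """Called by the health monitor to show it is still responsive."""
        self.last_heartbeat = self.clock()

    def poll(self) -> bool:
        """Check the deadline; trigger a reboot and return True if it was missed."""
        if self.clock() - self.last_heartbeat > self.timeout_s:
            self.reboot_node()
            return True
        return False


# Usage with a fake clock so the timing is deterministic.
now = [0.0]
reboots = []
watchdog = WatchdogTimer(timeout_s=5.0,
                         reboot_node=lambda: reboots.append("node-1"),
                         clock=lambda: now[0])
now[0] = 3.0
watchdog.heartbeat()
now[0] = 7.0
fired_early = watchdog.poll()  # 4 s since the last heartbeat: within the deadline
now[0] = 9.0
fired_late = watchdog.poll()   # 6 s since the last heartbeat: deadline missed
```

Injecting the clock is only a testing convenience. A hardware watch-dog would typically be reset by writing to a device register rather than by a method call.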

36 Claims

1. In a high availability computer system including one or more nodes, each node including a plurality of components, wherein each component has an operational state, an availability management system for managing the operational states of the components, comprising:
- a health monitor for performing a component status audit upon a component and reporting component status changes;
  
  a timer for monitoring the health monitor and rebooting the node including the health monitor if the health monitor becomes non-responsive;
  
  a multi-component error correlator for receiving the component status changes and applying pre-specified rules to determine whether a sequence of component status changes matches a known pattern, wherein the multi-component error correlator reports component status change pattern matches as component error reports; and
  
  an availability manager to receive the component error reports and assign operational states to the components in accordance with the received component error reports.
2. The availability management system of claim 1, further including:
3. The availability management system of claim 1, further including:
- an in-line error detector signal for reporting component status changes.
4. The availability management system of claim 1, wherein the availability manager publishes component operational states to other nodes within the highly available computer system.
5. The availability management system of claim 1, wherein an operational state of a component is active.
6. The availability management system of claim 1, wherein an operational state of a component is standby.
7. The availability management system of claim 1, wherein an operational state of a component is spare.
8. The availability management system of claim 1, wherein an operational state of a component is off-line.
9. The availability management system of claim 1, wherein a component status change is a component failure.
10. The availability management system of claim 1, wherein a component status change is a component loss of capacity.
11. The availability management system of claim 1, wherein a component status change is a new component available.
12. The availability management system of claim 1, wherein a component status change is a request to take a component off-line.
13. The availability management system of claim 1, wherein the step of performing a component status audit further includes:
- initiating an audit upon a component;
  
  reporting a component error to the multi-component error correlator if the audit fails to complete within a specified time; and
  
  initiating a component error to the multi-component error correlator if the audit detects a component failure.
14. The availability management system of claim 1, further including:
- a first node including the availability manager; and
  
  a second node including a proxy availability manager, wherein the proxy availability manager relays messages to the availability manager.
15. The availability management system of claim 1, further including:
- a first node including the availability manager; and
  
  a second node including a back-up availability manager, wherein the back-up availability manager assumes the functions of the availability manager if the availability manager fails.
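
The multi-component error correlator recited in claim 1 applies pre-specified rules to a sequence of component status changes. The following is a minimal sketch of one way to do that. It assumes a rule is an ordered tuple of `(component, status)` pairs and that a match means the most recent changes equal that tuple. The names `Rule`, `ErrorCorrelator`, and the sample rule are hypothetical, and the patent does not prescribe this matching strategy.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple

StatusChange = Tuple[str, str]  # (component name, new status)


@dataclass(frozen=True)
class Rule:
    pattern: Tuple[StatusChange, ...]  # known sequence of status changes
    report: str                        # component error report emitted on a match


class ErrorCorrelator:
    def __init__(self, rules: List[Rule], history_size: int = 16) -> None:
        self.rules = rules
        # Bounded history: only the most recent changes are considered.
        self.history: Deque[StatusChange] = deque(maxlen=history_size)

    def receive(self, change: StatusChange) -> List[str]:
        """Record one status change and return the reports of every rule whose
        pattern equals the tail of the recorded history."""
        self.history.append(change)
        recent = tuple(self.history)
        reports = []
        for rule in self.rules:
            n = len(rule.pattern)
            if 0 < n <= len(recent) and recent[-n:] == rule.pattern:
                reports.append(rule.report)
        return reports


# Hypothetical rule: two disks failing back to back suggests a shared controller fault.
rules = [Rule(pattern=(("disk0", "failed"), ("disk1", "failed")),
              report="storage-controller fault")]
correlator = ErrorCorrelator(rules)
first = correlator.receive(("disk0", "failed"))
second = correlator.receive(("disk1", "failed"))
```

In this sketch the first status change alone produces no report. Only the completed two-step sequence is reported to the availability manager.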

16. In a high availability computer system including one or more nodes, each node including a plurality of components, wherein each component has an operational state, a method for managing the operational states of the components, comprising:
- receiving a plurality of event reports;
  
  receiving a plurality of component status reports for at least one of the components from a health monitor residing on one of the nodes of the computer system;
  
  monitoring the health monitor;
  
  when the monitoring indicates the health monitor is non-responsive, rebooting the node including the health monitor;
  
  applying pre-specified rules to the plurality of event reports and plurality of component status reports, wherein the event reports and component status reports are compared to known event patterns, and wherein an event pattern match generates a component error report;
  
  receiving a plurality of component error reports; and
  
  dynamically readjusting the operational states of at least one of the components based upon the component error reports.
17. The method of claim 16, further including:
18. The method of claim 16, wherein an event report is received through a publish/subscribe event notification system.
19. The method of claim 16, wherein a component status report is generated by a component performing an internal self-audit.
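
Claim 18 refers to a publish/subscribe event notification system without describing its interface. A minimal in-process sketch follows. It assumes topic-based delivery, and the `EventBus` class and the topic names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class EventBus:
    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler to be called for every event published on a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> int:
        """Deliver the event to every subscriber of the topic; return the count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


# An availability manager could subscribe to status events like this.
received: List[Event] = []
bus = EventBus()
bus.subscribe("component.status", received.append)
delivered = bus.publish("component.status", {"component": "nic0", "status": "failed"})
ignored = bus.publish("node.joined", {"node": "node-2"})  # no subscribers for this topic
```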

20. In a high availability computer system including a plurality of components, wherein each component has an operational state, a method for managing the operational states of the components, comprising:
- registering the plurality of components with an availability manager;
  
  registering each of the plurality of component's associated states with an availability manager;
  
  accepting a plurality of reports regarding the status of components; and
  
  dynamically adjusting component state assignments based upon the reports, wherein the state assignments are selected from the group consisting of standby, spare, and off-line and wherein the reports indicate that a sequence of changes in the status of components matches a known pattern based on a set of pre-specified rules.
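
The registration and state-adjustment steps of claim 20 can be sketched as follows. The sketch assumes one simple policy: a failed component is taken off-line, and if it was a standby, a spare that registered support for the standby state is promoted. That policy, the `AvailabilityManager` class, and its method names are hypothetical and are not taken from the patent.

```python
from typing import Dict, Optional, Set

STATES = ("standby", "spare", "off-line")  # the group of states named in claim 20


class AvailabilityManager:
    def __init__(self) -> None:
        self.supported: Dict[str, Set[str]] = {}  # component -> states it registered
        self.assigned: Dict[str, str] = {}        # component -> current assignment

    def register(self, component: str, states: Set[str], initial: str) -> None:
        """Register a component together with the states it can take."""
        if initial not in states or not states <= set(STATES):
            raise ValueError("unsupported state for " + component)
        self.supported[component] = set(states)
        self.assigned[component] = initial

    def handle_failure_report(self, failed: str) -> Optional[str]:
        """Take the failed component off-line. If it was a standby, promote a
        spare that supports standby. Returns the promoted component, if any."""
        if failed not in self.assigned:
            raise KeyError(failed)
        was_standby = self.assigned[failed] == "standby"
        self.assigned[failed] = "off-line"
        if not was_standby:
            return None
        for name, state in self.assigned.items():
            if state == "spare" and "standby" in self.supported[name]:
                self.assigned[name] = "standby"
                return name
        return None


mgr = AvailabilityManager()
mgr.register("db-a", {"standby", "off-line"}, initial="standby")
mgr.register("db-b", {"standby", "spare", "off-line"}, initial="spare")
promoted = mgr.handle_failure_report("db-a")
```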

21. A computer program product for managing the operational states of the components in a high availability computer system including one or more nodes, each node including a plurality of components, wherein each component has an operational state, the computer program product comprising:
- program code configured to receive a plurality of event reports;
  
  program code configured to receive a plurality of component status reports;
  
  program code configured to apply pre-specified rules to the plurality of event reports and plurality of component status reports, wherein the event reports and component status reports are compared to known event patterns, and wherein an event pattern match generates a component error report;
  
  program code configured to receive a plurality of component error reports; and
  
  program code configured to dynamically readjust the operational states of the components based upon the component error reports, wherein the operational states are selected from the group of states consisting of standby, spare, and off-line.
22. The computer program product of claim 21, further including:

23. In a high availability computer system including one or more nodes, each node including a plurality of components, wherein each component has an operational state, an availability management system for managing the operational states of the components, comprising:
- a health monitor for performing a component status audit upon a component and reporting component status changes;
  
  a multi-component error correlator for receiving the component status changes and applying pre-specified rules to determine whether a sequence of component status changes matches a known pattern, wherein the multi-component error correlator reports component status change pattern matches as component error reports; and
  
  an availability manager to receive the component error reports and assign operational states to the components in accordance with the received component error reports, wherein the operational states of a component are selected from the group consisting of standby, spare, and off-line.
24. The availability management system of claim 23, further including:
25. The availability management system of claim 23, further including:
- an in-line error detector signal for reporting component status changes.
26. The availability management system of claim 23, wherein the availability manager publishes component operational states to other nodes within the highly available computer system.
27. The availability management system of claim 23, wherein the component status changes are selected from the group consisting of a component failure, a component loss of capacity, a new component available, and a request to take a component off-line.
28. The availability management system of claim 23, wherein the step of performing a component status audit further includes:
- initiating an audit upon a component;
  
  reporting a component error to the multi-component error correlator if the audit fails to complete within a specified time; and
  
  initiating a component error to the multi-component error correlator if the audit detects a component failure.
29. The availability management system of claim 23, further including:
- a first node including the availability manager; and
  
  a second node including a proxy availability manager, wherein the proxy availability manager relays messages to the availability manager.
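
The component status audit of claim 28 distinguishes two outcomes: the audit does not finish within a specified time, or the audit finishes and detects a failure. A minimal sketch of that distinction follows. It assumes the audit is a callable returning `True` when the component is healthy and uses a worker thread to enforce the time limit. The function `run_audit`, the `report_error` callback, and the error labels are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as AuditTimeout
from typing import Callable, List, Tuple


def run_audit(component: str, audit: Callable[[], bool], timeout_s: float,
              report_error: Callable[[str, str], None]) -> str:
    """Run one audit and report an error on timeout or on a detected failure."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(audit)
    try:
        passed = future.result(timeout=timeout_s)
    except AuditTimeout:
        # The audit did not complete within the specified time.
        report_error(component, "audit-timeout")
        return "timeout"
    finally:
        pool.shutdown(wait=False)  # do not block on an audit that is still running
    if not passed:
        # The audit completed and detected a component failure.
        report_error(component, "component-failure")
        return "failed"
    return "ok"


errors: List[Tuple[str, str]] = []


def report(component: str, kind: str) -> None:
    errors.append((component, kind))  # stand-in for the error correlator


healthy = run_audit("fan0", lambda: True, 1.0, report)
broken = run_audit("fan1", lambda: False, 1.0, report)
hung = run_audit("fan2", lambda: time.sleep(0.3) or True, 0.05, report)
```

A thread cannot be forcibly cancelled in Python, so the hung audit keeps running in the background. A production design would more likely isolate audits in separate processes.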

30. An availability management system for managing the operational states of the components in a high availability computer system including one or more nodes, each node including a plurality of components, the components each having an operational state, comprising:
- a health monitor for performing a component status audit upon a component and reporting component status changes;
  
  a multi-component error correlator for receiving the component status changes and applying pre-specified rules to determine whether a sequence of component status changes matches a known pattern, wherein the multi-component error correlator reports component status change pattern matches as component error reports and wherein the component status changes comprise a new component available or a request to take a component off-line; and
  
  an availability manager to receive the component error reports and assign operational states to the components in accordance with the received component error reports.
31. The availability management system of claim 30, further including:
32. The availability management system of claim 30, further including:
- an in-line error detector signal for reporting component status changes.
33. The availability management system of claim 30, wherein the availability manager publishes component operational states to other nodes within the highly available computer system.
34. The availability management system of claim 30, wherein the step of performing a component status audit further includes:
- initiating an audit upon a component;
  
  reporting a component error to the multi-component error correlator if the audit fails to complete within a specified time; and
  
  initiating a component error to the multi-component error correlator if the audit detects a component failure.
35. The availability management system of claim 30, further including:
- a first node including the availability manager; and
  
  a second node including a proxy availability manager, wherein the proxy availability manager relays messages to the availability manager.

36. An availability management system for managing the operational states of the components in a high availability computer system including one or more nodes, each node including a plurality of components, wherein each component has an operational state, comprising:
- a health monitor for performing a component status audit upon a component and reporting component status changes;
  
  a multi-component error correlator for receiving the component status changes and applying pre-specified rules to determine whether a sequence of component status changes matches a known pattern, wherein the multi-component error correlator reports component status change pattern matches as component error reports;
  
  an availability manager to receive the component error reports and assign operational states to the components in accordance with the received component error reports;
  
  a first node including the availability manager; and
  
  a second node including a back-up availability manager, wherein the back-up availability assumes the functions of the availability manager if the availability manager fails.
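
The back-up arrangement of claim 36 can be sketched as a pair of managers on different nodes. The sketch assumes that every state assignment is mirrored to the back-up so that it can take over with current data, and that failure is signalled by a simple `alive` flag. Both assumptions, and the names `Manager` and `ManagerPair`, are hypothetical; the claim does not say how state is replicated or how failure is detected.

```python
from typing import Dict


class Manager:
    def __init__(self, node: str) -> None:
        self.node = node
        self.alive = True
        self.states: Dict[str, str] = {}  # component -> operational state


class ManagerPair:
    """An availability manager on one node and a back-up manager on another."""

    def __init__(self, primary: Manager, backup: Manager) -> None:
        self.acting = primary
        self.backup = backup

    def assign(self, component: str, state: str) -> None:
        # Assumption: mirror every assignment so the back-up holds current state.
        self.acting.states[component] = state
        if self.backup is not self.acting:
            self.backup.states[component] = state

    def check(self) -> str:
        """If the acting manager has failed, the back-up assumes its functions.
        Returns the node that currently hosts the acting manager."""
        if not self.acting.alive and self.backup.alive:
            self.acting = self.backup
        return self.acting.node


primary = Manager("node-1")
backup = Manager("node-2")
pair = ManagerPair(primary, backup)
pair.assign("db", "active")
before = pair.check()
primary.alive = False          # simulate failure of the availability manager
after = pair.check()
pair.assign("db", "standby")   # now handled by the former back-up
```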

Current Assignee: Oracle America, Inc. (Oracle Corporation)
Original Assignee: Sun Microsystems Incorporated (Oracle Corporation)
Inventors: Hisgen, Andrew; Kampe, Mark A.
Primary Examiner(s): BADERMAN, SCOTT T
Application Number: US09/525,200
Time in Patent Office: 1,428 Days
Field of Search: 714/4, 714/43, 714/57, 709/223, 709/224
US Class Current: 714/4.1
CPC Class Codes: G06F 11/00 Error detection; Error corr...
