Managing Faults in a High Availability System
Abstract
An approach is provided for managing a failure of a critical high availability (HA) component in a HA system. Critical HA components are identified. Categories are assigned to the identified components and weights are assigned to the categories. A current value indicating a performance of a component included in the identified components is obtained by periodically monitoring the components. A reference value for the performance of the component is received. A deviation between the current value and the reference value is determined. Based on the deviation, the component is determined to have failed. Based in part on the failed component, the categories, and the weights, a health index is determined in real-time. The health index indicates in part how much the component having failed affects a measure of health of the HA system.
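The steps summarized in the abstract can be illustrated with a minimal sketch. The text shown here does not give a formula for the health index or a rule for deciding failure, so everything concrete below is an assumption made for illustration: the `Component` type, the relative-deviation `tolerance`, the component names and category weights, and the choice to express the index as the percentage of total category weight that is still healthy.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    category: str      # category assigned to the component
    reference: float   # reference value for its performance
    current: float     # latest value obtained by periodic monitoring


def has_failed(c: Component, tolerance: float = 0.2) -> bool:
    """Assumed rule: failed when the deviation exceeds tolerance * |reference|."""
    deviation = abs(c.current - c.reference)
    return deviation > tolerance * abs(c.reference)


def health_index(components, weights) -> float:
    """Assumed formula: percentage of total category weight not lost to failures."""
    total = sum(weights[c.category] for c in components)
    lost = sum(weights[c.category] for c in components if has_failed(c))
    return 100.0 * (1.0 - lost / total)


# Hypothetical components and category weights, for illustration only.
components = [
    Component("heartbeat-link", "network", reference=10.0, current=10.5),
    Component("shared-disk", "storage", reference=200.0, current=80.0),
    Component("cluster-daemon", "software", reference=1.0, current=1.0),
]
weights = {"network": 3.0, "storage": 5.0, "software": 2.0}

index = health_index(components, weights)
print(index)  # 50.0: only the storage component (weight 5 of 10) has failed
```

Weighting by category rather than by individual component mirrors the claim language, in which weights are assigned to categories. A real monitor would recompute the index on every monitoring cycle and would need to handle an empty component list, which this sketch does not.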
20 Claims
1. A method of managing a failure of a critical high availability (HA) component, the method comprising the steps of:
a computer identifying a plurality of critical HA components of a HA system;
the computer receiving and assigning categories to the plurality of critical HA components;
the computer receiving and assigning weights to the categories;
the computer obtaining a current value indicating a performance of a critical HA component included in the plurality of critical HA components, the current value obtained by periodically monitoring the plurality of critical HA components;
the computer receiving a reference value for the performance of the critical HA component;
the computer determining a deviation between the current value and the reference value;
based on the deviation, the computer determining the critical HA component has failed; and
based in part on the failed critical HA component, the categories, and the weights, the computer determining a health index in real-time, the health index indicating in part how much the critical HA component having failed affects a measure of health of the HA system.
9. A computer program product comprising:
a computer-readable, tangible storage device; and
a computer-readable program code stored in the computer-readable, tangible storage device, the computer-readable program code containing instructions that are executed by a central processing unit (CPU) of a computer system to implement a method of managing a failure of a critical high availability (HA) component, the method comprising the steps of:
the computer system identifying a plurality of critical HA components of a HA system;
the computer system receiving and assigning categories to the plurality of critical HA components;
the computer system receiving and assigning weights to the categories;
the computer system obtaining a current value indicating a performance of a critical HA component included in the plurality of critical HA components, the current value obtained by periodically monitoring the plurality of critical HA components;
the computer system receiving a reference value for the performance of the critical HA component;
the computer system determining a deviation between the current value and the reference value;
based on the deviation, the computer system determining the critical HA component has failed; and
based in part on the failed critical HA component, the categories, and the weights, the computer system determining a health index in real-time, the health index indicating in part how much the critical HA component having failed affects a measure of health of the HA system.
15. A method of predicting a reoccurrence of a failure of a critical high availability (HA) component, the method comprising the steps of:
a computer determining a real-time failover is happening based on a failure of a HA system;
the computer determining an actual amount of time taken by an event occurring during the failover;
the computer receiving a reference amount of time that the event occurring during the failover is expected to take;
the computer determining the actual amount of time is not equal to the reference amount of time within a predefined tolerance;
based on the actual amount of time being not equal to the reference amount of time, the computer predicting the failure will reoccur unless a fault is repaired;
the computer identifying critical HA components participating in the event, the critical HA components included in the HA system; and
the computer determining a critical HA component included in the identified critical HA components has failed based on the fault and invoking self-healing to repair the fault in the critical HA component by performing a branch based decision making process on the identified critical HA components.
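The timing comparison in claim 15 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the function name `predicts_reoccurrence`, the event names, the durations, and the one-second tolerance are all hypothetical, and the branch based decision making and self-healing steps are not modeled.

```python
def predicts_reoccurrence(actual_s: float, reference_s: float, tolerance_s: float) -> bool:
    """True when the actual duration is not equal to the reference within the tolerance."""
    return abs(actual_s - reference_s) > tolerance_s


# Hypothetical failover events: name -> (actual seconds, reference seconds).
events = {
    "ip-takeover": (2.1, 2.0),
    "volume-mount": (45.0, 12.0),
}

# Events whose timing is out of tolerance; under the claim, the failure is
# predicted to reoccur unless the underlying fault is repaired.
suspect = [
    name
    for name, (actual, reference) in events.items()
    if predicts_reoccurrence(actual, reference, tolerance_s=1.0)
]
print(suspect)  # ['volume-mount']
```

In this sketch, the components participating in each suspect event would then be the candidates examined by the decision process that locates and repairs the fault.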
Specification