Systems and methods for predictive failure management

US 7,730,364 B2
Filed: 04/05/2007
Issued: 06/01/2010
Est. Priority Date: 04/05/2007
Status: Active Grant

Abstract

A system and method for using continuous failure predictions for proactive failure management in distributed cluster systems includes a sampling subsystem configured to continuously monitor and collect operation states of different system components. An analysis subsystem is configured to build classification models to perform on-line failure predictions. A failure prevention subsystem is configured to take preventive actions on failing components based on failure warnings generated by the analysis subsystem.

112 Citations

24 Claims

1. A system for using continuous failure predictions for proactive failure management in distributed cluster systems, comprising:
- a sampling subsystem configured to continuously monitor and collect operation states of different system components;
  
  an analysis subsystem configured to build classification models to perform on-line failure predictions; and
  
  a failure prevention subsystem configured to take preventive actions on failing components based on failure warnings generated by the analysis subsystem, wherein a pre-failure state of a component is dynamically decided based on a reward function that denotes an optimal trade-off between failure impact and prediction error cost.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20)
- - 2. The system as recited in claim 1, wherein the distributed cluster system includes at least one of hardware and software components.
  - 3. The system as recited in claim 1, wherein the system components of the system at any given time instant are characterized by a state which relates to normal or abnormal operations of the component.
  - 4. The system as recited in claim 3, wherein the states include at least one of the following:
    - a failure state which characterizes situations that are problematic and need action to avoid problems in the future;
      
      a pre-failure state which characterizes situations that lead up to a failure and include information to detect an impending failure; and
      
      a normal state which characterizes all other situations.
  - 5. The system as recited in claim 3, wherein the states are estimated based on a set of observable metrics.
  - 6. The system as recited in claim 1, wherein the components include a plurality of operation modes based on an observed component operation state.
  - 7. The system as recited in claim 6, wherein the operational modes include at least one of:
    - a working mode component state which is classified normal for all failure types;
      
      an inspection mode component state which is classified as pre-failure by at least one failure type; and
      
      a repair mode component state which is classified as a failure and failure diagnosis and recovery are performed.
  - 8. The system as recited in claim 1, wherein the sampling subsystem adaptively adjusts a sampling rate of each monitored component based on its state, such that a higher sampling rate is employed for an object in the pre-failure or failure state and a lower sampling rate is used for an object in a normal state.
  - 9. The system as recited in claim 1, wherein the sampling subsystem employs reservoir sampling to maintain a limited size of a training data set.
  - 10. The system as recited in claim 1, wherein the reward function incorporates failure penalty and failure management cost.
  - 11. The system as recited in claim 1, wherein the analysis subsystem builds a state classification model based on historical data.
  - 12. The system as recited in claim 11, wherein the analysis subsystem classifies a monitored component into a pre-failure state and estimates a time-to-failure based on a set of observable metrics.
  - 13. The system as recited in claim 11, wherein the analysis subsystem dynamically adapts the state classification model based on feedback.
  - 14. The system as recited in claim 11, further comprising a parallel classification model ensemble testing module employed to select an optimal prediction model to optimize a reward function.
  - 15. The system as recited in claim 11, further comprising a stream classifier configured to incrementally update classification models.
  - 16. The system as recited in claim 1, further comprising an iterative inspection module configured to take preventive actions on components where a failure alarm has been raised.
  - 17. The system as recited in claim 16, wherein the preventive actions include component isolation, component migration, and replication actions based on predicted failure types.
  - 19. The method as recited in claim 1, further comprising adaptively adjusting a sampling rate of each monitored component based on a state of the component, using a higher sampling rate for an object in a pre-failure state or failure state and using a lower sampling rate for an object in a normal state.
  - 20. The method as recited in claim 19, wherein the pre-failure state is dynamically decided based on a reward function that denotes an optimal trade-off between failure impact and prediction error cost.
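
Claim 9 recites reservoir sampling as the way to keep the training data set at a limited size. The claims do not spell out a particular variant, so the following is only a minimal sketch of the classic single-pass form (often called Algorithm R). The function name `reservoir_sample` and its parameters are illustrative assumptions, not taken from the patent.

```python
import random


def reservoir_sample(stream, capacity, rng=None):
    """Return a uniform random sample of at most `capacity` items from `stream`.

    Illustrative sketch only: the name and signature are assumptions.
    """
    rng = rng or random.Random()
    reservoir = []
    for seen, item in enumerate(stream, start=1):
        if len(reservoir) < capacity:
            # Fill phase: keep every item until the reservoir is full.
            reservoir.append(item)
        else:
            # Replacement phase: item number `seen` is kept with
            # probability capacity / seen, evicting a uniformly chosen slot.
            slot = rng.randrange(seen)
            if slot < capacity:
                reservoir[slot] = item
    return reservoir


# Example: retain 10 monitoring samples out of 1,000 observed ones.
training_set = reservoir_sample(range(1000), capacity=10, rng=random.Random(0))
```

Because each incoming item is examined once and memory use is bounded by `capacity`, this style of sampling fits a continuously running monitor that cannot store every observation.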

18. A method for proactive failure management in distributed cluster systems, comprising:
- continuously monitoring and collecting operation states of different components, which include at least one of software and hardware components;
  
  building classification models to perform on-line failure predictions for the components;
  
  taking preventive actions on failing components based on failure warnings generated by the failure prediction; and
  
  dynamically adapting state classification models based on feedback such that parallel classification models are employed to select an optimal prediction model that can optimize a reward function.
- Dependent Claims (21, 22, 23)
- - 21. The method as recited in claim 18, further comprising estimating a time-to-failure for components whose state is classified as pre-failure based on a set of observable metrics.
  - 22. The method as recited in claim 18, wherein taking preventative actions includes taking preventative actions on the components where a failure alarm has been raised using an iterative inspection, the preventative actions including at least one of isolation, migration, and replication actions based on predicted failure types.
  - 23. A computer program product for proactive failure management in distributed cluster systems comprising a computer useable storage medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to perform the steps of claim 18.
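
Claim 18 recites selecting, among parallel classification models, the one that optimizes a reward function, and claim 10 states that the reward function incorporates failure penalty and failure management cost. The claims give no formula, so the sketch below assumes a simple additive reward with hypothetical weights (`failure_penalty`, `action_cost`) and represents each model as a plain callable. It evaluates the candidates one after another for brevity rather than truly in parallel.

```python
def reward(predictions, labels, failure_penalty=10.0, action_cost=1.0):
    """Score failure warnings against what actually happened.

    Assumed reward shape (not from the patent):
      correct warning -> failure averted, minus the cost of acting
      false alarm     -> cost of an unnecessary preventive action
      missed failure  -> full failure penalty
    """
    total = 0.0
    for predicted, actual in zip(predictions, labels):
        if predicted and actual:
            total += failure_penalty - action_cost
        elif predicted and not actual:
            total -= action_cost
        elif actual:
            total -= failure_penalty
    return total


def select_best_model(models, samples, labels, **costs):
    """Return the name of the highest-reward model and all scores."""
    scores = {
        name: reward([model(x) for x in samples], labels, **costs)
        for name, model in models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical candidates: threshold classifiers over a single load metric.
candidates = {
    "threshold_0.40": lambda load: load > 0.40,
    "threshold_0.80": lambda load: load > 0.80,
    "threshold_0.92": lambda load: load > 0.92,
}
observed_load = [0.2, 0.5, 0.7, 0.9, 0.95]
failed_later = [False, False, False, True, True]

best_name, all_scores = select_best_model(candidates, observed_load, failed_later)
```

With these assumed weights, an overly eager threshold pays for false alarms and an overly cautious one pays for missed failures, which is the trade-off between failure impact and prediction error cost that the claims describe.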

24. A system for using continuous failure predictions for proactive failure management in distributed cluster systems, comprising:
- a sampling subsystem configured to continuously monitor and collect operation states of different system components wherein the sampling subsystem adaptively adjusts a sampling rate of each monitored component based on its state, such that a higher sampling rate is employed for an object in a pre-failure or failure state and a lower sampling rate is used for an object in a normal state;
  
  an analysis subsystem configured to build classification models to perform on-line failure predictions; and
  
  a failure prevention subsystem configured to take preventive actions on failing components based on failure warnings generated by the analysis subsystem.
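
Claim 24 (like claim 8) recites a sampling rate that adapts to each component's state: higher in the pre-failure or failure state, lower in the normal state. A minimal sketch of that rule follows. The `State` enum, the interval values, and the component names are hypothetical and chosen only for illustration.

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    PRE_FAILURE = "pre-failure"
    FAILURE = "failure"


# Assumed sampling intervals in seconds; a shorter interval means a higher rate.
SAMPLING_INTERVAL_S = {
    State.NORMAL: 60.0,
    State.PRE_FAILURE: 5.0,
    State.FAILURE: 5.0,
}


def sampling_interval(state):
    """Return how long to wait before sampling a component again."""
    return SAMPLING_INTERVAL_S[state]


def plan_intervals(component_states):
    """Map each monitored component to its next sampling interval."""
    return {
        name: sampling_interval(state)
        for name, state in component_states.items()
    }


# Example with hypothetical component names.
plan = plan_intervals({
    "node-a": State.NORMAL,
    "node-b": State.PRE_FAILURE,
})
```

Keeping the policy in a lookup table makes the intervals easy to tune per deployment without touching the scheduling logic.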

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Papadimitriou, Spyridon; Gu, Xiaohui; Yu, Philip Shi-lung; Chang, Shu-Ping
Primary Examiner(s)
IQBAL, NADEEM

Application Number
US11/696,795
Publication Number
US 20080250265A1
Time in Patent Office
1,153 Days
Field of Search
714/2-6, 714/15-18, 714/25, 714/27, 714/37-39, 714/47-50
US Class Current
714/47.2
CPC Class Codes
G06F 11/008   Reliability or availability...
G06F 11/0709   in a distributed system con...
G06F 11/0751   Error or fault detection no...
H04L 41/0663   Performing the actions pred...
H04L 41/147   for predicting network beha...
H04L 43/022   by sampling
H04L 43/0817   by checking functioning
