Systems and methods for predictive failure management
Abstract
A system and method for using continuous failure predictions for proactive failure management in distributed cluster systems includes a sampling subsystem configured to continuously monitor and collect operation states of different system components. An analysis subsystem is configured to build classification models to perform on-line failure predictions. A failure prevention subsystem is configured to take preventive actions on failing components based on failure warnings generated by the analysis subsystem.
112 Citations
24 Claims
1. A system for using continuous failure predictions for proactive failure management in distributed cluster systems, comprising:
a sampling subsystem configured to continuously monitor and collect operation states of different system components;
an analysis subsystem configured to build classification models to perform on-line failure predictions; and
a failure prevention subsystem configured to take preventive actions on failing components based on failure warnings generated by the analysis subsystem, wherein a pre-failure state of a component is dynamically decided based on a reward function that denotes an optimal trade-off between failure impact and prediction error cost.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20
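The reward-based choice of the pre-failure state in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the `Candidate` type, the linear reward formula, and all numeric values are assumptions made for illustration. The sketch assumes each candidate pre-failure window has already been evaluated for its true-positive and false-alarm rates, and picks the window whose reward is highest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One hypothetical way of defining the pre-failure state."""
    pre_failure_window_s: int   # seconds before a failure labeled "pre-failure"
    true_positive_rate: float   # fraction of failures predicted in time
    false_alarm_rate: float     # fraction of normal samples wrongly flagged


def reward(c: Candidate, failure_impact: float, false_alarm_cost: float) -> float:
    # Assumed linear reward: benefit of prevented failures minus the cost
    # of acting on wrong warnings. The source does not give a formula.
    return c.true_positive_rate * failure_impact - c.false_alarm_rate * false_alarm_cost


def choose_pre_failure_window(candidates, failure_impact, false_alarm_cost):
    # Pick the candidate with the highest reward under the current costs.
    return max(candidates, key=lambda c: reward(c, failure_impact, false_alarm_cost))


# Illustrative numbers only: longer windows catch more failures
# but also raise more false alarms.
candidates = [
    Candidate(30, 0.50, 0.02),
    Candidate(120, 0.80, 0.10),
    Candidate(600, 0.95, 0.40),
]

best_costly_alarms = choose_pre_failure_window(
    candidates, failure_impact=10.0, false_alarm_cost=20.0)
best_cheap_alarms = choose_pre_failure_window(
    candidates, failure_impact=10.0, false_alarm_cost=1.0)
```

With expensive false alarms the middle window is selected; when false alarms are cheap the longest window is selected. Re-evaluating the choice as the cost estimates change is one way to read "dynamically decided".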
18. A method for proactive failure management in distributed cluster systems, comprising:
continuously monitoring and collecting operation states of different components, which include at least one of software and hardware components;
building classification models to perform on-line failure predictions for the components;
taking preventive actions on failing components based on failure warnings generated by the failure prediction; and
dynamically adapting state classification models based on feedback such that parallel classification models are employed to select an optimal prediction model that can optimize a reward function.
Dependent claims: 21, 22, 23
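The last step of claim 18, in which parallel classification models compete on a reward function, can be sketched as follows. This is a simplified illustration, not the patented method: `ThresholdModel`, `ParallelSelector`, the reward and penalty values, and the single "load" metric are all hypothetical. Every model scores each observation, the ground truth arrives later as feedback, and the model with the highest accumulated reward becomes the active predictor.

```python
class ThresholdModel:
    """Toy classifier: warns when a load metric reaches a threshold."""

    def __init__(self, name: str, threshold: float):
        self.name = name
        self.threshold = threshold

    def predict(self, load: float) -> bool:
        return load >= self.threshold  # True means "failure warning"


class ParallelSelector:
    """Runs all models side by side and tracks each one's reward."""

    def __init__(self, models, hit_reward=1.0, miss_penalty=5.0,
                 false_alarm_penalty=1.0):
        self.models = list(models)
        self.hit_reward = hit_reward
        self.miss_penalty = miss_penalty
        self.false_alarm_penalty = false_alarm_penalty
        self.scores = {m.name: 0.0 for m in self.models}

    def feedback(self, load: float, failed: bool) -> None:
        # Called once the true outcome for an observation is known.
        for m in self.models:
            warned = m.predict(load)
            if warned and failed:
                self.scores[m.name] += self.hit_reward
            elif warned and not failed:
                self.scores[m.name] -= self.false_alarm_penalty
            elif failed:
                self.scores[m.name] -= self.miss_penalty

    def best(self) -> ThresholdModel:
        # The model with the highest accumulated reward is used next.
        return max(self.models, key=lambda m: self.scores[m.name])


selector = ParallelSelector([
    ThresholdModel("eager", 0.4),
    ThresholdModel("balanced", 0.7),
    ThresholdModel("lazy", 0.95),
])

# Illustrative (load, failed) observations, not real measurements.
history = [(0.2, False), (0.6, False), (0.75, True), (0.9, True), (0.5, False)]
for load, failed in history:
    selector.feedback(load, failed)

active_model = selector.best()
```

Because scores keep accumulating, the active model can change as new feedback arrives, which is the adaptation the claim describes.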
24. A system for using continuous failure predictions for proactive failure management in distributed cluster systems, comprising:
a sampling subsystem configured to continuously monitor and collect operation states of different system components wherein the sampling subsystem adaptively adjusts a sampling rate of each monitored component based on its state, such that a higher sampling rate is employed for an object in a pre-failure or failure state and a lower sampling rate is used for an object in a normal state;
an analysis subsystem configured to build classification models to perform on-line failure predictions; and
a failure prevention subsystem configured to take preventive actions on failing components based on failure warnings generated by the analysis subsystem.
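The state-dependent sampling in claim 24 can be sketched in a few lines. The state names follow the claim, but the `State` enum, the `sampling_interval_s` function, and the interval values are assumptions chosen for illustration; the source does not specify concrete rates.

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    PRE_FAILURE = "pre-failure"
    FAILURE = "failure"


# Hypothetical intervals in seconds: a shorter interval means a higher
# sampling rate.
SAMPLING_INTERVAL_S = {
    State.NORMAL: 60.0,
    State.PRE_FAILURE: 5.0,
    State.FAILURE: 5.0,
}


def sampling_interval_s(state: State) -> float:
    """Return how long to wait before sampling a component again."""
    return SAMPLING_INTERVAL_S[state]


# Example: a component drifting from normal operation toward failure.
trace = [State.NORMAL, State.NORMAL, State.PRE_FAILURE, State.FAILURE]
intervals = [sampling_interval_s(s) for s in trace]
```

A table lookup keeps the policy easy to audit; a real system might instead scale the rate continuously with the predicted failure probability.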
Specification