Detecting and correcting a failure sequence in a computer system before a failure occurs
Abstract
One embodiment of the present invention provides a system that detects a failure sequence that leads to undesirable computer system behavior and that subsequently takes a corresponding remedial action. During operation, the system receives instrumentation signals from the computer system while the computer system is operating. The system then uses these instrumentation signals to determine if the computer system is in a failure sequence that is likely to lead to undesirable system behavior, such as a system crash, wherein the determination involves considering predetermined multivariate correlations between multiple instrumentation signals and a failure sequence that is likely to lead to undesirable system behavior. Next, if the computer system is in a failure sequence that is likely to lead to undesirable system behavior, the system takes a remedial action.
21 Claims
1. A method for detecting a failure sequence or other undesirable system behavior in a computer system and subsequently taking a corresponding remedial action, comprising:
receiving instrumentation signals from the computer system while the computer system is operating;
determining from the instrumentation signals if the computer system is in a failure sequence that is likely to lead to undesirable system behavior, such as a system crash, wherein determining if the computer system is in a failure sequence involves:
    determining correlations between instrumentation signals in the computer system, wherein determining the correlations involves using a non-linear, non-parametric regression technique to determine the correlations, whereby the correlations can subsequently be used to generate estimated signals,
    deriving estimated signals for a number of instrumentation signals, wherein each estimated signal is derived from correlations with other instrumentation signals, and
    comparing an actual signal with an estimated signal for a number of instrumentation signals to determine whether the computer system is in a failure sequence;
wherein the determination involves considering predetermined multivariate correlations between multiple instrumentation signals and a failure sequence that is likely to lead to undesirable system behavior; and
if the computer system is in a failure sequence that is likely to lead to undesirable system behavior, taking a remedial action.
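The three sub-steps of the determination in claim 1 (determine correlations, derive estimated signals, compare actual with estimated) can be illustrated with a minimal sketch. The claim only requires a non-linear, non-parametric regression technique and does not name one here, so this sketch assumes Nadaraya-Watson kernel regression as a stand-in. The function names, the Gaussian kernel, the `bandwidth` parameter, and the fixed `threshold` test are all illustrative assumptions, not part of the claim.

```python
import math


def estimate_signal(history, observation, target, bandwidth=1.0):
    """Estimate observation[target] from the OTHER signals only.

    Assumption: Nadaraya-Watson kernel regression over past samples of
    normal operation stands in for the claimed regression technique.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for past in history:
        # Squared distance over every signal except the one being estimated.
        dist2 = sum((p - o) ** 2
                    for i, (p, o) in enumerate(zip(past, observation))
                    if i != target)
        weight = math.exp(-dist2 / (2.0 * bandwidth ** 2))
        weighted_sum += weight * past[target]
        weight_total += weight
    if weight_total == 0.0:
        # No past sample is close enough: fall back to the plain mean.
        return sum(past[target] for past in history) / len(history)
    return weighted_sum / weight_total


def residuals(history, observation, bandwidth=1.0):
    """Actual minus estimated value for each instrumentation signal."""
    return [observation[i] - estimate_signal(history, observation, i, bandwidth)
            for i in range(len(observation))]


def in_failure_sequence(history, observation, threshold, bandwidth=1.0):
    """Hypothetical decision rule: any residual larger than the threshold."""
    return any(abs(r) > threshold
               for r in residuals(history, observation, bandwidth))


# Made-up training data: two signals with a non-linear relation y = x**2.
history = [(x / 10.0, (x / 10.0) ** 2) for x in range(31)]

print(in_failure_sequence(history, (2.0, 4.0), threshold=0.5, bandwidth=0.1))
print(in_failure_sequence(history, (2.0, 6.0), threshold=0.5, bandwidth=0.1))
```

The first observation follows the learned relation between the two signals, so its residuals stay small. The second breaks that relation, even though each value on its own lies within the range seen during training, which is the reason for comparing signals against estimates derived from other signals rather than against fixed per-signal limits.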
11. A computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for detecting a failure sequence or other undesirable system behavior in a computer system and subsequently taking a corresponding remedial action, wherein the computer-readable storage medium includes magnetic and optical storage devices, disk drives, magnetic tape, CDs (compact discs), and DVDs (digital versatile discs or digital video discs), the method comprising:
receiving instrumentation signals from the computer system while the computer system is operating;
determining from the instrumentation signals if the computer system is in a failure sequence that is likely to lead to undesirable system behavior, such as a system crash, wherein determining if the computer system is in a failure sequence involves:
    determining correlations between instrumentation signals in the computer system, wherein determining the correlations involves using a non-linear, non-parametric regression technique to determine the correlations, whereby the correlations can subsequently be used to generate estimated signals,
    deriving estimated signals for a number of instrumentation signals, wherein each estimated signal is derived from correlations with other instrumentation signals, and
    comparing an actual signal with an estimated signal for a number of instrumentation signals to determine whether the computer system is in a failure sequence;
wherein the determination involves considering predetermined multivariate correlations between multiple instrumentation signals and a failure sequence that is likely to lead to undesirable system behavior; and
if the computer system is in a failure sequence that is likely to lead to undesirable system behavior, taking a remedial action.
21. An apparatus that detects a failure sequence or other undesirable system behavior in a computer system and subsequently takes a corresponding remedial action, comprising:
a monitoring mechanism configured to monitor instrumentation signals from the computer system while the computer system is operating;
a determination mechanism configured to determine from the instrumentation signals if the computer system is in a failure sequence that is likely to lead to undesirable system behavior, such as a system crash, wherein determining if the computer system is in a failure sequence involves:
    determining correlations between instrumentation signals in the computer system, wherein determining the correlations involves using a non-linear, non-parametric regression technique to determine the correlations, whereby the correlations can subsequently be used to generate estimated signals,
    deriving estimated signals for a number of instrumentation signals, wherein each estimated signal is derived from correlations with other instrumentation signals, and
    comparing an actual signal with an estimated signal for a number of instrumentation signals to determine whether the computer system is in a failure sequence;
wherein the determination mechanism is based on multivariate correlations between multiple instrumentation signals and a failure sequence that is likely to lead to undesirable system behavior; and
a remediation mechanism that is configured to take a remedial action if the computer system is in a failure sequence that is likely to lead to undesirable system behavior.
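One way to picture how the three mechanisms of claim 21 fit together is the sketch below. It is an assumption-laden illustration, not the claimed apparatus: the `FailureSignature` record, the per-signal residual limits, the signal names (`cpu_temp_c`, `fan_rpm`), the fixed-baseline estimator, and the remedy string are all hypothetical. In particular, the baseline estimator is only a placeholder for the regression-derived estimates that the claim describes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Sample = Dict[str, float]


@dataclass
class FailureSignature:
    """Hypothetical predetermined multivariate pattern for one failure sequence."""
    name: str
    min_abs_residual: Dict[str, float]  # signal name -> residual size required
    remedy: Callable[[], str]           # remedial action to run on a match


class Apparatus:
    def __init__(self, estimator: Callable[[Sample], Sample],
                 signatures: List[FailureSignature]) -> None:
        self.estimator = estimator
        self.signatures = signatures

    def monitor(self, raw: Sample) -> Sample:
        # Monitoring mechanism: snapshot the current instrumentation signals.
        return dict(raw)

    def determine(self, sample: Sample) -> Optional[FailureSignature]:
        # Determination mechanism: compare actual with estimated signals, then
        # match the residual pattern against the predetermined signatures.
        estimated = self.estimator(sample)
        residual = {k: sample[k] - estimated[k] for k in sample}
        for sig in self.signatures:
            if all(abs(residual.get(k, 0.0)) >= limit
                   for k, limit in sig.min_abs_residual.items()):
                return sig
        return None

    def remediate(self, sig: FailureSignature) -> str:
        # Remediation mechanism: run the action tied to the matched signature.
        return sig.remedy()

    def step(self, raw: Sample) -> Optional[str]:
        sig = self.determine(self.monitor(raw))
        return self.remediate(sig) if sig is not None else None


# Placeholder estimator (assumption): always predicts nominal values.
NOMINAL = {"cpu_temp_c": 55.0, "fan_rpm": 3000.0}


def baseline_estimator(sample: Sample) -> Sample:
    return dict(NOMINAL)


overheating = FailureSignature(
    name="overheating",
    min_abs_residual={"cpu_temp_c": 15.0, "fan_rpm": 800.0},
    remedy=lambda: "migrate workload",
)
apparatus = Apparatus(baseline_estimator, [overheating])

print(apparatus.step({"cpu_temp_c": 56.0, "fan_rpm": 2950.0}))
print(apparatus.step({"cpu_temp_c": 75.0, "fan_rpm": 1900.0}))
```

In this sketch a signature matches only when every listed signal deviates together, so a high temperature with a normal fan speed does not trigger the remedy. That reflects the multivariate character of the determination, though the actual matching rule is left open by the claim.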
Specification