Automated root cause analysis
First Claim
1. A computer-implemented method, comprising:
- under the control of one or more computer systems configured with executable instructions, for a plurality of storage volumes in a distributed computing environment, obtaining performance metric values for each of a plurality of performance metrics;
analyzing the obtained performance metrics to determine a probability model that, for each storage volume health state of a plurality of storage volume health states, models probabilities that the volume health state is caused by corresponding issue;
obtaining, for a particular storage volume, a health state for the storage volume and a set of performance metric values for the particular storage volume;
using the determined probability model to determine, based at least in part on the obtained health state and set of performance metric values for the particular storage volume, an issue potentially causing the health state of the particular storage volume; and
provide information identifying the determined issue.
Abstract
Various aspects of the performance of computing resources, such as storage volumes, are measured and used to train a probability model. The probability model is used in a query engine that is able to respond to queries about a computing resource's state. The queries may specify a state of the computing resource and provide a set of measurements of the computing resource's performance. The query engine may use the probability model, which may be in the form of a contingency table, to provide information that indicates one or more most likely causes of the state.
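As an illustration of the contingency-table form mentioned in the abstract, the following is a minimal sketch, not the patented implementation. It assumes that historical observations are already labeled with a cause, and that a single latency measurement is discretized into buckets; the thresholds, the function names (`bucket`, `train`, `most_likely_causes`), and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict


def bucket(latency_ms):
    """Discretize a raw measurement (hypothetical thresholds)."""
    if latency_ms < 10:
        return "low"
    if latency_ms < 100:
        return "medium"
    return "high"


def train(observations):
    """Build a contingency table: (state, metric bucket) -> counts per cause.

    Each observation is (state, latency_ms, cause); the cause labels are
    assumed to come from previously diagnosed incidents.
    """
    table = defaultdict(Counter)
    for state, latency_ms, cause in observations:
        table[(state, bucket(latency_ms))][cause] += 1
    return table


def most_likely_causes(table, state, latency_ms, top=2):
    """Return up to `top` (cause, relative frequency) pairs for a query."""
    counts = table.get((state, bucket(latency_ms)))
    if not counts:
        return []  # no history for this state/measurement combination
    total = sum(counts.values())
    return [(cause, n / total) for cause, n in counts.most_common(top)]


# Hypothetical training data and query.
history = [
    ("degraded", 250.0, "network_congestion"),
    ("degraded", 300.0, "network_congestion"),
    ("degraded", 180.0, "disk_failure"),
    ("degraded", 40.0, "noisy_neighbor"),
    ("healthy", 5.0, "none"),
]
table = train(history)
result = most_likely_causes(table, "degraded", 220.0)
```

Here the query for a degraded volume with a high latency reading ranks `network_congestion` ahead of `disk_failure`, because two of the three matching historical rows carry that label.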
-
25 Claims
-
1. A computer-implemented method, comprising:
- under the control of one or more computer systems configured with executable instructions, for a plurality of storage volumes in a distributed computing environment, obtaining performance metric values for each of a plurality of performance metrics;
analyzing the obtained performance metrics to determine a probability model that, for each storage volume health state of a plurality of storage volume health states, models probabilities that the volume health state is caused by corresponding issue;
obtaining, for a particular storage volume, a health state for the storage volume and a set of performance metric values for the particular storage volume;
using the determined probability model to determine, based at least in part on the obtained health state and set of performance metric values for the particular storage volume, an issue potentially causing the health state of the particular storage volume; and
provide information identifying the determined issue. - View Dependent Claims (2, 3, 4, 5, 6)
-
7. A computer-implemented method, comprising:
- under the control of one or more computer systems configured with executable instructions, obtaining a probability model for computing resources, the probability model based at least in part on measurements of multiple processes involved in operation of the computing resources, wherein the computing resources are storage volumes;
for a particular computing resource, obtaining measurements taken during a performance degradation of the particular computing resource for at least some of the multiple processes; and
using the obtained measurements for the particular computing resource and the probability model to diagnose the performance degradation. - View Dependent Claims (8, 9, 10, 11, 12, 13)
-
14. A system, comprising:
- one or more processors; and
memory including executable instructions that, when executed by the one or more processors, cause the system to implement at least:
an interface configured to receive a query that specifies measurements for a computing resource, the measurements comprising a measurement for each of a plurality of processes involved in operation of a data storage volume; and
a query engine configured to:
use the measurements to determine, based at least in part on previously obtained measurements for the processes, a response to the query, the response including a probability model generated based at least in part on the previously obtained measurements; and
provide the determined response, wherein the determined response includes information indicating one or more potential causes of a performance degradation of the data storage volume. - View Dependent Claims (15, 16, 17, 18, 19, 20)
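The interface and query engine recited in claim 14 could be sketched roughly as follows. This is an assumption-laden illustration: the claim does not name a model type, so the sketch swaps in a naive Bayes classifier over bucketed per-process measurements, trained on previously diagnosed incidents. The class name `QueryEngine`, the process names, the bucket labels, and the incident data are hypothetical.

```python
import math
from collections import Counter, defaultdict


class QueryEngine:
    """Naive Bayes over bucketed per-process measurements (an assumed model)."""

    def __init__(self, incidents):
        # incidents: list of (cause, {process: bucket}) from past degradations
        self.cause_counts = Counter()
        self.bucket_counts = defaultdict(Counter)  # (cause, process) -> buckets
        self.buckets = set()
        for cause, measurements in incidents:
            self.cause_counts[cause] += 1
            for process, b in measurements.items():
                self.bucket_counts[(cause, process)][b] += 1
                self.buckets.add(b)

    def query(self, measurements):
        """Return potential causes ranked by estimated posterior probability."""
        if not self.cause_counts:
            return []
        total = sum(self.cause_counts.values())
        k = len(self.buckets)
        scores = {}
        for cause, n in self.cause_counts.items():
            log_p = math.log(n / total)  # prior from incident frequency
            for process, b in measurements.items():
                counts = self.bucket_counts.get((cause, process), Counter())
                # Laplace smoothing so an unseen bucket does not zero out a cause
                log_p += math.log((counts[b] + 1) / (sum(counts.values()) + k))
            scores[cause] = log_p
        # Normalize the log scores into probabilities that sum to 1.
        top = max(scores.values())
        weights = {c: math.exp(s - top) for c, s in scores.items()}
        norm = sum(weights.values())
        return sorted(((c, w / norm) for c, w in weights.items()),
                      key=lambda pair: -pair[1])


# Hypothetical previously obtained measurements and a new query.
incidents = [
    ("network_congestion", {"network": "high", "disk_io": "normal"}),
    ("network_congestion", {"network": "high", "disk_io": "normal"}),
    ("disk_failure", {"network": "normal", "disk_io": "high"}),
]
engine = QueryEngine(incidents)
ranked = engine.query({"network": "high", "disk_io": "normal"})
```

The response is a ranked list of potential causes with probabilities, which corresponds to the "information indicating one or more potential causes" in the claim; any other classifier could stand in for the naive Bayes step.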
-
21. One or more non-transitory computer-readable storage media having collectively stored thereon instructions that, when executed by one or more processors of a computer system, cause the computer system to:
- obtain measurements of processes involved in the operation of a fleet of data storage volumes;
generate, based at least in part on the obtained measurements, a probability model that, for each process of at least a plurality of the processes, is usable to obtain probabilities as a function of at least a data storage volume health state and a measurement of the process;
use the generated probability model to provide information about data storage volumes; and
wherein the generated probability model includes information indicating one or more potential causes of a performance degradation of the data storage volume. - View Dependent Claims (22, 23, 24, 25)
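One possible reading of the per-process model in claim 21 is a table of relative frequencies indexed by process and health state. The sketch below takes that reading as an assumption: it estimates the probability of a measurement bucket given a process and a volume health state from fleet samples. The names `build_model` and `probability`, the process name `replication`, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(samples):
    """samples: iterable of (process, health_state, bucket) from the fleet."""
    counts = defaultdict(Counter)
    for process, state, b in samples:
        counts[(process, state)][b] += 1
    return counts


def probability(model, process, state, b):
    """Estimate P(bucket | process, health state) by relative frequency."""
    counts = model.get((process, state))
    if not counts:
        return None  # no data for this process/state pair
    return counts[b] / sum(counts.values())


# Hypothetical fleet measurements for one process.
samples = (
    [("replication", "degraded", "slow")] * 3
    + [("replication", "degraded", "normal")]
    + [("replication", "healthy", "normal")] * 4
)
model = build_model(samples)
p_slow = probability(model, "replication", "degraded", "slow")
```

With these samples, three of the four degraded observations for `replication` fall in the `slow` bucket, so `p_slow` is 0.75.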
Specification