Root cause detection and monitoring for storage systems
First Claim
1. A system comprising:
one or more virtual computing device instances for transmitting storage requests via a communications network, the one or more virtual computing device instances executing on behalf of a client computing device, and the one or more virtual computing device instances hosted on a host computing device;
a storage processing service, executed on one or more storage computing devices, the storage processing service configured to:
obtain a storage command request from a virtual computing device instance of the one or more virtual computing device instances;
process the storage command request to generate a storage command processing result associated with at least one storage volume, the at least one storage volume associated with a storage computing device of the one or more storage computing devices;
collect storage command metric information based at least in part on the storage command processing result; and
a storage monitoring service, executed on one or more computing devices, configured to:
obtain, from the storage processing service, the collected storage command metric information;
process the collected storage command metric information for the at least one storage volume to generate a correlation processing result for a logical storage component representing a logical storage level of the one or more storage computing devices;
identify, across the one or more storage computing devices, a correlation relationship; and
identify, based on the identified correlation relationship, one or more faulty storage volumes among the one or more storage computing devices;
for each faulty storage volume of the one or more faulty storage volumes, correlate one or more characteristics of a first faulty storage volume with corresponding characteristics of a second faulty storage volume, wherein the one or more characteristics represent logical storage levels of a storage system;
identify a common logical storage component of the first faulty storage volume and the second faulty storage volume;
determine that a quantity of the identified one or more faulty storage volumes satisfies a threshold; and
issue a notification regarding the quantity of the identified one or more faulty storage volumes satisfying the threshold.
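As a rough illustration of the metric-collection step performed by the storage processing service, the following Python sketch tallies storage command outcomes per storage volume. The claim does not say which metrics are collected or how; the names `CommandResult`, `MetricCollector`, and `failure_rate`, and the choice of a simple success/failure count, are assumptions made only for this example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CommandResult:
    """Hypothetical storage command processing result for one volume."""
    volume_id: str
    ok: bool


@dataclass
class MetricCollector:
    """Hypothetical per-volume tally of command outcomes."""
    successes: Counter = field(default_factory=Counter)
    failures: Counter = field(default_factory=Counter)

    def record(self, result: CommandResult) -> None:
        # Increment the success or failure count for the result's volume.
        target = self.successes if result.ok else self.failures
        target[result.volume_id] += 1

    def failure_rate(self, volume_id: str) -> float:
        # Fraction of recorded commands that failed; 0.0 if none recorded.
        total = self.successes[volume_id] + self.failures[volume_id]
        return self.failures[volume_id] / total if total else 0.0


collector = MetricCollector()
for ok in (True, True, True, False):
    collector.record(CommandResult("vol-1", ok))
```

A monitoring service could then read these tallies as the "collected storage command metric information"; how a failure rate maps to a faulty volume is left open here.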
Abstract
Notification routines are described for implementation by a monitoring service. As part of an exemplary notification routine, a faulty storage volume is correlated at multiple logical storage levels of a storage system with other faulty storage volumes. The correlation pattern can follow a tree-based decision format, where each faulty storage volume is sequentially compared at a lower logical storage level. Advantageously, once a common logical storage component of a group of storage volumes is identified, a notification is issued about the group of faulty storage volumes sharing the common logical storage component. Additionally, notifications can be issued according to a severity level of the group of faulty storage volumes. In some embodiments, before issuing the notification, the group of faulty storage volumes can be compared to a time allowed for the group of faulty storage volumes to be at fault.
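The tree-based comparison described in the abstract can be sketched as follows. This is a minimal Python illustration, not the patented routine: the level names in `LEVELS` ("region", "rack", "server"), the function `deepest_shared_component`, and the rule that a component counts as "common" when at least two faulty volumes sit beneath it are all assumptions for the example.

```python
from collections import Counter
from typing import Dict, Tuple

# Hypothetical logical storage levels, ordered from highest to lowest.
LEVELS = ("region", "rack", "server")


def deepest_shared_component(
    faulty: Dict[str, Dict[str, str]]
) -> Dict[str, Tuple[str, ...]]:
    """Map each faulty volume to the lowest-level path it shares with
    at least one other faulty volume (empty tuple if it shares none)."""
    counts: Counter = Counter()
    paths: Dict[str, Tuple[str, ...]] = {}
    # Count how many faulty volumes sit beneath each prefix of the tree.
    for volume_id, attrs in faulty.items():
        path = tuple(attrs[level] for level in LEVELS)
        paths[volume_id] = path
        for depth in range(1, len(LEVELS) + 1):
            counts[path[:depth]] += 1

    result: Dict[str, Tuple[str, ...]] = {}
    for volume_id, path in paths.items():
        depth = 0
        # Step down one level at a time while the component is still shared.
        while depth < len(LEVELS) and counts[path[:depth + 1]] >= 2:
            depth += 1
        result[volume_id] = path[:depth]
    return result


example = {
    "vol-a": {"region": "r1", "rack": "k1", "server": "s1"},
    "vol-b": {"region": "r1", "rack": "k1", "server": "s1"},
    "vol-c": {"region": "r1", "rack": "k2", "server": "s9"},
    "vol-d": {"region": "r2", "rack": "k7", "server": "s3"},
}
shared = deepest_shared_component(example)
```

In this example, `vol-a` and `vol-b` resolve to the same server, `vol-c` shares only the region with them, and `vol-d` shares nothing. Volumes that resolve to the same path form a group that a notification could describe.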
22 Claims
1. A system comprising:
one or more virtual computing device instances for transmitting storage requests via a communications network, the one or more virtual computing device instances executing on behalf of a client computing device, and the one or more virtual computing device instances hosted on a host computing device;
a storage processing service, executed on one or more storage computing devices, the storage processing service configured to:
obtain a storage command request from a virtual computing device instance of the one or more virtual computing device instances;
process the storage command request to generate a storage command processing result associated with at least one storage volume, the at least one storage volume associated with a storage computing device of the one or more storage computing devices;
collect storage command metric information based at least in part on the storage command processing result; and
a storage monitoring service, executed on one or more computing devices, configured to:
obtain, from the storage processing service, the collected storage command metric information;
process the collected storage command metric information for the at least one storage volume to generate a correlation processing result for a logical storage component representing a logical storage level of the one or more storage computing devices;
identify, across the one or more storage computing devices, a correlation relationship; and
identify, based on the identified correlation relationship, one or more faulty storage volumes among the one or more storage computing devices;
for each faulty storage volume of the one or more faulty storage volumes, correlate one or more characteristics of a first faulty storage volume with corresponding characteristics of a second faulty storage volume, wherein the one or more characteristics represent logical storage levels of a storage system;
identify a common logical storage component of the first faulty storage volume and the second faulty storage volume;
determine that a quantity of the identified one or more faulty storage volumes satisfies a threshold; and
issue a notification regarding the quantity of the identified one or more faulty storage volumes satisfying the threshold.
Dependent claims: 2, 3, 4
5. A computer-implemented method for identifying a common logical storage component of storage volumes, the computer-implemented method comprising:
obtaining metric information from a storage processing service, the metric information based at least in part on a storage command request;
obtaining correlated relationship information regarding the obtained metric information, the correlated relationship information indicating a relationship between a first fault of at least one storage volume and a second fault of a logical storage component, wherein the logical storage component includes a logical storage level of two or more storage volumes including the at least one storage volume;
identifying one or more storage volumes based at least in part on the correlated relationship information, each storage volume indicating a fault;
for each storage volume of the one or more storage volumes, correlating a first characteristic of a first storage volume with a second characteristic of a second storage volume;
identifying a common logical storage component of the first storage volume and the second storage volume;
determining that a quantity of the one or more storage volumes satisfies a threshold; and
issuing a notification regarding the quantity of the one or more storage volumes satisfying the threshold.
Dependent claims: 6, 7, 8, 9, 10, 11, 12, 13, 14
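The final steps of the method (identify faulty volumes, find a common component, compare the quantity to a threshold, issue a notification) might look like the Python sketch below. Everything concrete in it is an assumption: the `VolumeMetric` record, treating any non-zero `failed_commands` count as a fault, reading "satisfies a threshold" as "greater than or equal to", and returning notifications as plain strings.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VolumeMetric:
    """Hypothetical metric record obtained from the storage processing service."""
    volume_id: str
    component: str        # logical storage component hosting the volume
    failed_commands: int  # assumed fault indicator


def build_notifications(metrics: List[VolumeMetric], threshold: int) -> List[str]:
    # Group faulty volumes by the logical storage component they share.
    faulty_by_component: Dict[str, List[str]] = defaultdict(list)
    for metric in metrics:
        if metric.failed_commands > 0:
            faulty_by_component[metric.component].append(metric.volume_id)

    notifications = []
    for component, volumes in sorted(faulty_by_component.items()):
        # Assumption: the threshold is satisfied when the count reaches it.
        if len(volumes) >= threshold:
            notifications.append(
                f"{len(volumes)} faulty volumes share component "
                f"{component}: {', '.join(sorted(volumes))}"
            )
    return notifications


metrics = [
    VolumeMetric("vol-1", "pool-a", 3),
    VolumeMetric("vol-2", "pool-a", 1),
    VolumeMetric("vol-3", "pool-b", 2),
    VolumeMetric("vol-4", "pool-a", 0),
]
notes = build_notifications(metrics, threshold=2)
```

With a threshold of 2, only `pool-a` produces a notification, since `pool-b` has a single faulty volume and `vol-4` reports no failures.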
15. A non-transitory computer-readable storage medium including computer-executable instructions comprising:
computer-executable instructions that, when executed by a computing device associated with one or more client computing devices;
obtain metric information from a storage processing service, the metric information based at least in part on a storage command request from a virtual computing device instance hosted on a host computing device, wherein the virtual computing device instance is executing the storage command request on behalf of a client computing device;
obtain correlated relationship information regarding the obtained metric information, the correlated relationship information indicating a relationship between a first fault of at least one storage volume and a second fault of a logical storage component, wherein the logical storage component includes a logical storage level of two or more storage volumes including the at least one storage volume; and
identify one or more storage volumes based at least in part on the correlated relationship information, each storage volume indicating a fault;
for each storage volume of the one or more storage volumes, correlate a first characteristic of a first storage volume with a second characteristic of a second storage volume;
identify a shared characteristic of the first storage volume and the second storage volume;
determine that a quantity of the one or more storage volumes satisfies a threshold; and
issue a notification regarding the quantity of the one or more storage volumes satisfying the threshold.
Dependent claims: 16, 17, 18, 19, 20, 21, 22
Specification