Root cause detection and monitoring for storage systems
First Claim
1. A system comprising:
- a host computing device configured to host one or more virtual computing device instances, the host computing device configured to transmit storage commands generated by the one or more virtual computing device instances via a communications network, the one or more virtual computing device instances executing on behalf of a client computing device;
a storage processing service, executed on one or more storage computing devices, the storage processing service configured to:
obtain a storage command request from a client computing device of the one or more virtual computing devices;
process the storage command request to generate a storage command processing result associated with at least one storage volume, the storage volume associated with the one or more storage computing devices; and
collect storage command metric information based at least in part on the storage command processing result; and
a storage monitoring service, executed on one or more computing devices, configured to:
obtain, from the storage processing service, the collected storage command metric information;
process the collected storage command metric information for the at least one storage volume;
identify a correlation relationship of a first storage volume and a second storage volume across the one or more storage computing devices, the correlation relationship further indicating a first fault of the at least one storage volume and a second fault of a logical storage component, wherein the logical storage component includes a logical storage level of the first storage volume and the second storage volume including the at least one storage volume;
identify, based at least in part on the identified correlation relationship, one or more faulty storage volumes among the one or more storage computing devices;
obtain suppression threshold information corresponding to at least one storage volume, the suppression threshold information indicating that a notification for a storage system issue is to be suppressed;
determine that one of the one or more faulty storage volumes corresponds to the suppression threshold information; and
suppress notifications regarding the one faulty storage volume.
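The monitoring steps recited above (collecting command metrics, correlating volumes through a shared logical storage component, and identifying faulty volumes) can be illustrated with a minimal sketch. This is not the patented implementation. The names `CommandResult`, `error_rates`, and `find_faults`, the 0.5 error-rate cutoff, and the rule that a component is faulty when at least two of its volumes are faulty are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CommandResult:
    """One storage command processing result (hypothetical shape)."""
    volume_id: str
    component_id: str  # logical storage component the volume belongs to
    ok: bool


def error_rates(results):
    """Return {volume_id: fraction of failed commands}."""
    totals, failures = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r.volume_id] += 1
        if not r.ok:
            failures[r.volume_id] += 1
    return {v: failures[v] / totals[v] for v in totals}


def find_faults(results, rate_threshold=0.5, min_correlated=2):
    """Identify faulty volumes and correlated component faults.

    Assumed rule: a volume is faulty when its error rate reaches
    rate_threshold; a component is faulty when at least min_correlated
    of its volumes are faulty at the same time.
    """
    rates = error_rates(results)
    component_of = {r.volume_id: r.component_id for r in results}
    faulty_volumes = {v for v, rate in rates.items() if rate >= rate_threshold}

    # Group faulty volumes by the logical component they share.
    by_component = defaultdict(set)
    for v in faulty_volumes:
        by_component[component_of[v]].add(v)

    faulty_components = {c for c, vols in by_component.items()
                         if len(vols) >= min_correlated}
    return faulty_volumes, faulty_components
```

In this sketch, two volumes failing on the same component point to a component-level fault, while a single failing volume is treated as a volume-level fault only.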
Abstract
Suppression routines are described for implementation by a monitoring service. The monitoring service uses collected data to identify faulty storage volumes. Advantageously, in some cases, the monitoring service can notify an operator of the storage system that certain storage volumes are faulty. In some embodiments, these notifications are to be suppressed because not all notifications of faulty volumes are necessary. Suppression rules can indicate that a faulty storage volume is at fault because it is a test volume, associated with a large power outage, or some other learned event from storage command metrics. The monitoring service can suppress notifications about these known system issues, among others.
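The suppression rules described in the abstract can be sketched as a list of predicates checked against each faulty volume. This is a hypothetical illustration: `FaultyVolume`, `SuppressionRule`, the `is_test_volume` and `power_outage` flags, and the function names are assumptions, not identifiers from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class FaultyVolume:
    """A volume the monitoring service has flagged (hypothetical shape)."""
    volume_id: str
    is_test_volume: bool = False
    power_outage: bool = False  # fault coincides with a known outage


@dataclass(frozen=True)
class SuppressionRule:
    """A named reason plus a predicate deciding whether it applies."""
    reason: str
    matches: Callable[[FaultyVolume], bool]


# Assumed rule set, mirroring the two examples named in the abstract.
RULES = [
    SuppressionRule("test volume", lambda v: v.is_test_volume),
    SuppressionRule("known power outage", lambda v: v.power_outage),
]


def suppression_reason(volume, rules=RULES) -> Optional[str]:
    """Return the first matching rule's reason, or None if none match."""
    for rule in rules:
        if rule.matches(volume):
            return rule.reason
    return None


def volumes_to_notify(faulty, rules=RULES):
    """Keep only the faulty volumes that no suppression rule covers."""
    return [v for v in faulty if suppression_reason(v, rules) is None]
```

Keeping each rule as a named predicate means a suppressed notification can still be logged with the reason it was withheld.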
29 Citations
20 Claims
-
1. A system comprising:
- a host computing device configured to host one or more virtual computing device instances, the host computing device configured to transmit storage commands generated by the one or more virtual computing device instances via a communications network, the one or more virtual computing device instances executing on behalf of a client computing device;
a storage processing service, executed on one or more storage computing devices, the storage processing service configured to:
obtain a storage command request from a client computing device of the one or more virtual computing devices;
process the storage command request to generate a storage command processing result associated with at least one storage volume, the storage volume associated with the one or more storage computing devices; and
collect storage command metric information based at least in part on the storage command processing result; and
a storage monitoring service, executed on one or more computing devices, configured to:
obtain, from the storage processing service, the collected storage command metric information;
process the collected storage command metric information for the at least one storage volume;
identify a correlation relationship of a first storage volume and a second storage volume across the one or more storage computing devices, the correlation relationship further indicating a first fault of the at least one storage volume and a second fault of a logical storage component, wherein the logical storage component includes a logical storage level of the first storage volume and the second storage volume including the at least one storage volume;
identify, based at least in part on the identified correlation relationship, one or more faulty storage volumes among the one or more storage computing devices;
obtain suppression threshold information corresponding to at least one storage volume, the suppression threshold information indicating that a notification for a storage system issue is to be suppressed;
determine that one of the one or more faulty storage volumes corresponds to the suppression threshold information; and
suppress notifications regarding the one faulty storage volume. - View Dependent Claims (2, 3, 4)
-
5. A computer-implemented method for managing fault notifications in a storage system comprising:
- obtaining storage command metric information from a storage processing service, the storage command metric information based at least in part on a storage command request from a virtual computing device instance hosted on a host computing device;
obtaining correlation relationship information regarding the obtained storage command metric information, the correlation relationship information indicating a relationship between a first storage volume and a second storage volume, the correlation relationship information further indicating a first fault of at least one storage volume and a second fault of a logical storage component, wherein the logical storage component includes a logical storage level of the first storage volume and the second storage volume including the at least one storage volume;
identifying one or more storage volumes based at least in part on the correlation relationship information, the one or more storage volumes indicating a fault;
obtaining suppression threshold information corresponding to at least one storage volume indicating a fault, the suppression threshold information indicating that a notification for a storage system is to be suppressed;
determining that one of the identified one or more storage volumes corresponds to the suppression threshold information; and
suppressing notifications regarding the identified storage volume. - View Dependent Claims (6, 7, 8, 9, 10, 11, 12, 13)
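The last three steps of this method (obtaining suppression threshold information, matching a faulty volume against it, and suppressing the notification) might look like the sketch below. The claim does not define what the threshold measures, so the per-volume fault count and the "suppress at or below the threshold" rule are assumptions, as is the function name `split_notifications`.

```python
def split_notifications(fault_counts, suppression_thresholds):
    """Partition faulty volumes into (notify, suppressed) lists.

    fault_counts: {volume_id: number of faults observed}
    suppression_thresholds: {volume_id: fault count at or below which
        notifications are suppressed}; volumes without an entry are
        never suppressed (an assumption of this sketch).
    """
    notify, suppressed = [], []
    # Sort by volume id so the output order is deterministic.
    for volume_id, count in sorted(fault_counts.items()):
        threshold = suppression_thresholds.get(volume_id)
        if threshold is not None and count <= threshold:
            suppressed.append(volume_id)
        else:
            notify.append(volume_id)
    return notify, suppressed
```

For example, under these assumptions a volume with 3 observed faults and a threshold of 5 is suppressed, while a volume with no threshold entry always produces a notification.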
-
14. A non-transitory computer-readable storage medium including computer-executable instructions that, when executed by a computing device, cause the computing device to:
- obtain storage command metric information from a storage processing service, the storage command metric information based at least in part on a storage command request from a virtual computing device instance hosted on a host computing device;
obtain correlation relationship information regarding the obtained storage command metric information, the correlation relationship information indicating a relationship between a first storage volume and a second storage volume, the correlation relationship information further indicating a first fault of at least one storage volume and a second fault of a logical storage component, wherein the logical storage component includes a logical storage level of the first storage volume and the second storage volume including the at least one storage volume;
identify one or more storage volumes based at least in part on the correlation relationship information, the one or more storage volumes indicating a fault;
obtain suppression threshold information corresponding to at least one storage volume, the suppression threshold information indicating that a notification for a storage system is to be suppressed;
determine that the at least one storage volume corresponds to the suppression threshold information; and
suppress notifications regarding the at least one storage volume. - View Dependent Claims (15, 16, 17, 18, 19, 20)
Specification