Identifying correlated events in a distributed system according to operational metrics
Abstract
A distributed system may implement identifying correlated events in a distributed system according to operational metrics. A distributed system may collect large numbers of operational metrics from multiple different sources. Some operational metrics may be monitored, analyzing the operational metrics for correlation with other operational metrics. The monitored operational metrics may be manually selected, or identified according to anomalous events detected for the operational metrics. Based on the monitoring, a correlated event may be detected. A response for the correlated event may be determined and performed. In some embodiments, a notification of the correlated event may be sent. Corrective actions may be performed at the distributed system, in some embodiments.
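The abstract notes that monitored metrics may be identified according to anomalous events detected for them. The following is a minimal sketch of one way such a selection step could look, assuming a simple z-score test against recent history. The z-score test, the function name `anomalous_metrics`, the metric names, and the threshold of 3.0 are all illustrative assumptions and are not taken from the patent.

```python
from statistics import mean, pstdev


def anomalous_metrics(history, latest, z_threshold=3.0):
    """Return names of metrics whose latest value deviates strongly from history.

    history: dict mapping metric name -> list of past values
    latest:  dict mapping metric name -> most recent value
    The z-score test is an assumed stand-in for whatever anomaly
    detection a real monitoring service would use.
    """
    flagged = []
    for name, values in history.items():
        if name not in latest or len(values) < 2:
            continue  # not enough data to judge this metric
        mu = mean(values)
        sigma = pstdev(values)
        if sigma == 0:
            # Flat history: treat any change at all as anomalous.
            if latest[name] != mu:
                flagged.append(name)
            continue
        if abs(latest[name] - mu) / sigma > z_threshold:
            flagged.append(name)
    return flagged


# Hypothetical metric names and values, for illustration only.
history = {
    "api_latency_ms": [100, 102, 98, 101, 99],
    "disk_queue_depth": [4, 5, 4, 5, 4],
}
latest = {"api_latency_ms": 180, "disk_queue_depth": 5}
selected = anomalous_metrics(history, latest)
```

In this sketch only the flagged metrics would be handed to the correlation analysis, which keeps the number of metric pairs to examine small.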
20 Claims
1. A system, comprising:
a plurality of computing nodes of a distributed system of a service provider that implements a plurality of network-based services of a provider network that provides the network-based services for multiple clients of the service provider, the plurality of network-based services comprising:
a monitoring service to monitor other network-based services of the plurality of network-based services, the monitoring service configured to:
collect data values for a plurality of operational metrics from the other network-based services, wherein the operational metrics indicate operation of the network-based services provided for the clients or operation of the distributed system as a whole;
evaluate at least some of the data values for the operational metrics to determine one or more measures of correlation amongst the operational metrics;
detect, based at least in part on a particular measure of correlation between two or more of the operational metrics exceeding a threshold value, a correlated event at the network services; and
perform, based on the detected correlated event, a responsive action with respect to the correlated event.
Dependent claims: 2, 3, 4

5. A method, comprising:
performing, by one or more computing devices of a service provider:
monitoring data values of a plurality of operational metrics collected from a plurality of sources in a distributed system that implements a plurality of network-based services for multiple clients of the service provider, wherein the operational metrics indicate operation of the network-based services provided for the clients or operation of the distributed system as a whole, monitoring comprising:
analyzing the data values of the operational metrics for one or more measures of correlation amongst the operational metrics;
based, at least in part, on a particular one of the one or more measures of correlation amongst the operational metrics exceeding a threshold value, detecting a correlated event at the distributed system; and
determining a responsive action to perform with respect to the correlated event.
Dependent claims: 6, 7, 8, 9, 10, 11, 12, 13

14. A non-transitory, computer-readable storage medium, storing program instructions that when executed by one or more computing devices cause the one or more computing devices to implement:
monitoring data values of a plurality of operational metrics collected from a plurality of network-based services of a distributed system of a provider network that provides the network-based services for multiple clients of the service provided, wherein the operational metrics indicate operation of the network-based services provided for the clients or operation of the distributed system as a whole, monitoring comprising:
analyzing the data values of the operational metrics for one or more measures of correlation amongst the operational metrics;
based, at least in part, on a particular one of the one or more measures of correlation amongst the operational metrics exceeding a threshold value, detecting a correlated event at the distributed system; and
determining a responsive action to perform with respect to the correlated event.
Dependent claims: 15, 16, 17, 18, 19, 20
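
The independent claims share one flow: analyze metric data values for measures of correlation, detect a correlated event when a measure exceeds a threshold, and determine a responsive action. A minimal sketch of that flow follows, assuming the Pearson coefficient as the measure of correlation and a notification as the response. The claims do not name a specific measure or action, so the choice of Pearson, the use of the absolute value, the 0.9 threshold, and every function and metric name here are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(vx * vy)


def detect_correlated_events(metrics, threshold=0.9):
    """Return (metric_a, metric_b, r) for each pair whose |r| exceeds threshold."""
    events = []
    for (a, xs), (b, ys) in combinations(sorted(metrics.items()), 2):
        r = pearson(xs, ys)
        if abs(r) > threshold:  # assumed: strong negative correlation also counts
            events.append((a, b, r))
    return events


def choose_response(event):
    """Pick a responsive action; here, only a notification message (assumed)."""
    a, b, r = event
    return f"notify: {a} and {b} are correlated (r={r:.2f})"


# Hypothetical data values collected from three services.
metrics = {
    "storage_latency_ms": [10, 12, 30, 55, 80],
    "compute_error_rate": [1, 1, 3, 6, 8],
    "network_packets_in": [500, 480, 510, 495, 505],
}
events = detect_correlated_events(metrics)
responses = [choose_response(e) for e in events]
```

With this sample data only the storage latency and compute error rate series move together closely enough to be reported; the network series stays below the threshold against both.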
Specification