Large-scale distributed correlation
First Claim
1. A system, comprising:
one or more computing devices; and
a local-level node, implemented on the one or more computing devices, configured to:
trigger an alert for a performance metric of an application executing on the local-level node, wherein the alert indicates anomalous behavior for the performance metric, and send the alert to a higher-level node, implemented on the one or more computing devices;
receive a distributed correlation request from the higher-level node, wherein the distributed correlation request is initiated to determine a root cause of the alert;
construct a correlation graph, the correlation graph including a root node representing the performance metric, a plurality of leaf nodes representing other performance metrics correlated with the performance metric, and a plurality of edges connecting the root node and the plurality of leaf nodes, each edge representing a dependent relationship between two performance metrics;
assign a correlation strength to each of the plurality of edges;
select one or more of the plurality of leaf nodes to be included in a correlation result based on the correlation strength assigned to each of the plurality of edges connected to the plurality of leaf nodes; and
send the correlation result to the higher-level node;
wherein the higher-level node is configured to:
select a probable cause of triggering the alert based on the performance metrics represented by the one or more leaf nodes included in the correlation result; and
present the probable cause to a user.
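The graph-building, strength-assignment, and leaf-selection steps recited above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the claim does not say how correlation strength is computed, so the sketch assumes the absolute Pearson coefficient, and the metric names, the `Edge` type, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


@dataclass
class Edge:
    root: str        # metric that raised the alert
    leaf: str        # other metric on the same node
    strength: float  # assumed measure: |Pearson r|


def build_correlation_graph(root_metric, series):
    """Connect the alerting metric (root) to every other metric (leaf)."""
    root_values = series[root_metric]
    return [
        Edge(root_metric, name, abs(pearson(root_values, values)))
        for name, values in series.items()
        if name != root_metric
    ]


def select_leaves(edges, min_strength=0.8):
    """Keep leaves whose edge strength meets the assumed threshold."""
    kept = [e for e in edges if e.strength >= min_strength]
    return sorted(kept, key=lambda e: e.strength, reverse=True)


# Hypothetical samples collected on one local-level node.
series = {
    "response_time_ms": [100, 120, 180, 260, 400],
    "db_query_time_ms": [40, 50, 90, 150, 260],
    "heap_used_mb": [512, 510, 515, 511, 513],
}
edges = build_correlation_graph("response_time_ms", series)
correlation_result = select_leaves(edges)
print([e.leaf for e in correlation_result])  # prints ['db_query_time_ms']
```

Here the database query time tracks the response time closely and is kept, while heap usage is only weakly correlated and is dropped from the correlation result.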
Abstract
Disclosed herein are system, method, and computer program product embodiments for performing distributed correlation to determine a probable cause for a performance problem detected in an application. An embodiment operates by triggering an alert for a performance metric of an application executing on a local-level node. The alert may be sent to a higher-level node. Upon receiving the alert, the higher-level node may send a distributed correlation request, used to determine a root cause of the alert, to the local-level node. Upon receiving the distributed correlation request, the local-level node may produce and send a correlation result to the higher-level node. Upon receiving the correlation result, the higher-level node may select the probable cause of triggering the alert based on the correlation result. The probable cause may then be presented to a user.
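The exchange described in the abstract (alert, distributed correlation request, correlation result, probable cause) can be pictured with a small in-process sketch. Everything here is an assumption made for illustration: the class and method names are hypothetical, the nodes call each other directly instead of over a network, the correlation strengths are precomputed placeholders, and "strongest correlated metric wins" is only one possible selection rule.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    node_id: str
    metric: str


@dataclass
class CorrelationResult:
    node_id: str
    metric: str
    correlated: dict  # leaf metric name -> correlation strength


@dataclass
class LocalNode:
    node_id: str
    strengths: dict  # placeholder for the node's own correlation step

    def trigger_alert(self, metric):
        return Alert(self.node_id, metric)

    def handle_correlation_request(self, metric, min_strength=0.8):
        # Report only the metrics whose strength meets the assumed threshold.
        kept = {m: s for m, s in self.strengths.items() if s >= min_strength}
        return CorrelationResult(self.node_id, metric, kept)


@dataclass
class HigherLevelNode:
    nodes: dict = field(default_factory=dict)

    def register(self, node):
        self.nodes[node.node_id] = node

    def on_alert(self, alert):
        # Send the distributed correlation request back to the alerting node.
        node = self.nodes[alert.node_id]
        result = node.handle_correlation_request(alert.metric)
        return self.select_probable_cause(result)

    @staticmethod
    def select_probable_cause(result):
        # Assumed rule: the most strongly correlated metric is the probable cause.
        if not result.correlated:
            return None
        return max(result.correlated, key=result.correlated.get)


local = LocalNode(
    "app-1",
    {"db_query_time_ms": 0.97, "gc_pause_ms": 0.85, "heap_used_mb": 0.24},
)
hub = HigherLevelNode()
hub.register(local)
cause = hub.on_alert(local.trigger_alert("response_time_ms"))
print(cause)  # prints db_query_time_ms
```

In a real deployment the higher-level node would typically receive results from many local-level nodes; the single-node case is shown only to keep the message order visible.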
17 Claims
1. A system, comprising:
one or more computing devices; and
a local-level node, implemented on the one or more computing devices, configured to:
trigger an alert for a performance metric of an application executing on the local-level node, wherein the alert indicates anomalous behavior for the performance metric, and send the alert to a higher-level node, implemented on the one or more computing devices;
receive a distributed correlation request from the higher-level node, wherein the distributed correlation request is initiated to determine a root cause of the alert;
construct a correlation graph, the correlation graph including a root node representing the performance metric, a plurality of leaf nodes representing other performance metrics correlated with the performance metric, and a plurality of edges connecting the root node and the plurality of leaf nodes, each edge representing a dependent relationship between two performance metrics;
assign a correlation strength to each of the plurality of edges;
select one or more of the plurality of leaf nodes to be included in a correlation result based on the correlation strength assigned to each of the plurality of edges connected to the plurality of leaf nodes; and
send the correlation result to the higher-level node;
wherein the higher-level node is configured to:
select a probable cause of triggering the alert based on the performance metrics represented by the one or more leaf nodes included in the correlation result; and
present the probable cause to a user.
Dependent claims: 2, 3, 4, 5, 6, 7, 8.
9. A method, comprising:
triggering, by a local-level node that is implemented on one or more computing devices, an alert for a performance metric of an application executing on the local-level node, wherein the alert indicates anomalous behavior for the performance metric;
sending the alert to a higher-level node that is implemented on the one or more computing devices;
receiving a distributed correlation request from the higher-level node, wherein the distributed correlation request is initiated to determine a root cause of the alert;
constructing a correlation graph, the correlation graph including a root node representing the performance metric, a plurality of leaf nodes representing other performance metrics correlated with the performance metric, and a plurality of edges connecting the root node and the plurality of leaf nodes, each edge representing a dependent relationship between two performance metrics;
assigning a correlation strength to each of the plurality of edges;
selecting one or more of the plurality of leaf nodes to be included in a correlation result based on the correlation strength assigned to each of the plurality of edges connected to the plurality of leaf nodes; and
sending the correlation result to the higher-level node, wherein the higher-level node is configured to select a probable cause of triggering the alert based on the performance metrics represented by the one or more leaf nodes included in the correlation result, and present the probable cause to a user.
Dependent claims: 10, 11, 12, 13, 14, 15, 16.
17. A non-transitory computer readable storage medium having instructions stored thereon that, in response to execution by a computing device, cause the computing device to perform operations for performing distributed correlation in order to diagnose a root cause of anomalous behavior for a performance metric, the operations comprising:
triggering, by a local-level node that is implemented on one or more computing devices, an alert for a performance metric of an application executing on the local-level node, wherein the alert indicates anomalous behavior for the performance metric;
sending the alert to a higher-level node that is implemented on the one or more computing devices;
receiving a distributed correlation request from the higher-level node, wherein the distributed correlation request is initiated to determine a root cause of the alert;
constructing a correlation graph, the correlation graph including a root node representing the performance metric, a plurality of leaf nodes representing other performance metrics correlated with the performance metric, and a plurality of edges connecting the root node and the plurality of leaf nodes, each edge representing a dependent relationship between two performance metrics;
assigning a correlation strength to each of the plurality of edges;
selecting one or more of the plurality of leaf nodes to be included in a correlation result based on the correlation strength assigned to each of the plurality of edges connected to the plurality of leaf nodes; and
sending the correlation result to the higher-level node, wherein the higher-level node is configured to select a probable cause of triggering the alert based on the performance metrics represented by the one or more leaf nodes included in the correlation result, and present the probable cause to a user.
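The claims begin with an alert that "indicates anomalous behavior" but do not say how the anomaly is detected. One common approach, shown here purely as an assumed example, is a z-score test against recent history; the function name, the threshold of 3.0, and the sample values are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it lies more than `z_threshold` sample standard
    deviations from the mean of `history` (an assumed detection rule)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return latest != history[0]  # any change from a flat baseline
    return abs(latest - mean(history)) / sigma > z_threshold


# Hypothetical response-time samples in milliseconds.
baseline = [100, 102, 98, 101, 99]
spike_detected = is_anomalous(baseline, 140)
normal_detected = is_anomalous(baseline, 101)
print(spike_detected, normal_detected)  # prints True False
```

A local-level node using a rule like this would trigger the alert when the function returns true and then forward it to the higher-level node.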
Specification