Detecting and analyzing performance anomalies of client-server based applications
First Claim
1. A method of detecting and analyzing an anomaly in a performance of an application in a connection between client and server computers, the method comprising the steps of:
a first computer determining a time of a request from the client computer executing the application and an Internet Protocol (IP) address of the client computer, the request being sent by the client computer to the server computer via a communications network;
based on the time of the request from the client computer and the IP address of the client computer, the first computer selecting one or more log entries from a plurality of log entries so that the selected one or more log entries are relevant to the request;
the first computer determining a status code of a response from the server computer and determining that the status code is a Hypertext Transfer Protocol (HTTP) status code of 500 through 599, which indicates the server computer did not properly perform a function in response to the request from the client computer, the response being sent by the server computer to the client computer via the network and responsive to the request;
the first computer determining that the connection timed out in response to the server computer not responding to the request within a predetermined time period;
the first computer calculating values of a round trip latency time (RTT) for multiple client computers having application sessions with the server computer, the values of the RTT including a value of a RTT of the response;
the first computer dividing a space of the values of the RTT into buckets of RTT values, the buckets having a fixed size;
the first computer computing running counts and means for the values of the RTT in each bucket;
the first computer maintaining a boundary value that determines which buckets are in a lower value cluster C1 employed by a k-means clustering algorithm and which other buckets are in a higher value cluster C2 employed by the k-means clustering algorithm, wherein k=2;
the first computer determining the buckets whose RTT values include respective values of the RTT, assigning the values of the RTT to the respective buckets, re-computing the counts and means for each bucket, and balancing C1 and C2 to ensure that (i) values in C1 are closer to a mean μ₁ of C1 and (ii) values in C2 are closer to a mean μ₂ of C2;
the first computer computing μ₁ of C1, a standard deviation σ₁ of C1, μ₂ of C2, and a standard deviation σ₂ of C2;
the first computer computing a threshold value as μ₂+2σ₂ if μ₁+σ₁ ≥ μ₂ or as μ₁+2σ₁ if μ₁+σ₁ < μ₂;
the first computer determining that the value of the RTT of the response exceeds the threshold value;
based on the status code of the response being the HTTP status code of 500 through 599, the value of the RTT exceeding the threshold value, and the connection having timed out in response to the server computer not responding to the request within the predetermined time period, the first computer detecting the anomaly in the performance of the application; and
based on a temporal analysis and textual analysis of log entries associated with the anomaly, and based on an environment analysis that determines activity of the client computer, the server computer, and the network, the first computer determining candidate root causes of a failure that resulted in the anomaly, the failure being in the client computer, the server computer, the network, or a combination of the client computer, the server computer, and the network.
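The bucketing, two-cluster balancing, and threshold steps of the claim can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the bucket width `BUCKET_WIDTH_MS`, the names `Bucket`, `RttClusters`, `pooled`, and `rtt_threshold`, the use of Welford's update to keep a per-bucket sum of squared deviations, and the choice of the midpoint between the two cluster means as the boundary value are all hypothetical. Only the threshold rule itself is taken directly from the claim text.

```python
import math
from dataclasses import dataclass

BUCKET_WIDTH_MS = 10.0  # assumed fixed bucket size; the claim gives no value


@dataclass
class Bucket:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations (assumption, used for sigma)

    def add(self, rtt: float) -> None:
        # Welford's update: keeps count, mean and m2 without storing samples.
        self.count += 1
        delta = rtt - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (rtt - self.mean)


def pooled(buckets):
    """Mean and population standard deviation of all samples in the buckets."""
    n = sum(b.count for b in buckets)
    mu = sum(b.count * b.mean for b in buckets) / n
    var = sum(b.m2 + b.count * (b.mean - mu) ** 2 for b in buckets) / n
    return mu, math.sqrt(var)


def rtt_threshold(mu1, sigma1, mu2, sigma2):
    """Threshold rule as recited in the claim."""
    if mu1 + sigma1 >= mu2:
        return mu2 + 2 * sigma2
    return mu1 + 2 * sigma1


class RttClusters:
    def __init__(self):
        self.buckets = {}     # bucket index -> Bucket
        self.boundary = None  # RTT value separating C1 (<=) from C2 (>)

    def split(self):
        c1 = [b for b in self.buckets.values() if b.mean <= self.boundary]
        c2 = [b for b in self.buckets.values() if b.mean > self.boundary]
        return c1, c2

    def add(self, rtt: float) -> None:
        idx = int(rtt // BUCKET_WIDTH_MS)
        self.buckets.setdefault(idx, Bucket()).add(rtt)
        self.rebalance()

    def rebalance(self) -> None:
        # Start from the overall mean if there is no usable boundary yet.
        if self.boundary is None or not all(self.split()):
            self.boundary = pooled(list(self.buckets.values()))[0]
        for _ in range(100):  # iteration cap, an arbitrary safety limit
            c1, c2 = self.split()
            if not c1 or not c2:
                return
            # In one dimension, each bucket is nearer to mu1 or mu2
            # depending on which side of the midpoint its mean falls.
            new_boundary = (pooled(c1)[0] + pooled(c2)[0]) / 2
            if abs(new_boundary - self.boundary) < 1e-9:
                return
            self.boundary = new_boundary

    def threshold(self):
        c1, c2 = self.split()
        if not c1 or not c2:
            return None  # fewer than two clusters so far
        return rtt_threshold(*pooled(c1), *pooled(c2))
```

Feeding each observed RTT to `RttClusters.add` and then comparing the RTT of a response against `threshold()` corresponds to the "exceeds the threshold value" determination. Working on bucket summaries rather than raw samples keeps memory bounded by the number of non-empty buckets.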
Abstract
An approach is provided for detecting and analyzing an anomaly in application performance in a client-server connection via a network. A request time and an Internet Protocol (IP) address of the client are determined. Based on the request time and the IP address, log entries relevant to the request are selected. A status code of the response, a round trip latency time (RTT) of the response, and an indication of whether the connection timed out are determined. Based on the status code, the RTT, and the indication of whether the connection timed out, the anomaly is detected. Based on temporal and textual analyses of log entries associated with the anomaly and an environment analysis that determines activity of the client, server, and network, candidate root causes of a failure that resulted in the anomaly are determined.
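As an illustration of the log selection and detection steps summarized above, the following Python sketch filters log entries by client IP address and proximity to the request time, then combines the three signals. The `LogEntry` type, the function names, and the 30-second window are assumptions made for this example. The abstract does not say how the three signals are combined, so the conjunction shown here follows the wording of claim 1.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class LogEntry:  # hypothetical log record shape
    timestamp: datetime
    client_ip: str
    message: str


def select_relevant_logs(entries: List[LogEntry], request_time: datetime,
                         client_ip: str,
                         window: timedelta = timedelta(seconds=30)) -> List[LogEntry]:
    """Keep entries from the same client IP within an assumed time window."""
    return [e for e in entries
            if e.client_ip == client_ip
            and abs(e.timestamp - request_time) <= window]


def is_anomaly(status_code: int, rtt_ms: float, threshold_ms: float,
               timed_out: bool) -> bool:
    """All three conditions must hold, as in claim 1."""
    return 500 <= status_code <= 599 and rtt_ms > threshold_ms and timed_out
```

A caller would pass the entries returned by `select_relevant_logs` on to the temporal and textual analyses only when `is_anomaly` returns true.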
6 Citations
19 Claims
8. A computer program product, comprising:
a computer-readable storage device; and
a computer-readable program code stored in the computer-readable storage device, the computer-readable program code containing instructions that are executed by a central processing unit (CPU) of a computer system to implement a method of detecting and analyzing an anomaly in a performance of an application in a connection between client and server computers, the method comprising the steps of:
the computer system determining a time of a request from the client computer executing the application and an Internet Protocol (IP) address of the client computer, the request being sent by the client computer to the server computer via a communications network;
based on the time of the request from the client computer and the IP address of the client computer, the computer system selecting one or more log entries from a plurality of log entries so that the selected one or more log entries are relevant to the request;
the computer system determining a status code of a response from the server computer and determining that the status code is a Hypertext Transfer Protocol (HTTP) status code of 500 through 599, which indicates the server computer did not properly perform a function in response to the request from the client computer, the response being sent by the server computer to the client computer via the network and responsive to the request;
the computer system determining that the connection timed out in response to the server computer not responding to the request within a predetermined time period;
the computer system calculating values of a round trip latency time (RTT) for multiple client computers having application sessions with the server computer, the values of the RTT including a value of a RTT of the response;
the computer system dividing a space of the values of the RTT into buckets of RTT values, the buckets having a fixed size;
the computer system computing running counts and means for the values of the RTT in each bucket;
the computer system maintaining a boundary value that determines which buckets are in a lower value cluster C1 employed by a k-means clustering algorithm and which other buckets are in a higher value cluster C2 employed by the k-means clustering algorithm, wherein k=2;
the computer system determining the buckets whose RTT values include respective values of the RTT, assigning the values of the RTT to the respective buckets, re-computing the counts and means for each bucket, and balancing C1 and C2 to ensure that (i) values in C1 are closer to a mean μ₁ of C1 and (ii) values in C2 are closer to a mean μ₂ of C2;
the computer system computing μ₁ of C1, a standard deviation σ₁ of C1, μ₂ of C2, and a standard deviation σ₂ of C2;
the computer system computing a threshold value as μ₂+2σ₂ if μ₁+σ₁ ≥ μ₂ or as μ₁+2σ₁ if μ₁+σ₁ < μ₂;
the computer system determining that the value of the RTT of the response exceeds the threshold value;
based on the status code of the response being the HTTP status code of 500 through 599, the value of the RTT exceeding the threshold value, and the connection having timed out in response to the server computer not responding to the request within the predetermined time period, the computer system detecting the anomaly in the performance of the application; and
based on a temporal analysis and textual analysis of log entries associated with the anomaly, and based on an environment analysis that determines activity of the client computer, the server computer, and the network, the computer system determining candidate root causes of a failure that resulted in the anomaly, the failure being in the client computer, the server computer, the network, or a combination of the client computer, the server computer, and the network.
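The threshold rule recited in this claim can be written compactly as a piecewise definition. The symbol T for the threshold value is introduced here only for illustration; the means and standard deviations are those of clusters C1 and C2 as named in the claim.

```latex
T =
\begin{cases}
\mu_2 + 2\sigma_2, & \text{if } \mu_1 + \sigma_1 \ge \mu_2,\\[2pt]
\mu_1 + 2\sigma_1, & \text{if } \mu_1 + \sigma_1 < \mu_2,
\end{cases}
\qquad
\text{with the RTT of the response flagged when } \mathrm{RTT} > T .
```

Read this way, the first case applies when the two clusters overlap (the lower cluster extended by one standard deviation reaches the mean of the higher cluster), and the second case applies when they are separated.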
14. A computer system comprising:
a central processing unit (CPU);
a memory coupled to the CPU; and
a computer readable storage device coupled to the CPU, the storage device containing instructions that are executed by the CPU via the memory to implement a method of detecting and analyzing an anomaly in a performance of an application in a connection between client and server computers, the method comprising the steps of:
the computer system determining a time of a request from the client computer executing the application and an Internet Protocol (IP) address of the client computer, the request being sent by the client computer to the server computer via a communications network;
based on the time of the request from the client computer and the IP address of the client computer, the computer system selecting one or more log entries from a plurality of log entries so that the selected one or more log entries are relevant to the request;
the computer system determining a status code of a response from the server computer and determining that the status code is a Hypertext Transfer Protocol (HTTP) status code of 500 through 599, which indicates the server computer did not properly perform a function in response to the request from the client computer, the response being sent by the server computer to the client computer via the network and responsive to the request;
the computer system determining that the connection timed out in response to the server computer not responding to the request within a predetermined time period;
the computer system calculating values of a round trip latency time (RTT) for multiple client computers having application sessions with the server computer, the values of the RTT including a value of a RTT of the response;
the computer system dividing a space of the values of the RTT into buckets of RTT values, the buckets having a fixed size;
the computer system computing running counts and means for the values of the RTT in each bucket;
the computer system maintaining a boundary value that determines which buckets are in a lower value cluster C1 employed by a k-means clustering algorithm and which other buckets are in a higher value cluster C2 employed by the k-means clustering algorithm, wherein k=2;
the computer system determining the buckets whose RTT values include respective values of the RTT, assigning the values of the RTT to the respective buckets, re-computing the counts and means for each bucket, and balancing C1 and C2 to ensure that (i) values in C1 are closer to a mean μ₁ of C1 and (ii) values in C2 are closer to a mean μ₂ of C2;
the computer system computing μ₁ of C1, a standard deviation σ₁ of C1, μ₂ of C2, and a standard deviation σ₂ of C2;
the computer system computing a threshold value as μ₂+2σ₂ if μ₁+σ₁ ≥ μ₂ or as μ₁+2σ₁ if μ₁+σ₁ < μ₂;
the computer system determining that the value of the RTT of the response exceeds the threshold value;
based on the status code of the response being the HTTP status code of 500 through 599, the value of the RTT exceeding the threshold value, and the connection having timed out in response to the server computer not responding to the request within the predetermined time period, the computer system detecting the anomaly in the performance of the application; and
based on a temporal analysis and textual analysis of log entries associated with the anomaly, and based on an environment analysis that determines activity of the client computer, the server computer, and the network, the computer system determining candidate root causes of a failure that resulted in the anomaly, the failure being in the client computer, the server computer, the network, or a combination of the client computer, the server computer, and the network.
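The RTT and timeout determinations recited in the steps above could, for example, be derived from request and response timestamps. The Python sketch below is one hedged way to do so; the function name `observe`, the millisecond unit, and the convention of returning no RTT when no response was seen are assumptions for illustration, and the claim does not specify how the predetermined time period is chosen.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def observe(request_time: datetime,
            response_time: Optional[datetime],
            timeout: timedelta) -> Tuple[Optional[float], bool]:
    """Return (rtt_ms, timed_out) for a single request.

    A missing response is treated as a timeout with no RTT (assumed convention).
    """
    if response_time is None:
        return None, True
    elapsed = response_time - request_time
    # The connection is considered timed out when the server did not
    # respond within the predetermined period.
    return elapsed.total_seconds() * 1000.0, elapsed > timeout
```

The returned RTT can feed the bucket statistics, and the returned flag can serve as the timeout indication used in the detection step.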
Specification