Detecting and analyzing performance anomalies of client-server based applications

US 10,275,301 B2
Filed: 09/29/2015
Issued: 04/30/2019
Est. Priority Date: 09/29/2015
Status: Active Grant

Abstract

An approach is provided for detecting and analyzing an anomaly in application performance in a client-server connection via a network. A request time and an Internet Protocol (IP) address of the client are determined. Based on the request time and the IP address, log entries relevant to the request are selected. A status code of a response, a round trip latency time (RTT) of the response, and an indication of whether the connection timed out are determined. Based on the status code, the RTT, and the indication of whether the connection timed out, the anomaly is detected. Based on temporal and textual analyses of log entries associated with the anomaly and an environment analysis that determines activity of the client, server, and network, candidate root causes of a failure that resulted in the anomaly are determined.
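
As a rough illustration of the detection step summarized above, the following Python sketch flags a response only when all three signals hold, which is the conjunctive form recited in claim 1. The `Observation` type, its field names, and the `is_anomaly` function are hypothetical names introduced here, and the RTT threshold is assumed to be computed elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One request/response pair (hypothetical field names)."""
    status_code: int  # HTTP status code of the response
    rtt_ms: float     # round trip latency time of the response, in ms
    timed_out: bool   # True if the server did not respond within the time limit


def is_anomaly(obs: Observation, rtt_threshold_ms: float) -> bool:
    """Return True only when all three signals point to a problem."""
    server_error = 500 <= obs.status_code <= 599   # HTTP 5xx range
    slow = obs.rtt_ms > rtt_threshold_ms           # RTT exceeds the threshold
    return server_error and slow and obs.timed_out
```

Requiring all three conditions keeps the check strict; a looser variant (for example, any two of three) would be a different rule from the one claimed.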

19 Claims

1. A method of detecting and analyzing an anomaly in a performance of an application in a connection between client and server computers, the method comprising the steps of:
- a first computer determining a time of a request from the client computer executing the application and an Internet Protocol (IP) address of the client computer, the request being sent by the client computer to the server computer via a communications network;
  
  based on the time of the request from the client computer and the IP address of the client computer, the first computer selecting one or more log entries from a plurality of log entries so that the selected one or more log entries are relevant to the request;
  
  the first computer determining a status code of a response from the server computer and determining that the status code is a Hypertext Transfer Protocol (HTTP) status code of 500 through 599, which indicates the server computer did not properly perform a function in response to the request from the client computer, the response being sent by the server computer to the client computer via the network and responsive to the request;
  
  the first computer determining that the connection timed out in response to the server computer not responding to the request within a predetermined time period;
  
  the first computer calculating values of a round trip latency time (RTT) for multiple client computers having application sessions with the server computer, the values of the RTT including a value of a RTT of the response;
  
  the first computer dividing a space of the values of the RTT into buckets of RTT values, the buckets having a fixed size;
  
  the first computer computing running counts and means for the values of the RTT in each bucket;
  
  the first computer maintaining a boundary value that determines which buckets are in a lower value cluster C₁ employed by a k-means clustering algorithm and which other buckets are in a higher value cluster C₂ employed by the k-means clustering algorithm, wherein k=2;
  
  the first computer determining the buckets whose RTT values include respective values of the RTT, assigning the values of the RTT to the respective buckets, re-computing the counts and means for each bucket, and balancing C₁ and C₂ to ensure that (i) values in C₁ are closer to a mean μ₁ of C₁ and (ii) values in C₂ are closer to a mean μ₂ of C₂;
  
  the first computer computing μ₁ of C₁, a standard deviation σ₁ of C₁, μ₂ of C₂, and a standard deviation σ₂ of C₂;
  
  the first computer computing a threshold value as μ₂ + 2σ₂ if μ₁ + σ₁ ≥ μ₂ or as μ₁ + 2σ₁ if μ₁ + σ₁ < μ₂;
  
  the first computer determining that the value of the RTT of the response exceeds the threshold value;
  
  based on the status code of the response being the HTTP status code of 500 through 599, the value of the RTT exceeding the threshold value, and the connection having timed out in response to the server computer not responding to the request within the predetermined time period, the first computer detecting the anomaly in the performance of the application; and
  
  based on a temporal analysis and textual analysis of log entries associated with the anomaly, and based on an environment analysis that determines activity of the client computer, the server computer, and the network, the first computer determining candidate root causes of a failure that resulted in the anomaly, the failure being in the client computer, the server computer, the network, or a combination of the client computer, the server computer, and the network.
  - 2. The method of claim 1, further comprising the steps of:
    - the first computer determining a period of time relevant to the anomaly;
      
      based on the period of time, the first computer selecting relevant entities from among the client computer, the server computer, and components of the communications network;
      
      based on the selected relevant entities and the period of time, the first computer selecting log entries from logs provided by the relevant entities;
      
      subsequent to the step of selecting the log entries, the first computer filtering the selected log entries based on keywords that specify anomalies;
      
      the first computer determining a usage of a central processing unit (CPU) of the server computer, a usage of a memory by the server computer, and an input/output (I/O) activity of the server computer; and
      
      based on the filtered log entries, the usage of the CPU, the usage of the memory, and the I/O activity, the first computer determining whether each of the client computer, the server computer, and the components of the communications network was active or inactive at a time of an occurrence of the anomaly, wherein the step of determining the candidate root causes is based in part on whether each of the client computer, the server computer and the components of the communications network is determined to have been active or inactive at the time of the occurrence of the anomaly.
  - 3. The method of claim 2, further comprising the steps of:
    - the first computer determining one or more components of the server computer were active at the time of the occurrence of the anomaly; and
      
      based on the filtered log entries, the usage of the CPU, the usage of the memory, and the I/O activity, the first computer determining whether the one or more components of the server computer were performing tasks relevant to the application or extraneous to the application, wherein the step of determining the candidate root causes is based in part on whether the one or more components of the server computer were performing tasks relevant to the application or extraneous to the application.
  - 4. The method of claim 1, further comprising the steps of:
    - the first computer determining confidences of the respective candidate root causes, each confidence indicating how likely the respective root cause is an actual root cause of the anomaly; and
      
      the first computer presenting the candidate root causes in an order which is based on the confidences.
  - 5. The method of claim 1, further comprising the steps of:
    - the first computer determining the anomaly specifies a type of an alert;
      
      the first computer determining a role of a user;
      
      the first computer determining an association between the type of the alert and the role of the user; and
      
      based on the association between the type of the alert and the role of the user, the first computer presenting the alert to the user, the alert notifying the user about the anomaly.
  - 6. The method of claim 5, further comprising the steps of:
    - the first computer collecting attributes of the anomaly and sending the attributes to a machine learning process, the attributes including the RTT, the indication of whether the connection timed out, a delay value of the connection, details of the server computer and the application, details about a function specified by the request, and a uniform resource locator of the server computer;
      
      the first computer receiving feedback from the user about whether the anomaly was correctly detected or incorrectly detected;
      
      the first computer utilizing the feedback as a label of the machine learning process;
      
      based on the collected attributes, the first computer generating a machine learning model for the machine learning process, the machine learning model including rules specifying subsequent anomalies;
      
      the first computer updating the machine learning model continuously or at specified time intervals; and
      
      based on the machine learning model or the updated machine learning model, the first computer detecting a subsequent anomaly in the performance of the application, wherein the subsequent anomaly is more likely to be accurately detected than the anomaly detected by the prior step of detecting the anomaly.
  - 7. The method of claim 1, further comprising the step of:
    - providing at least one support service for at least one of creating, integrating, hosting, maintaining, and deploying computer-readable program code in the computer, the program code being executed by a processor of the computer to implement the steps of determining the time of the request and the IP address of the client computer, selecting the one or more log entries, determining the status code of the response, the RTT, and the indication of whether the connection timed out, detecting the anomaly, and determining the candidate root causes of the failure that resulted in the anomaly.
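
The RTT threshold steps of claim 1 (fixed-size buckets, running per-bucket statistics, a boundary separating clusters C₁ and C₂, and the μ/σ threshold rule) can be sketched as below. This is a minimal sketch under stated assumptions, not the patented implementation: the class and function names are hypothetical, the 10 ms bucket size is arbitrary, a running sum of squares is kept per bucket so that σ can be derived, the boundary is rebalanced with a standard one-dimensional two-means midpoint update, and the fallback for an empty C₂ is an assumption.

```python
import math


def pick_threshold(mu1: float, sigma1: float, mu2: float, sigma2: float) -> float:
    """Threshold rule of claim 1: use C2 when the clusters overlap, else C1."""
    if mu1 + sigma1 >= mu2:
        return mu2 + 2 * sigma2
    return mu1 + 2 * sigma1


class BucketedRttThreshold:
    """Hypothetical helper: fixed-size RTT buckets split into two clusters."""

    def __init__(self, bucket_size_ms: float = 10.0) -> None:
        self.bucket_size = bucket_size_ms
        self.count = {}       # bucket index -> number of RTT values
        self.total = {}       # bucket index -> running sum of RTT values
        self.total_sq = {}    # bucket index -> running sum of squares (assumption)
        self.boundary = 0.0   # buckets with mean <= boundary belong to C1

    def add(self, rtt_ms: float) -> None:
        """Assign one RTT value to its bucket, update stats, rebalance."""
        b = int(rtt_ms // self.bucket_size)
        self.count[b] = self.count.get(b, 0) + 1
        self.total[b] = self.total.get(b, 0.0) + rtt_ms
        self.total_sq[b] = self.total_sq.get(b, 0.0) + rtt_ms * rtt_ms
        self._rebalance()

    def _bucket_means(self) -> dict:
        return {b: self.total[b] / self.count[b] for b in self.count}

    def _split(self):
        means = self._bucket_means()
        low = [b for b, m in means.items() if m <= self.boundary]
        high = [b for b, m in means.items() if m > self.boundary]
        return low, high

    def _stats(self, buckets):
        """Mean and population standard deviation over a set of buckets."""
        n = sum(self.count[b] for b in buckets)
        mean = sum(self.total[b] for b in buckets) / n
        var = sum(self.total_sq[b] for b in buckets) / n - mean * mean
        return mean, math.sqrt(max(var, 0.0))

    def _rebalance(self, max_iter: int = 100) -> None:
        means = self._bucket_means().values()
        lo, hi = min(means), max(means)
        if lo == hi:                      # a single bucket cannot be split yet
            self.boundary = lo
            return
        if not (lo <= self.boundary < hi):
            self.boundary = (lo + hi) / 2
        for _ in range(max_iter):
            low, high = self._split()
            mu1 = self._stats(low)[0]
            mu2 = self._stats(high)[0]
            new_boundary = (mu1 + mu2) / 2   # midpoint between cluster means
            if new_boundary == self.boundary:
                break
            self.boundary = new_boundary

    def threshold(self) -> float:
        if not self.count:
            raise ValueError("no RTT values recorded yet")
        low, high = self._split()
        mu1, sigma1 = self._stats(low)
        if not high:                      # assumption: fall back to C1 alone
            return mu1 + 2 * sigma1
        mu2, sigma2 = self._stats(high)
        return pick_threshold(mu1, sigma1, mu2, sigma2)
```

Working on buckets rather than raw samples keeps memory bounded by the number of occupied buckets, at the cost of assigning every value in a bucket to the same cluster.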

8. A computer program product, comprising:
- a computer-readable storage device; and
  
  a computer-readable program code stored in the computer-readable storage device, the computer-readable program code containing instructions that are executed by a central processing unit (CPU) of a computer system to implement a method of detecting and analyzing an anomaly in a performance of an application in a connection between client and server computers, the method comprising the steps of:
  
  the computer system determining a time of a request from the client computer executing the application and an Internet Protocol (IP) address of the client computer, the request being sent by the client computer to the server computer via a communications network;
  
  based on the time of the request from the client computer and the IP address of the client computer, the computer system selecting one or more log entries from a plurality of log entries so that the selected one or more log entries are relevant to the request;
  
  the computer system determining a status code of a response from the server computer and determining that the status code is a Hypertext Transfer Protocol (HTTP) status code of 500 through 599, which indicates the server computer did not properly perform a function in response to the request from the client computer, the response being sent by the server computer to the client computer via the network and responsive to the request;
  
  the computer system determining that the connection timed out in response to the server computer not responding to the request within a predetermined time period;
  
  the computer system calculating values of a round trip latency time (RTT) for multiple client computers having application sessions with the server computer, the values of the RTT including a value of a RTT of the response;
  
  the computer system dividing a space of the values of the RTT into buckets of RTT values, the buckets having a fixed size;
  
  the computer system computing running counts and means for the values of the RTT in each bucket;
  
  the computer system maintaining a boundary value that determines which buckets are in a lower value cluster C₁ employed by a k-means clustering algorithm and which other buckets are in a higher value cluster C₂ employed by the k-means clustering algorithm, wherein k=2;
  
  the computer system determining the buckets whose RTT values include respective values of the RTT, assigning the values of the RTT to the respective buckets, re-computing the counts and means for each bucket, and balancing C₁ and C₂ to ensure that (i) values in C₁ are closer to a mean μ₁ of C₁ and (ii) values in C₂ are closer to a mean μ₂ of C₂;
  
  the computer system computing μ₁ of C₁, a standard deviation σ₁ of C₁, μ₂ of C₂, and a standard deviation σ₂ of C₂;
  
  the computer system computing a threshold value as μ₂ + 2σ₂ if μ₁ + σ₁ ≥ μ₂ or as μ₁ + 2σ₁ if μ₁ + σ₁ < μ₂;
  
  the computer system determining that the value of the RTT of the response exceeds the threshold value;
  
  based on the status code of the response being the HTTP status code of 500 through 599, the value of the RTT exceeding the threshold value, and the connection having timed out in response to the server computer not responding to the request within the predetermined time period, the computer system detecting the anomaly in the performance of the application; and
  
  based on a temporal analysis and textual analysis of log entries associated with the anomaly, and based on an environment analysis that determines activity of the client computer, the server computer, and the network, the computer system determining candidate root causes of a failure that resulted in the anomaly, the failure being in the client computer, the server computer, the network, or a combination of the client computer, the server computer, and the network.
  - 9. The computer program product of claim 8, wherein the method further comprises the steps of:
    - the computer system determining a period of time relevant to the anomaly;
      
      based on the period of time, the computer system selecting relevant entities from among the client computer, the server computer, and components of the communications network;
      
      based on the selected relevant entities and the period of time, the computer system selecting log entries from logs provided by the relevant entities;
      
      subsequent to the step of selecting the log entries, the computer system filtering the selected log entries based on keywords that specify anomalies;
      
      the computer system determining a usage of a central processing unit (CPU) of the server computer, a usage of a memory by the server computer, and an input/output (I/O) activity of the server computer; and
      
      based on the filtered log entries, the usage of the CPU, the usage of the memory, and the I/O activity, the computer system determining whether each of the client computer, the server computer, and the components of the communications network was active or inactive at a time of an occurrence of the anomaly, wherein the step of determining the candidate root causes is based in part on whether each of the client computer, the server computer and the components of the communications network is determined to have been active or inactive at the time of the occurrence of the anomaly.
  - 10. The computer program product of claim 9, wherein the method further comprises the steps of:
    - the computer system determining one or more components of the server computer were active at the time of the occurrence of the anomaly; and
      
      based on the filtered log entries, the usage of the CPU, the usage of the memory, and the I/O activity, the computer system determining whether the one or more components of the server computer were performing tasks relevant to the application or extraneous to the application, wherein the step of determining the candidate root causes is based in part on whether the one or more components of the server computer were performing tasks relevant to the application or extraneous to the application.
  - 11. The computer program product of claim 8, wherein the method further comprises the steps of:
    - the computer system determining confidences of the respective candidate root causes, each confidence indicating how likely the respective root cause is an actual root cause of the anomaly; and
      
      the computer system presenting the candidate root causes in an order which is based on the confidences.
  - 12. The computer program product of claim 8, wherein the method further comprises the steps of:
    - the computer system determining the anomaly specifies a type of an alert;
      
      the computer system determining a role of a user;
      
      the computer system determining an association between the type of the alert and the role of the user; and
      
      based on the association between the type of the alert and the role of the user, the computer system presenting the alert to the user, the alert notifying the user about the anomaly.
  - 13. The computer program product of claim 12, wherein the method further comprises the steps of:
    - the computer system collecting attributes of the anomaly and sending the attributes to a machine learning process, the attributes including the RTT, the indication of whether the connection timed out, a delay value of the connection, details of the server computer and the application, details about a function specified by the request, and a uniform resource locator of the server computer;
      
      the computer system receiving feedback from the user about whether the anomaly was correctly detected or incorrectly detected;
      
      the computer system utilizing the feedback as a label of the machine learning process;
      
      based on the collected attributes, the computer system generating a machine learning model for the machine learning process, the machine learning model including rules specifying subsequent anomalies;
      
      the computer system updating the machine learning model continuously or at specified time intervals; and
      
      based on the machine learning model or the updated machine learning model, the computer system detecting a subsequent anomaly in the performance of the application, wherein the subsequent anomaly is more likely to be accurately detected than the anomaly detected by the prior step of detecting the anomaly.
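
The log-handling steps recited in claims 8 and 9 (selecting entries by request time and client IP address, then filtering on keywords that specify anomalies) might look like the following sketch. The `LogEntry` shape, the 30-second window, and the keyword list are illustrative assumptions and are not taken from the patent.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical parsed log record."""
    timestamp: datetime
    client_ip: str
    message: str


def select_relevant(entries, request_time, client_ip,
                    window=timedelta(seconds=30)):
    """Keep entries from the same client IP within +/- window of the request."""
    return [
        e for e in entries
        if e.client_ip == client_ip
        and abs(e.timestamp - request_time) <= window
    ]


# Assumed keyword list; a real deployment would tune this per application.
ANOMALY_KEYWORDS = ("error", "timeout", "exception")


def filter_by_keywords(entries, keywords=ANOMALY_KEYWORDS):
    """Keep entries whose message mentions any anomaly keyword (case-insensitive)."""
    return [
        e for e in entries
        if any(k in e.message.lower() for k in keywords)
    ]
```

Selecting by time and IP first and filtering by keyword second mirrors the order in claim 9, where keyword filtering happens after the log entries are selected.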

14. A computer system comprising:
- a central processing unit (CPU);
  
  a memory coupled to the CPU; and
  
  a computer readable storage device coupled to the CPU, the storage device containing instructions that are executed by the CPU via the memory to implement a method of detecting and analyzing an anomaly in a performance of an application in a connection between client and server computers, the method comprising the steps of:
  
  the computer system determining a time of a request from the client computer executing the application and an Internet Protocol (IP) address of the client computer, the request being sent by the client computer to the server computer via a communications network;
  
  based on the time of the request from the client computer and the IP address of the client computer, the computer system selecting one or more log entries from a plurality of log entries so that the selected one or more log entries are relevant to the request;
  
  the computer system determining a status code of a response from the server computer and determining that the status code is a Hypertext Transfer Protocol (HTTP) status code of 500 through 599, which indicates the server computer did not properly perform a function in response to the request from the client computer, the response being sent by the server computer to the client computer via the network and responsive to the request;
  
  the computer system determining that the connection timed out in response to the server computer not responding to the request within a predetermined time period;
  
  the computer system calculating values of a round trip latency time (RTT) for multiple client computers having application sessions with the server computer, the values of the RTT including a value of a RTT of the response;
  
  the computer system dividing a space of the values of the RTT into buckets of RTT values, the buckets having a fixed size;
  
  the computer system computing running counts and means for the values of the RTT in each bucket;
  
  the computer system maintaining a boundary value that determines which buckets are in a lower value cluster C₁ employed by a k-means clustering algorithm and which other buckets are in a higher value cluster C₂ employed by the k-means clustering algorithm, wherein k=2;
  
  the computer system determining the buckets whose RTT values include respective values of the RTT, assigning the values of the RTT to the respective buckets, re-computing the counts and means for each bucket, and balancing C₁ and C₂ to ensure that (i) values in C₁ are closer to a mean μ₁ of C₁ and (ii) values in C₂ are closer to a mean μ₂ of C₂;
  
  the computer system computing μ₁ of C₁, a standard deviation σ₁ of C₁, μ₂ of C₂, and a standard deviation σ₂ of C₂;
  
  the computer system computing a threshold value as μ₂ + 2σ₂ if μ₁ + σ₁ ≥ μ₂ or as μ₁ + 2σ₁ if μ₁ + σ₁ < μ₂;
  
  the computer system determining that the value of the RTT of the response exceeds the threshold value;
  
  based on the status code of the response being the HTTP status code of 500 through 599, the value of the RTT exceeding the threshold value, and the connection having timed out in response to the server computer not responding to the request within the predetermined time period, the computer system detecting the anomaly in the performance of the application; and
  
  based on a temporal analysis and textual analysis of log entries associated with the anomaly, and based on an environment analysis that determines activity of the client computer, the server computer, and the network, the computer system determining candidate root causes of a failure that resulted in the anomaly, the failure being in the client computer, the server computer, the network, or a combination of the client computer, the server computer, and the network.
  - 15. The computer system of claim 14, wherein the method further comprises the steps of:
    - the computer system determining a period of time relevant to the anomaly;
      
      based on the period of time, the computer system selecting relevant entities from among the client computer, the server computer, and components of the communications network;
      
      based on the selected relevant entities and the period of time, the computer system selecting log entries from logs provided by the relevant entities;
      
      subsequent to the step of selecting the log entries, the computer system filtering the selected log entries based on keywords that specify anomalies;
      
      the computer system determining a usage of a central processing unit (CPU) of the server computer, a usage of a memory by the server computer, and an input/output (I/O) activity of the server computer; and
      
      based on the filtered log entries, the usage of the CPU, the usage of the memory, and the I/O activity, the computer system determining whether each of the client computer, the server computer, and the components of the communications network was active or inactive at a time of an occurrence of the anomaly, wherein the step of determining the candidate root causes is based in part on whether each of the client computer, the server computer and the components of the communications network is determined to have been active or inactive at the time of the occurrence of the anomaly.
  - 16. The computer system of claim 15, wherein the method further comprises the steps of:
    - the computer system determining one or more components of the server computer were active at the time of the occurrence of the anomaly; and
      
      based on the filtered log entries, the usage of the CPU, the usage of the memory, and the I/O activity, the computer system determining whether the one or more components of the server computer were performing tasks relevant to the application or extraneous to the application, wherein the step of determining the candidate root causes is based in part on whether the one or more components of the server computer were performing tasks relevant to the application or extraneous to the application.
  - 17. The computer system of claim 14, wherein the method further comprises the steps of:
    - the computer system determining confidences of the respective candidate root causes, each confidence indicating how likely the respective root cause is an actual root cause of the anomaly; and
      
      the computer system presenting the candidate root causes in an order which is based on the confidences.
  - 18. The computer system of claim 14, wherein the method further comprises the steps of:
    - the computer system determining the anomaly specifies a type of an alert;
      
      the computer system determining a role of a user;
      
      the computer system determining an association between the type of the alert and the role of the user; and
      
      based on the association between the type of the alert and the role of the user, the computer system presenting the alert to the user, the alert notifying the user about the anomaly.
  - 19. The computer system of claim 18, wherein the method further comprises the steps of:
    - the computer system collecting attributes of the anomaly and sending the attributes to a machine learning process, the attributes including the RTT, the indication of whether the connection timed out, a delay value of the connection, details of the server computer and the application, details about a function specified by the request, and a uniform resource locator of the server computer;
      
      the computer system receiving feedback from the user about whether the anomaly was correctly detected or incorrectly detected;
      
      the computer system utilizing the feedback as a label of the machine learning process;
      
      based on the collected attributes, the computer system generating a machine learning model for the machine learning process, the machine learning model including rules specifying subsequent anomalies;
      
      the computer system updating the machine learning model continuously or at specified time intervals; and
      
      based on the machine learning model or the updated machine learning model, the computer system detecting a subsequent anomaly in the performance of the application, wherein the subsequent anomaly is more likely to be accurately detected than the anomaly detected by the prior step of detecting the anomaly.
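
Two of the dependent-claim steps, ordering candidate root causes by confidence (claim 17) and presenting an alert only to users whose role is associated with the alert type (claim 18), can be illustrated with the short sketch below. The `Candidate` type, the `ALERT_ROLES` mapping, the role names, and the alert type names are all hypothetical; the patent does not specify how confidences are computed or how associations are stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate root cause with an assumed 0.0-1.0 confidence."""
    cause: str
    confidence: float


def rank_candidates(candidates):
    """Order candidate root causes from most to least likely."""
    return sorted(candidates, key=lambda c: c.confidence, reverse=True)


# Assumed association between alert types and user roles.
ALERT_ROLES = {
    "network_latency": {"network_admin"},
    "server_error": {"app_owner", "sysadmin"},
}


def recipients(alert_type, users):
    """Return the names of users whose role is associated with the alert type.

    `users` maps a user name to that user's role.
    """
    roles = ALERT_ROLES.get(alert_type, set())
    return sorted(name for name, role in users.items() if role in roles)
```

An unknown alert type yields no recipients in this sketch; routing such alerts to a default role would be an equally reasonable design choice.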

Current Assignee: Kyndryl Incorporated
Original Assignee: International Business Machines Corporation
Inventors: Cherbakov, Luba; Dey, Kuntal; Mukherjea, Sougata; Rajput, Nitendra; Ramakrishna, Venkatraman
Primary Examiner(s): Starks, Wilbert L
Application Number: US14/869,129
Publication Number: US 20170091008A1
Time in Patent Office: 1,309 Days
Field of Search: 706/12

CPC Class Codes

G06F 11/0709   in a distributed system con...

G06F 11/0742   in a data processing system...

G06F 11/0748   in a remote unit communicat...

G06F 11/079   Root cause analysis, i.e. e...

G06N 20/00   Machine learning

G06N 5/04   Inference or reasoning models

H04L 41/00   Arrangements for maintenanc...

H04L 41/046   comprising network manageme...

H04L 41/064   involving time analysis

H04L 41/16   using machine learning or a...

H04L 43/0817   by checking functioning

H04L 43/0864   Round trip delays

H04L 43/16   Threshold monitoring
