Method and apparatus for identifying problem causes in a multi-node system
Abstract
An SLO (service level objective) is represented by a model that includes nodes that represent elements in a system that are used to fulfill the SLO and information that represents dependencies between the elements. Telemetry information is received describing a condition of an element in the system. The telemetry information can be applied to a particular procedure associated with a particular node in the model to determine if there is a problem associated with the element represented by the particular node. At least a portion of the telemetry information is applied to procedures to determine problem cause information describing which elements have problems relating to the SLO. A relative contribution of elements to a problem associated with the SLO is determined by analyzing the problem cause information and the dependencies between the elements.
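The flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the names `ModelNode` and `find_causes`, the severity scale from 0 to 1, and the rule that an element's relative contribution is its share of the total severity are all hypothetical choices made for the example. The source does not say how the relative contribution is computed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A procedure maps an element's telemetry to a severity score in [0, 1];
# 0 means "not contributing". This scoring scale is an assumption of the sketch.
Procedure = Callable[[Dict[str, float]], float]


@dataclass
class ModelNode:
    """One element (service or resource) in the SLO model (hypothetical type)."""
    name: str
    procedure: Procedure
    depends_on: List[str] = field(default_factory=list)


def find_causes(model: Dict[str, ModelNode],
                root: str,
                telemetry: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Walk the dependencies from `root`, run each node's procedure, and
    return each problematic element's share of the total severity."""
    severities: Dict[str, float] = {}
    visited = set()
    stack = [root]
    while stack:
        name = stack.pop()
        if name in visited:
            continue  # guard against shared or cyclic dependencies
        visited.add(name)
        node = model[name]
        score = node.procedure(telemetry.get(name, {}))
        if score > 0:
            severities[name] = score
        stack.extend(node.depends_on)
    total = sum(severities.values())
    # Assumed weighting: contribution is proportional to severity.
    return {name: score / total for name, score in severities.items()}


# Example model: a web service that depends on a database.
model = {
    "web": ModelNode(
        "web",
        lambda t: 0.25 if t.get("latency_ms", 0) > 500 else 0.0,
        depends_on=["db"],
    ),
    "db": ModelNode(
        "db",
        lambda t: 0.75 if t.get("cpu", 0) > 0.9 else 0.0,
    ),
}
telemetry = {"web": {"latency_ms": 800}, "db": {"cpu": 0.95}}
report = find_causes(model, "web", telemetry)
print(report)
```

In this example both elements are flagged, and the database accounts for the larger share of the problem. A real system would likely use the dependency structure more heavily, for instance to separate root causes from symptoms, but that logic is not specified in the text above.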
26 Claims
1. A computer-implemented method for determining one or more causes of a problem with a service level objective (SLO) in a network management system (NMS), the method comprising:
receiving telemetry information describing a condition of at least a first element in the NMS, wherein the first element is a service or a resource used to fulfill the SLO, and wherein the first element is associated with a first node;
invoking a first procedure associated with the first element, wherein the first procedure determines whether the first element is contributing to the problem with the SLO, wherein the problem is indicated by system performance reaching a confidence level specified in the SLO;
determining that the first element has a dependency relationship with a second element in the NMS;
invoking a second procedure associated with the second element, wherein the second procedure determines whether the second element is contributing to the problem with the SLO, and wherein the second element is associated with a second node;
making a determination, using a monitoring node, of one or more causes of the problem with the SLO based at least partially upon results of the first procedure and the second procedure, wherein the problem is based upon at least one from a group consisting of the first element contributing to the problem with the SLO, and the second element contributing to the problem with the SLO, and wherein the NMS comprises the first node, the second node, and the monitoring node operatively connected; and
generating, based on the determination, a problem report that indicates one or more causes of the problem with the SLO;
wherein the method is performed by a computing device which, as a result of executing special purpose instructions, has been configured to be a special purpose device.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A computer-readable storage medium storing one or more sequences of instructions embodied therein for execution on a computer system to perform a computer-implemented method, the method comprising:
receiving telemetry information describing a condition of at least a first element in a network management system (NMS), wherein the first element is a service or a resource used to fulfill a service level objective (SLO) in the NMS, and wherein the first element is associated with a first node;
invoking a first procedure associated with the first element, wherein the first procedure determines whether the first element is contributing to a problem with the SLO, wherein the problem is indicated by system performance reaching a confidence level specified in the SLO;
determining that the first element has a dependency relationship with a second element in the NMS;
invoking a second procedure associated with the second element, wherein the second procedure determines whether the second element is contributing to the problem with the SLO, and wherein the second element is associated with a second node;
making a determination, using a monitoring node, of one or more causes of the problem with the SLO based at least partially upon results of the first procedure and the second procedure, wherein the problem is based upon at least one from a group consisting of the first element contributing to the problem with the SLO, and the second element contributing to the problem with the SLO, and wherein the NMS comprises the first node, the second node, and the monitoring node operatively connected; and
generating, based on the determination, a problem report that indicates one or more causes of the problem with the SLO;
wherein the method is performed by a computing device which, as a result of executing special purpose instructions, has been configured to be a special purpose device.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
21. A computer system, comprising:
one or more processors; and
a storage for storing one or more sequences of instructions which, when executed by the one or more processors, cause the one or more processors to perform the operations of:
receiving telemetry information describing a condition of at least a first element in a network management system (NMS), wherein the first element is a service or a resource used to fulfill a service level objective (SLO) in the NMS, and wherein the first element is associated with a first node;
invoking a first procedure associated with the first element, wherein the first procedure determines whether the first element is contributing to a problem with the SLO, wherein the problem is indicated by system performance reaching a confidence level specified in the SLO;
determining that the first element has a dependency relationship with a second element in the NMS;
invoking a second procedure associated with the second element, wherein the second procedure determines whether the second element is contributing to the problem with the SLO, and wherein the second element is associated with a second node;
making a determination, using a monitoring node, of one or more causes of the problem with the SLO based at least partially upon results of the first procedure and the second procedure, wherein the problem is based upon at least one from a group consisting of the first element contributing to the problem with the SLO, and the second element contributing to the problem with the SLO, and wherein the NMS comprises the first node, the second node, and the monitoring node operatively connected; and
generating, based on the determination, a problem report that indicates one or more causes of the problem with the SLO;
wherein the method is performed by a computing device which, as a result of executing special purpose instructions, has been configured to be a special purpose device.
Dependent claims: 22, 23, 24, 25, 26
Specification