Two-phase root cause analysis
Abstract
A two-phase method to perform root-cause analysis over an enterprise-specific fault model is described. In the first phase, an up-stream analysis is performed (beginning at a node generating an alarm event) to identify one or more nodes that may be in failure. In the second phase, a down-stream analysis is performed to identify those nodes in the enterprise whose operational conditions are impacted by the previously determined failed nodes. Nodes identified as failed as a result of the up-stream analysis may be reported to a user as failed. Nodes identified as impacted as a result of the down-stream analysis may be reported to a user as impacted and, beneficially, any failure alarms associated with those impacted nodes may be masked. Up-stream (phase 1) analysis is driven by inference policies associated with various nodes in the enterprise's fault model. An inference policy is a rule, or set of rules, for inferring the status or condition of a fault model node based on the status or condition of the node's immediately down-stream neighboring nodes. Similarly, down-stream (phase 2) analysis is driven by impact policies associated with various nodes in the enterprise's fault model. An impact policy is a rule, or set of rules, for assessing the impact on a fault model node based on the status or condition of the node's immediately up-stream neighboring nodes.
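The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Node` class, the helper names, the example topology (`router`, `switch_a`, `switch_b`, `server`, `app`), and both policies are assumptions chosen for illustration. The hypothetical inference policy marks a node as failed when all of its immediately down-stream neighbors are alarmed or failed; the hypothetical impact policy marks a node as impacted when any immediately up-stream neighbor is failed or impacted. The sketch assumes a tree-like fault model.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    upstream: list = field(default_factory=list)    # names of immediate up-stream neighbors
    downstream: list = field(default_factory=list)  # names of immediate down-stream neighbors
    alarmed: bool = False    # an event notification was received for this node
    failed: bool = False     # set by phase 1 (up-stream analysis)
    impacted: bool = False   # set by phase 2 (down-stream analysis)


def link(model, up, down):
    """Record that `down` depends on `up` in the fault model."""
    model[up].downstream.append(down)
    model[down].upstream.append(up)


def infer_failed(model, node):
    # Hypothetical inference policy: failed if every immediate
    # down-stream neighbor is alarmed or failed.
    kids = [model[n] for n in node.downstream]
    return bool(kids) and all(k.alarmed or k.failed for k in kids)


def assess_impacted(model, node):
    # Hypothetical impact policy: impacted if any immediate
    # up-stream neighbor is failed or impacted.
    return any(model[n].failed or model[n].impacted for n in node.upstream)


def analyze(model, first):
    # Phase 1: walk up-stream from the alarmed node, applying the inference policy.
    queue = deque(model[first].upstream)
    seen = set(queue)
    failed = []
    while queue:
        node = model[queue.popleft()]
        if infer_failed(model, node):
            node.failed = True
            failed.append(node.name)
            for up in node.upstream:
                if up not in seen:
                    seen.add(up)
                    queue.append(up)
    # A root cause is a failed node with no failed up-stream neighbor.
    roots = [n for n in failed
             if not any(model[u].failed for u in model[n].upstream)]

    # Phase 2: walk down-stream from each root cause, applying the impact policy.
    impacted = []
    visited = set(roots)
    queue = deque(d for r in roots for d in model[r].downstream)
    while queue:
        name = queue.popleft()
        if name in visited:
            continue
        visited.add(name)
        node = model[name]
        if assess_impacted(model, node):
            node.impacted = True
            impacted.append(name)
            queue.extend(node.downstream)
    return roots, impacted


# Hypothetical fault model: router -> {switch_a, switch_b}, switch_a -> server -> app
model = {n: Node(n) for n in ["router", "switch_a", "switch_b", "server", "app"]}
link(model, "router", "switch_a")
link(model, "router", "switch_b")
link(model, "switch_a", "server")
link(model, "server", "app")

model["app"].alarmed = True          # event notification for the first node
roots, impacted = analyze(model, "app")
print(roots, impacted)
```

In this example the up-stream walk stops at `router` because `switch_b` is healthy, so `switch_a` is reported as the root cause, while `server` and `app` are reported as impacted rather than as root causes.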
78 Claims
1. An enterprise fault analysis method, wherein at least a portion of the enterprise is represented by an enterprise-specific fault model having a plurality of nodes, comprising:
receiving an event notification for a first node in the fault model;
performing an up-stream analysis of the fault model beginning at the first node;
identifying a second node, the second node having a status value modified during the up-stream analysis to indicate a failed status;
performing a down-stream analysis of the fault model beginning at the second node;
identifying those nodes in a contiguous path between the second node and the first node in the fault model whose impact values indicate an impacted performance condition in accordance with the down-stream analysis;
reporting the second node as a root cause of the received event notification; and
reporting at least one of the identified nodes as impacted by the root cause of the received event notification and not as root causes of the received event notification.
Dependent claims: 2-22.
23. The method of claim 22, further comprising filtering event notifications received by at least one of the identified nodes so as to not report said event notification to a user as a root cause failure.
24. A program storage device, readable by a programmable control device, comprising instructions stored on the program storage device for causing the programmable control device to:
receive an event notification from a first node, said first node one of a plurality of nodes in an enterprise-specific fault model;
perform an up-stream analysis of the fault model beginning at the first node;
identify a second node, the second node having a status value modified during the up-stream analysis to indicate a failed status;
perform a down-stream analysis of the fault model beginning at the second node;
identify those nodes in a contiguous path between the second node and the first node in the fault model whose impact values indicate an impacted performance condition in accordance with the down-stream analysis;
report the second node as a root cause of the received event notification; and
report at least one of the identified nodes as impacted by the root cause of the received event notification and not as root causes of the received event notification.
Dependent claims: 25-46.
47. An enterprise including a plurality of operatively coupled monitored components, hereinafter referred to as nodes, comprising:
a first node adapted to generate an event notification message, said first node one of a plurality of nodes in an enterprise-specific fault model; and
a monitor agent operatively coupled to the first node and adapted to receive the event notification message, the monitor agent further adapted to:
perform an up-stream analysis of the fault model beginning at the first node;
identify a second node, the second node having a status value modified during the up-stream analysis to indicate a failed status;
perform a down-stream analysis of the fault model beginning at the second node;
identify those nodes in a contiguous path between the second node and the first node in the fault model whose impact values indicate an impacted performance condition in accordance with the down-stream analysis;
report the second node as a root cause of the received event notification; and
report at least one of the identified nodes as impacted by the root cause of the received event notification and not as root causes of the received event notification.
Dependent claims: 48-55.
56. A fault analysis method, wherein at least a portion of a system is represented by a system-specific fault model having a plurality of nodes, comprising:
receiving an event notification for a first node in the fault model;
performing an up-stream analysis of the fault model beginning at the first node;
identifying a second node, the second node having a status value modified during the up-stream analysis to indicate a failed status;
performing a down-stream analysis of the fault model beginning at the second node;
identifying those nodes in a contiguous path between the second node and the first node in the fault model whose impact values indicate an impacted performance condition in accordance with the down-stream analysis;
reporting the second node as a root cause of the received event notification; and
reporting at least one of the identified nodes as impacted by the root cause of the received event notification and not as root causes of the received event notification.
Dependent claims: 57-77.
78. The method of claim 77, further comprising filtering event notifications received by at least one of the identified nodes so as to not report said event notification to a user as a root cause failure.
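The filtering step recited in claims 23 and 78 might look like the following Python sketch. The event dictionaries, the `filter_events` helper, and the node names are hypothetical and serve only to illustrate the idea: event notifications from nodes already identified as impacted are masked instead of being reported to the user as root cause failures.

```python
def filter_events(events, root_causes, impacted):
    """Split events into those reported as root-cause failures and those masked.

    `root_causes` and `impacted` are sets of node names assumed to come from a
    prior two-phase analysis; the event format is an assumption of this sketch.
    """
    reported, masked = [], []
    for event in events:
        node = event["node"]
        if node in impacted and node not in root_causes:
            masked.append(event)      # impacted node: do not report as a root cause
        else:
            reported.append(event)
    return reported, masked


# Hypothetical event notifications and analysis results
events = [
    {"node": "switch_a", "text": "link down"},
    {"node": "server", "text": "unreachable"},
    {"node": "app", "text": "timeout"},
]
reported, masked = filter_events(events, {"switch_a"}, {"server", "app"})
print([e["node"] for e in reported], [e["node"] for e in masked])
```

Keeping the masked events in a separate list, rather than discarding them, leaves room to present them to the user as impacted conditions.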
Specification