Two-phase root cause analysis

US 20040225927A1
Filed: 04/22/2003
Published: 11/11/2004
Est. Priority Date: 04/22/2003
Status: Active Grant

Abstract

A two-phase method to perform root-cause analysis over an enterprise-specific fault model is described. In the first phase, an up-stream analysis is performed (beginning at a node generating an alarm event) to identify one or more nodes that may be in failure. In the second phase, a down-stream analysis is performed to identify those nodes in the enterprise whose operational conditions are impacted by the previously determined failed nodes. Nodes identified as failed as a result of the up-stream analysis may be reported to a user as failed. Nodes identified as impacted as a result of the down-stream analysis may be reported to a user as impacted and, beneficially, any failure alarms associated with those impacted nodes may be masked. Up-stream (phase 1) analysis is driven by inference policies associated with various nodes in the enterprise's fault model. An inference policy is a rule, or set of rules, for inferring the status or condition of a fault model node based on the status or condition of the node's immediately down-stream neighboring nodes. Similarly, down-stream (phase 2) analysis is driven by impact policies associated with various nodes in the enterprise's fault model. An impact policy is a rule, or set of rules, for assessing the impact on a fault model node based on the status or condition of the node's immediately up-stream neighboring nodes.
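
The two phases described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and does not come from the patent: the node names, the `DOWNSTREAM` graph, the Boolean status and impact values, and the two toy policies (a node is inferred failed when all of its immediately down-stream neighbours are failed; a node is impacted when any immediately up-stream neighbour is failed or impacted).

```python
from collections import deque

# Hypothetical fault model: each key is a node, each value lists the
# nodes immediately down-stream of it.
DOWNSTREAM = {
    "router": ["switch", "vpn"],
    "switch": ["db", "web"],
    "vpn": [],
    "db": [],
    "web": [],
}
# Reverse mapping: the nodes immediately up-stream of each node.
UPSTREAM = {n: [u for u, ds in DOWNSTREAM.items() if n in ds] for n in DOWNSTREAM}


def infer_failed(node, status):
    """Assumed inference policy: failed if every down-stream neighbour is failed."""
    children = DOWNSTREAM[node]
    return bool(children) and all(status[c] for c in children)


def is_impacted(node, status, impact):
    """Assumed impact policy: impacted if any up-stream neighbour is failed or impacted."""
    return any(status[p] or impact[p] for p in UPSTREAM[node])


def analyze(alarms):
    """Return (root_causes, impacted_nodes) for a set of alarming nodes."""
    status = {n: n in alarms for n in DOWNSTREAM}  # True means failed
    impact = {n: False for n in DOWNSTREAM}

    # Phase 1: walk up-stream from the alarming nodes, evaluating inference policies.
    queue = deque(p for a in alarms for p in UPSTREAM[a])
    while queue:
        node = queue.popleft()
        if not status[node] and infer_failed(node, status):
            status[node] = True
            queue.extend(UPSTREAM[node])

    # Root causes: failed nodes with no failed up-stream neighbour.
    roots = [n for n in DOWNSTREAM
             if status[n] and not any(status[p] for p in UPSTREAM[n])]

    # Phase 2: walk down-stream from the root causes, evaluating impact policies.
    queue = deque(c for r in roots for c in DOWNSTREAM[r])
    while queue:
        node = queue.popleft()
        if not impact[node] and is_impacted(node, status, impact):
            impact[node] = True
            queue.extend(DOWNSTREAM[node])

    return sorted(roots), sorted(n for n in DOWNSTREAM if impact[n])
```

Under these assumed policies, alarms on both `db` and `web` lead the sketch to report `switch` as the root cause and `db` and `web` as impacted, while `router` stays healthy because `vpn` has not failed.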

78 Claims

1. An enterprise fault analysis method, wherein at least a portion of the enterprise is represented by an enterprise-specific fault model having a plurality of nodes, comprising:
- receiving an event notification for a first node in the fault model;
  
  performing an up-stream analysis of the fault model beginning at the first node;
  
  identifying a second node, the second node having a status value modified during the up-stream analysis to indicate a failed status;
  
  performing a down-stream analysis of the fault model beginning at the second node;
  
  identifying those nodes in a contiguous path between the second node and the first node in the fault model whose impact values indicate an impacted performance condition in accordance with the down-stream analysis;
  
  reporting the second node as a root cause of the received event notification; and
  
  reporting at least one of the identified nodes as impacted by the root cause of the received event notification and not as root causes of the received event notification.
- - 2. The method of claim 1, wherein the enterprise-specific fault model comprises an Impact Graph.
  - 3. The method of claim 1, wherein the act of performing an up-stream analysis comprises:
    - evaluating an inference policy associated with the first node and setting a status value associated with the first node in accordance therewith; and
      
      evaluating inference policies associated with up-stream nodes to the first node and setting a status value associated with each evaluated up-stream node in accordance therewith.
  - 4. The method of claim 3, wherein the act of evaluating inference policies is terminated when no up-stream nodes from the last evaluated node exist.
  - 5. The method of claim 4, wherein the act of evaluating inference policies is further terminated when a status value associated with a node does not change based on evaluation of an inference policy associated with the node.
  - 6. The method of claim 4, wherein the act of evaluating inference policies is further terminated when a status value associated with a node is a measured status value.
  - 7. The method of claim 1, wherein the act of identifying a second node further comprises identifying one or more nodes that are most up-stream from the first node.
  - 8. The method of claim 7 further comprising identifying, as the second node, an arbitrary one of the one or more identified nodes.
  - 9. The method of claim 3, wherein the status value associated with a node comprises a Boolean value.
  - 10. The method of claim 3, wherein the status value associated with a node comprises a real-number value.
  - 11. The method of claim 3, wherein a status value associated with a node further has one or more associated attributes.
  - 12. The method of claim 11, wherein one of the one or more associated attributes comprises a temporal attribute.
  - 13. The method of claim 11, wherein one of the one or more associated attributes comprises an indication to identify the status value as being a measured value or an inferred value.
  - 14. The method of claim 1, wherein the act of performing a down-stream analysis comprises:
    - evaluating an impact policy associated with the second node and setting an impact value associated with the second node in accordance therewith; and
      
      evaluating impact policies associated with down-stream nodes to the second node and setting an impact value associated with each evaluated down-stream node in accordance therewith.
  - 15. The method of claim 14, wherein the act of evaluating impact policies is terminated when no down-stream nodes from the last evaluated node exist.
  - 16. The method of claim 15, wherein the act of evaluating impact policies is further terminated when an impact value associated with a node does not change based on evaluation of an impact policy associated with the node.
  - 17. The method of claim 14, wherein the impact value associated with a node comprises a Boolean value.
  - 18. The method of claim 14, wherein the impact value associated with a node comprises a real-number value.
  - 19. The method of claim 14, wherein an impact value associated with a node further has one or more associated attributes.
  - 20. The method of claim 19, wherein one of the one or more associated attributes comprises a temporal attribute.
  - 21. The method of claim 1, wherein the act of reporting the second node as a root cause comprises visually displaying an alarm condition for said second node to a user.
  - 22. The method of claim 1, wherein the act of reporting at least some of the identified nodes as impacted by the root cause comprises visually identifying the at least one of the identified nodes differently from the second node.

23. The method of claim 22, further comprising filtering event notifications received by at least one of the identified nodes so as to not report said event notification to a user as a root cause failure.
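
Claims 4 to 6 and 11 to 13 describe when up-stream evaluation stops and which attributes a status value may carry. A minimal sketch of one way to model that follows; the `Status` class, its field names, and the `should_stop` helper are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Status:
    """Hypothetical status value with the attributes named in claims 12 and 13."""
    failed: bool       # Boolean status value (claim 9)
    measured: bool     # True if measured, False if inferred (claim 13)
    as_of: datetime    # temporal attribute (claim 12)


def should_stop(current: Status, proposed_failed: bool, has_upstream: bool) -> bool:
    """Return True when up-stream evaluation should not continue past this node."""
    if not has_upstream:
        return True    # no up-stream nodes remain (claim 4)
    if current.measured:
        return True    # a measured value is not overridden by inference (claim 6)
    # stop when the inference policy leaves the status unchanged (claim 5)
    return current.failed == proposed_failed


now = datetime.now(timezone.utc)
inferred_ok = Status(failed=False, measured=False, as_of=now)
measured_ok = Status(failed=False, measured=True, as_of=now)
```

In this sketch, `should_stop(inferred_ok, True, True)` is `False`, so analysis would continue up-stream, whereas the same call with `measured_ok` returns `True`.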

24. A program storage device, readable by a programmable control device, comprising instructions stored on the program storage device for causing the programmable control device to:
- receive an event notification from a first node, said first node one of a plurality of nodes in an enterprise-specific fault model;
  
  perform an up-stream analysis of the fault model beginning at the first node;
  
  identify a second node, the second node having a status value modified during the up-stream analysis to indicate a failed status;
  
  perform a down-stream analysis of the fault model beginning at the second node;
  
  identify those nodes in a contiguous path between the second node and the first node in the fault model whose impact values indicate an impacted performance condition in accordance with the down-stream analysis;
  
  report the second node as a root cause of the received event notification; and
  
  report at least one of the identified nodes as impacted by the root cause of the received event notification and not as root causes of the received event notification.
- - 25. The program storage device of claim 24, wherein the enterprise-specific fault model comprises an Impact Graph.
  - 26. The program storage device of claim 24, wherein the instructions to perform an up-stream analysis comprise instructions to:
    - evaluate an inference policy associated with the first node and set a status value associated with the first node in accordance therewith; and
      
      evaluate inference policies associated with up-stream nodes to the first node and set a status value associated with each evaluated up-stream node in accordance therewith.
  - 27. The program storage device of claim 26, wherein the instructions to evaluate inference policies stop evaluating up-stream nodes when no up-stream node from the last evaluated node exists.
  - 28. The program storage device of claim 27, wherein the instructions to evaluate inference policies stop evaluating up-stream nodes when a status value associated with a node does not change based on evaluation of an inference policy associated with the node.
  - 29. The program storage device of claim 27, wherein the instructions to evaluate inference policies stop evaluating up-stream nodes when a status value associated with a node is a measured status value.
  - 30. The program storage device of claim 24, wherein the instructions to identify a second node further comprise instructions to identify one or more nodes that are most up-stream from the first node.
  - 31. The program storage device of claim 30 further comprising instructions to identify, as the second node, an arbitrary one of the one or more identified nodes.
  - 32. The program storage device of claim 26, wherein the status value associated with a node comprises a Boolean value.
  - 33. The program storage device of claim 26, wherein the status value associated with a node comprises a real-number value.
  - 34. The program storage device of claim 26, wherein a status value associated with a node further has one or more associated attributes.
  - 35. The program storage device of claim 34, wherein one of the one or more associated attributes comprises a temporal attribute.
  - 36. The program storage device of claim 34, wherein one of the one or more associated attributes comprises an indication to identify the status value as being a measured value or an inferred value.
  - 37. The program storage device of claim 24, wherein the instructions to perform a down-stream analysis comprise instructions to:
    - evaluate an impact policy associated with the second node and set an impact value associated with the second node in accordance therewith; and
      
      evaluate impact policies associated with down-stream nodes to the second node and set an impact value associated with each evaluated down-stream node in accordance therewith.
  - 38. The program storage device of claim 37, wherein the instructions to evaluate impact policies stop evaluating when no down-stream nodes from the last evaluated node exist.
  - 39. The program storage device of claim 38, wherein the instructions to evaluate impact policies further stop evaluating when an impact value associated with a node does not change based on evaluation of an impact policy associated with the node.
  - 40. The program storage device of claim 37, wherein the impact value associated with a node comprises a Boolean value.
  - 41. The program storage device of claim 37, wherein the impact value associated with a node comprises a real-number value.
  - 42. The program storage device of claim 37, wherein an impact value associated with a node further has one or more associated attributes.
  - 43. The program storage device of claim 42, wherein one of the one or more associated attributes comprises a temporal attribute.
  - 44. The program storage device of claim 24, wherein the instructions to report the second node as a root cause of the received event notification comprise instructions to visually display an alarm condition for said second node to a user.
  - 45. The program storage device of claim 24, wherein the instructions to report at least some of the identified nodes as impacted by the root cause of the received event notification comprise instructions to visually identify the at least one of the identified nodes differently from the second node.
  - 46. The program storage device of claim 45, further comprising instructions to filter event notifications received by at least one of the identified nodes so as to not report said event notification to a user as a root cause failure.
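
Claims 37 to 41 allow real-number impact values and stop down-stream evaluation when a value no longer changes. The sketch below illustrates that idea under assumptions that are not in the patent: the fault model is acyclic, the function name `propagate_impact` and the `attenuation` parameter are hypothetical, and the impact policy is "take the largest up-stream impact and scale it by a fixed factor".

```python
from collections import deque


def propagate_impact(downstream, root, attenuation=0.5):
    """Propagate a real-number impact value down-stream from `root`.

    `downstream` maps each node to its immediately down-stream nodes and is
    assumed to be acyclic.
    """
    # Build the reverse mapping so each node can look at its up-stream neighbours.
    upstream = {n: [] for n in downstream}
    for parent, children in downstream.items():
        for child in children:
            upstream[child].append(parent)

    impact = {n: 0.0 for n in downstream}
    impact[root] = 1.0  # the root cause is treated as fully impacted

    queue = deque(downstream[root])
    while queue:
        node = queue.popleft()
        # Assumed impact policy: strongest up-stream impact, attenuated.
        new_value = max(impact[p] for p in upstream[node]) * attenuation
        if new_value == impact[node]:
            continue  # value unchanged, so stop along this path (claim 39)
        impact[node] = new_value
        queue.extend(downstream[node])  # nothing is queued past a leaf (claim 38)
    return impact


model = {"switch": ["web", "db"], "web": ["portal"], "db": ["portal"], "portal": []}
result = propagate_impact(model, "switch")
```

Here `portal` is reached twice, once through `web` and once through `db`; the second evaluation leaves its value unchanged, so propagation ends there.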

47. An enterprise including a plurality of operatively coupled monitored components, hereinafter referred to as nodes, comprising:
- a first node adapted to generate an event notification message, said first node one of a plurality of nodes in an enterprise-specific fault model; and
  
  a monitor agent operatively coupled to the first node and adapted to receive the event notification message, the monitor agent further adapted to:
  
  perform an up-stream analysis of the fault model beginning at the first node;
  
  identify a second node, the second node having a status value modified during the up-stream analysis to indicate a failed status;
  
  perform a down-stream analysis of the fault model beginning at the second node;
  
  identify those nodes in a contiguous path between the second node and the first node in the fault model whose impact values indicate an impacted performance condition in accordance with the down-stream analysis;
  
  report the second node as a root cause of the received event notification; and
  
  report at least one of the identified nodes as impacted by the root cause of the received event notification and not as root causes of the received event notification.
- - 48. The enterprise of claim 47, wherein operatively coupled monitored components comprise software applications executing on a computer system.
  - 49. The enterprise of claim 47, wherein operatively coupled monitored components comprise hardware devices for facilitating communication between one or more of the operatively coupled monitored components.
  - 50. The enterprise of claim 47, wherein the enterprise-specific fault model comprises an Impact Graph.
  - 51. The enterprise of claim 47, wherein the monitor agent is further adapted to, during said up-stream analysis:
    - evaluate an inference policy associated with the first node and set a status value associated with the first node in accordance therewith; and
      
      evaluate inference policies associated with up-stream nodes to the first node and set a status value associated with each evaluated up-stream node in accordance therewith.
  - 52. The enterprise of claim 47, wherein the monitor agent is further adapted to, during said down-stream analysis:
    - evaluate an impact policy associated with the second node and set an impact value associated with the second node in accordance therewith; and
      
      evaluate impact policies associated with down-stream nodes to the second node and set an impact value associated with each evaluated down-stream node in accordance therewith.
  - 53. The enterprise of claim 47, wherein the monitor agent is further adapted to report the second node as a root cause of the received event notification by visually displaying an alarm condition for said second node to a user.
  - 54. The enterprise of claim 47, wherein the monitor agent is further adapted to report at least some of the identified nodes as impacted by the root cause of the received event notification by visually identifying the at least one of the identified nodes differently from the second node.
  - 55. The enterprise of claim 54, wherein the monitor agent is further adapted to filter event notifications received by at least one of the identified nodes so as to not report said event notification to a user as a root cause failure.
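
The event filtering in claim 55 can be sketched as a simple partition of incoming notifications. The event shape (a dict with `node` and `message` keys) and the function name `filter_events` are assumptions for illustration only.

```python
def filter_events(events, impacted):
    """Split events into those reported as root-cause failures and those masked.

    `events` is a list of dicts with a "node" key; `impacted` is the set of
    nodes that the down-stream analysis marked as impacted.
    """
    reported, masked = [], []
    for event in events:
        if event["node"] in impacted:
            masked.append(event)    # impacted node: do not report as a root cause
        else:
            reported.append(event)
    return reported, masked


events = [
    {"node": "switch", "message": "link down"},
    {"node": "web", "message": "timeout"},
    {"node": "db", "message": "unreachable"},
]
reported, masked = filter_events(events, impacted={"web", "db"})
```

With these inputs only the `switch` event remains in `reported`; the `web` and `db` events end up in `masked`, which a user interface could still display as impacted rather than failed.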

56. A fault analysis method, wherein at least a portion of a system is represented by a system-specific fault model having a plurality of nodes, comprising:
- receiving an event notification for a first node in the fault model;
  
  performing an up-stream analysis of the fault model beginning at the first node;
  
  identifying a second node, the second node having a status value modified during the up-stream analysis to indicate a failed status;
  
  performing a down-stream analysis of the fault model beginning at the second node;
  
  identifying those nodes in a contiguous path between the second node and the first node in the fault model whose impact values indicate an impacted performance condition in accordance with the down-stream analysis;
  
  reporting the second node as a root cause of the received event notification; and
  
  reporting at least one of the identified nodes as impacted by the root cause of the received event notification and not as root causes of the received event notification.
- - 57. The method of claim 56, wherein the system-specific fault model comprises an Impact Graph.
  - 58. The method of claim 56, wherein the act of performing an up-stream analysis comprises:
    - evaluating an inference policy associated with the first node and setting a status value associated with the first node in accordance therewith; and
      
      evaluating inference policies associated with up-stream nodes to the first node and setting a status value associated with each evaluated up-stream node in accordance therewith.
  - 59. The method of claim 58, wherein the act of evaluating inference policies is terminated when no up-stream nodes from the last evaluated node exist.
  - 60. The method of claim 59, wherein the act of evaluating inference policies is further terminated when a status value associated with a node does not change based on evaluation of an inference policy associated with the node.
  - 61. The method of claim 59, wherein the act of evaluating inference policies is further terminated when a status value associated with a node is a measured status value.
  - 62. The method of claim 56, wherein the act of identifying a second node further comprises identifying one or more nodes that are most up-stream from the first node.
  - 63. The method of claim 62 further comprising identifying, as the second node, an arbitrary one of the one or more identified nodes.
  - 64. The method of claim 58, wherein the status value associated with a node comprises a Boolean value.
  - 65. The method of claim 58, wherein the status value associated with a node comprises a real-number value.
  - 66. The method of claim 58, wherein a status value associated with a node further has one or more associated attributes.
  - 67. The method of claim 66, wherein one of the one or more associated attributes comprises a temporal attribute.
  - 68. The method of claim 66, wherein one of the one or more associated attributes comprises an indication to identify the status value as being a measured value or an inferred value.
  - 69. The method of claim 56, wherein the act of performing a down-stream analysis comprises:
    - evaluating an impact policy associated with the second node and setting an impact value associated with the second node in accordance therewith; and
      
      evaluating impact policies associated with down-stream nodes to the second node and setting an impact value associated with each evaluated down-stream node in accordance therewith.
  - 70. The method of claim 69, wherein the act of evaluating impact policies is terminated when no down-stream nodes from the last evaluated node exist.
  - 71. The method of claim 70, wherein the act of evaluating impact policies is further terminated when an impact value associated with a node does not change based on evaluation of an impact policy associated with the node.
  - 72. The method of claim 69, wherein the impact value associated with a node comprises a Boolean value.
  - 73. The method of claim 69, wherein the impact value associated with a node comprises a real-number value.
  - 74. The method of claim 69, wherein an impact value associated with a node further has one or more associated attributes.
  - 75. The method of claim 74, wherein one of the one or more associated attributes comprises a temporal attribute.
  - 76. The method of claim 56, wherein the act of reporting the second node as a root cause comprises visually displaying an alarm condition for said second node to a user.
  - 77. The method of claim 56, wherein the act of reporting at least some of the identified nodes as impacted by the root cause comprises visually identifying the at least one of the identified nodes differently from the second node.

78. The method of claim 77, further comprising filtering event notifications received by at least one of the identified nodes so as to not report said event notification to a user as a root cause failure.
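
Claims 62 and 63 describe identifying the failed nodes that are most up-stream from the first node and then choosing an arbitrary one as the second node. A minimal sketch, assuming a hypothetical `upstream` mapping and a precomputed set of failed nodes:

```python
def most_upstream_failed(first, upstream, failed):
    """Return failed nodes reachable up-stream from `first` through failed
    nodes that have no failed up-stream neighbour of their own."""
    found, stack, seen = set(), [first], {first}
    while stack:
        node = stack.pop()
        failed_parents = [p for p in upstream.get(node, []) if p in failed]
        if not failed_parents and node in failed:
            found.add(node)  # nothing failed further up-stream
        for parent in failed_parents:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return found


upstream = {
    "web": ["switch_a", "switch_b"],
    "switch_a": ["router"],
    "switch_b": [],
    "router": [],
}
failed = {"web", "switch_a", "switch_b"}
candidates = most_upstream_failed("web", upstream, failed)
# Any candidate may serve as the second node; min() is used only to make
# the arbitrary choice deterministic in this sketch.
second_node = min(candidates)
```

Both switches qualify here because `router` has not failed, so either could be reported as the root cause.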

Current Assignee
Ken Sidelinger, BMC Software Incorporated (KKR & Co., Inc.)
Original Assignee
BMC Software Incorporated (KKR & Co., Inc.)
Inventors
Warpenburg, Michael R.; Scholtes, Michael J.

Granted Patent

US 7,062,683 B2
US Class Current

714/47
CPC Class Codes

G06F 11/0718   in an object-oriented system

G06F 11/079   Root cause analysis, i.e. e...

H04L 41/0631   using root cause analysis; ...

H04L 41/0681   Configuration of triggering...

H04L 41/0893   Assignment of logical group...

H04L 41/0894   Policy-based network config...

H04L 41/16   using machine learning or a...

H04L 43/0811   by checking connectivity

H04L 43/0817   by checking functioning
