COLLABORATIVE TROUBLESHOOTING COMPUTER SYSTEMS USING FAULT TREE ANALYSIS
First Claim
1. A computer-implemented method for troubleshooting a computer system, comprising:
detecting a fault event in the computer system, wherein the computer system is executing at least one software application;
retrieving a data structure storing a fault tree analysis describing the fault event, wherein the fault tree analysis specifies a hierarchical structure specifying one or more symptoms associated with the detected fault event and one or more root causes associated with the detected fault event;
retrieving one or more data values describing an operational status of the computer system;
determining, based on the fault tree analysis and the one or more data values, a predicted root cause for the fault event; and
presenting the predicted root cause on a user interface.
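The steps of this claim can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the `FaultNode` class, the predicate-based symptom checks, the depth-first traversal, and the metric names (`http_status`, `db_connections`, `db_max_connections`, `disk_free_pct`) are all hypothetical and do not come from the claim text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Metrics = Dict[str, float]


@dataclass
class FaultNode:
    """One node of a hypothetical fault tree: a symptom, or a root cause if it is a leaf."""
    name: str
    # Predicate over the operational data values; True means the condition is present.
    check: Callable[[Metrics], bool]
    children: List["FaultNode"] = field(default_factory=list)


def predict_root_cause(node: FaultNode, metrics: Metrics) -> Optional[str]:
    """Depth-first walk: follow matching symptoms until a leaf (root cause) is reached."""
    if not node.check(metrics):
        return None
    if not node.children:
        return node.name
    for child in node.children:
        cause = predict_root_cause(child, metrics)
        if cause is not None:
            return cause
    return None  # symptom matched, but no root cause beneath it was confirmed


# Hypothetical hierarchical structure for an "application unavailable" fault event.
tree = FaultNode(
    "application unavailable",
    check=lambda m: m["http_status"] >= 500,
    children=[
        FaultNode(
            "database connection pool exhausted",
            check=lambda m: m["db_connections"] >= m["db_max_connections"],
        ),
        FaultNode("disk full", check=lambda m: m["disk_free_pct"] < 1.0),
    ],
)

# Hypothetical data values describing the operational status of the system.
metrics = {
    "http_status": 503,
    "db_connections": 20,
    "db_max_connections": 100,
    "disk_free_pct": 0.4,
}

predicted = predict_root_cause(tree, metrics)
print(f"Predicted root cause: {predicted}")  # stands in for the user interface
```

In this sketch, interior nodes play the role of symptoms and leaves play the role of root causes. A real fault tree analysis might combine nodes with AND/OR gates or probabilities, which this example does not attempt.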
Abstract
Embodiments of the invention provide techniques for troubleshooting of computer systems using a fault tree analysis. In one embodiment, data parameters describing a status of a system may be monitored to determine the existence of a fault. In the event of a fault, fault tree analysis metadata may be evaluated to attempt to determine a root cause of the fault. If a root cause can be automatically determined, it may be presented to a user in a troubleshooting console, or may be used to trigger an automated corrective action. Alternatively, if a root cause cannot be automatically determined, the user may be presented with additional fault tree analysis metadata and any relevant data parameters in the troubleshooting console, so that the user may determine the root cause of the fault event.
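The two outcomes described in the abstract (a root cause found automatically, or a fallback to the user) can be sketched as a small dispatch function. This is only an illustration under stated assumptions: `handle_fault`, the report dictionary layout, and the example corrective action are hypothetical names introduced here, not part of the described embodiments.

```python
from typing import Callable, Dict, List, Optional


def handle_fault(
    predicted_cause: Optional[str],
    corrective_actions: Dict[str, Callable[[], str]],
    tree_metadata: List[str],
    data_parameters: Dict[str, float],
) -> Dict[str, object]:
    """Build what a hypothetical troubleshooting console would show for one fault event."""
    if predicted_cause is not None:
        # Root cause determined automatically: present it, and run a
        # corrective action if one is registered for this cause.
        report: Dict[str, object] = {"mode": "automatic", "root_cause": predicted_cause}
        action = corrective_actions.get(predicted_cause)
        if action is not None:
            report["action_result"] = action()
        return report
    # No root cause determined: give the user the fault tree metadata and the
    # relevant data parameters so they can determine the cause themselves.
    return {
        "mode": "manual",
        "candidates": tree_metadata,
        "data_parameters": data_parameters,
    }


# Hypothetical corrective action keyed by root cause name.
actions = {"disk full": lambda: "old log files removed"}

auto_report = handle_fault("disk full", actions, [], {})
manual_report = handle_fault(
    None,
    actions,
    ["database connection pool exhausted", "disk full"],
    {"disk_free_pct": 42.0},
)
print(auto_report)
print(manual_report)
```

Returning a plain dictionary keeps the sketch independent of any particular console; an actual embodiment would render this information in its own user interface.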
24 Claims
1. A computer-implemented method for troubleshooting a computer system, comprising:
detecting a fault event in the computer system, wherein the computer system is executing at least one software application;
retrieving a data structure storing a fault tree analysis describing the fault event, wherein the fault tree analysis specifies a hierarchical structure specifying one or more symptoms associated with the detected fault event and one or more root causes associated with the detected fault event;
retrieving one or more data values describing an operational status of the computer system;
determining, based on the fault tree analysis and the one or more data values, a predicted root cause for the fault event; and
presenting the predicted root cause on a user interface.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A computer useable storage medium having a computer readable program, wherein the computer readable program when executed on a computer causes the computer to perform an operation, comprising:
detecting a fault event in the computer system, wherein the computer system is executing at least one software application;
retrieving a data structure storing a fault tree analysis describing the fault event, wherein the fault tree analysis specifies a hierarchical structure specifying one or more symptoms associated with the detected fault event and one or more root causes associated with the detected fault event;
retrieving one or more data values describing an operational status of the computer system;
determining, based on the fault tree analysis and the one or more data values, a predicted root cause for the fault event; and
presenting the predicted root cause on a user interface.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A system, comprising:
a processor; and
a memory containing a monitoring program configured to monitor the availability of a networked software application, wherein the monitoring program, when executed on the processor, is configured to:
detect a fault event in the computer system, wherein the computer system is executing at least one software application;
retrieve a data structure storing a fault tree analysis describing the fault event, wherein the fault tree analysis specifies a hierarchical structure specifying one or more symptoms associated with the detected fault event and one or more root causes associated with the detected fault event;
retrieve one or more data values describing an operational status of the computer system;
determine, based on the fault tree analysis and the one or more data values, a predicted root cause for the fault event; and
present the predicted root cause on a user interface.
Dependent claims: 18, 19, 20, 21, 22, 23, 24
Specification