Method and architecture for automated fault diagnosis and correction in a computer system
Abstract
A method, apparatus, and computer program product for diagnosing and resolving faults are disclosed. A disclosed fault management architecture includes a fault manager suitable for interfacing with diagnostic engines and fault correction agents. The diagnostic engines receive error information and identify associated fault possibilities. The fault possibility information is passed to fault correction agents, which diagnose and resolve the associated faults. The architecture uses logs to track the status of error information, the status of fault management exercises, and the fault status of system resources. Additionally, a soft error rate discriminator can be employed to track and resolve soft (correctable) errors in the system. The architecture is extensible, allowing additional diagnostic engines and agents to be plugged into the architecture without interrupting the normal operational flow of the computer system.
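The abstract does not say how the soft error rate discriminator decides that correctable errors have become a problem. One plausible reading, offered only as a sketch, is a sliding-window counter: report a fault possibility when more than a set number of correctable errors arrive within a set time span. The class name, the `max_errors` and `window_seconds` parameters, and the windowing rule are all assumptions made for illustration and are not taken from the patent text.

```python
from collections import deque


class SoftErrorRateDiscriminator:
    """Hypothetical sliding-window counter for correctable (soft) errors."""

    def __init__(self, max_errors: int, window_seconds: float):
        self.max_errors = max_errors          # assumed tolerance within one window
        self.window_seconds = window_seconds  # assumed window length
        self._timestamps = deque()

    def record(self, timestamp: float) -> bool:
        """Record one correctable error; return True if the rate is exceeded."""
        self._timestamps.append(timestamp)
        # Discard errors that have aged out of the window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors
```

With `max_errors=2` and `window_seconds=10.0`, a third error inside ten seconds trips the discriminator, while an isolated error long afterwards does not.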
59 Claims
1. A fault management architecture for use in a computer system, the architecture comprising:
a fault manager suitable for interfacing with diagnostic engines and fault correction agents, the fault manager being suitable for receiving error information and passing this information to the diagnostic engines;
at least one diagnostic engine for receiving error information and identifying a set of fault possibilities associated with the errors contained in the error information;
at least one fault correction agent for receiving the set of fault possibilities from the at least one diagnostic engine and then selecting a diagnosed fault, and then taking appropriate fault resolution action concerning the selected diagnosed fault; and
logs for tracking the status of error information, the status of fault management exercises, and the fault status of resources of the computer system.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25
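The elements of claim 1 can be pictured with a small sketch. It is an illustration under stated assumptions, not the patented implementation: the `FaultManager` class, its field names, the string-based error reports, and the two example plug-ins are all hypothetical. Diagnostic engines are modeled as functions that map an error report to a list of fault possibilities, and fault correction agents as functions that select one possibility and return the affected resource.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical plug-in signatures (not from the patent text).
DiagnosticEngine = Callable[[str], List[str]]   # error report -> fault possibilities
CorrectionAgent = Callable[[List[str]], str]    # fault possibilities -> chosen resource


@dataclass
class FaultManager:
    engines: List[DiagnosticEngine] = field(default_factory=list)
    agents: List[CorrectionAgent] = field(default_factory=list)
    error_log: List[str] = field(default_factory=list)           # status of error information
    exercise_log: List[dict] = field(default_factory=list)       # fault management exercises
    resource_status: Dict[str, str] = field(default_factory=dict)  # fault status of resources

    def handle_error(self, error: str) -> None:
        """Log the error, pass it to every engine, and hand results to the agents."""
        self.error_log.append(error)
        for engine in self.engines:
            candidates = engine(error)
            if not candidates:
                continue  # this engine has nothing to say about the error
            exercise = {"error": error, "candidates": candidates, "status": "diagnosing"}
            self.exercise_log.append(exercise)
            for agent in self.agents:
                resource = agent(candidates)
                self.resource_status[resource] = "faulted"
                exercise["status"] = "resolved"


def memory_engine(error: str) -> List[str]:
    # Toy engine: only ECC errors yield fault possibilities.
    return ["dimm0", "memory-controller"] if "ecc" in error else []


def first_candidate_agent(candidates: List[str]) -> str:
    # Toy agent: selects the first possibility; a real agent would also act on it.
    return candidates[0]


manager = FaultManager(engines=[memory_engine], agents=[first_candidate_agent])
manager.handle_error("ecc uncorrectable, bank 3")
manager.handle_error("disk timeout")
```

Because engines and agents are held in plain lists, further plug-ins can be appended at run time, which is one simple way to picture the extensibility described in the abstract.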
26. A method for diagnosing and correcting faults in a computer system having a fault management architecture, the method comprising:
receiving error information in a fault manager of the computer system;
diagnosing a set of fault possibilities associated with the error information, wherein said diagnosing is accomplished by the computer system; and
resolving the set of fault possibilities by choosing a selected fault from among the set of fault possibilities and then resolving the selected fault, wherein said choosing and resolving is accomplished by the computer system.
Dependent claims: 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38
39. A computer-readable program product for diagnosing and correcting faults in a computer system having a fault management architecture, the computer-readable program product configured to cause a computer to implement the computer-controlled steps of:
receiving error information in a fault manager of the computer system;
diagnosing a set of fault possibilities associated with the error information;
choosing a selected fault possibility from among the set of fault possibilities; and
resolving the selected fault possibility to resolve a fault.
Dependent claims: 40, 41, 42, 43, 44, 45, 46, 47
48. A computer system comprising:
a processor capable of processing computer readable instructions and generating error information;
a memory capable of storing computer readable information;
computer readable instructions enabling the computer system to capture error information from the computer system and generate error reports;
computer readable instructions enabling the computer system to analyze the error reports and generate a list of fault possibilities associated with the error reports;
computer readable instructions enabling the computer system to determine a probability of occurrence associated with each of the fault possibilities;
computer readable instructions enabling the computer system to determine which of the fault possibilities is the most likely to have caused the error report and select that as an actionable fault;
computer readable instructions enabling the computer system to resolve the actionable fault; and
computer readable instructions enabling the computer system to understand that the actionable fault has been resolved.
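The selection step in claim 48 (assign each fault possibility a probability, then treat the most likely one as the actionable fault) can be sketched as below. The function name and the dictionary representation are hypothetical, and the claim does not state how the probabilities are derived, so the sketch simply takes them as input.

```python
from typing import Dict


def select_actionable_fault(probabilities: Dict[str, float]) -> str:
    """Return the fault possibility with the highest probability of occurrence."""
    if not probabilities:
        raise ValueError("no fault possibilities to choose from")
    # On a tie, max() keeps the first candidate in insertion order.
    return max(probabilities, key=probabilities.get)


actionable = select_actionable_fault({"cpu0": 0.1, "dimm0": 0.7, "system-bus": 0.2})
```

Here `actionable` is `"dimm0"`, the candidate with the largest probability.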
49. The computer system of claim 48 further including computer readable instructions enabling the computer system to generate an error log that includes a listing of error reports.
50. The computer system of claim 48 further including computer readable instructions enabling the computer system to generate a fault management exercise log that includes a listing of fault possibilities and the current status of fault diagnosis.
51. The computer system of claim 48 further including computer readable instructions enabling the computer system to generate an automatic system recovery unit log that includes a listing of the current fault status of system resources of the computer system, a listing of fault diagnosis concerning the system resources, and a listing of error reports that led to the fault diagnosis concerning the system resource;
wherein, in the event of computer system failure, upon system restart, the information in the automatic system recovery unit log can be recalled and analyzed to diagnose faults.
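The recall-after-restart behavior in claim 51 implies that the automatic system recovery unit log is kept in storage that survives a failure. A minimal sketch follows, assuming a JSON file on disk; the file format, the function names, and the per-resource entry layout (`status`, `diagnosis`, `error_reports`) are illustrative assumptions rather than details from the claims.

```python
import json
from pathlib import Path


def save_asru_log(path: Path, entries: dict) -> None:
    """Persist the log; write to a temporary file first, then swap it into place."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(entries))
    tmp.replace(path)  # avoids leaving a half-written log if the system fails mid-write


def load_asru_log(path: Path) -> dict:
    """Recall the log on restart; an absent file means no recorded faults."""
    if not path.exists():
        return {}
    return json.loads(path.read_text())
```

On restart, the result of `load_asru_log` could be handed back to the diagnostic engines so that resources recorded as faulted before the failure are not silently returned to service.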
52. A computer network system having a fault management architecture configured for use in a computer system, the computer network system comprising:
a plurality of nodes interconnected in a network;
a fault manager mounted at a first node on the network and configured to diagnose and resolve faults occurring at said first node.
Dependent claims: 53, 54, 55, 56, 57, 58, 59
Specification