Fault handling process for enabling recovery, diagnosis, and self-testing of computer systems
Abstract
Methods, apparatus, and computer program products are disclosed for analyzing and recovering from severe to catastrophic faults in a computer system. When a fault occurs that cannot be handled by the computer system's normal fault handling processes, a shadow mode created by a fault handling virtual machine is invoked. The fault handling virtual machine executes only when the normally nonrecoverable fault is encountered and executes as a triangulated or shadow mode on the system. Once shadow mode is invoked, fault context data is collected on the system and used to analyze and recover from the fault. More specifically, one or more post-fault stable states are constructed by the fault handling virtual machine. These stable states are used to bring the computer system back to a normal operating state in which the component or action causing the initial nonrecoverable fault is avoided. Persistent faults may be encountered while the virtual machine is attempting to recover from the initial fault.
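The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the names `UnrecoverableFault`, `run`, `shadow_mode`, and the `avoid` and `fault_log` keys are all hypothetical, and a plain dictionary stands in for the system state. The sketch only shows the general idea of entering a separate mode on an otherwise unrecoverable fault, collecting fault context, building a stable state, and resuming while avoiding the faulting component.

```python
class UnrecoverableFault(Exception):
    """Hypothetical marker for a fault the normal handlers cannot resolve."""


def shadow_mode(name: str, fault: Exception, state: dict) -> dict:
    """Collect fault context, then build a stable state that avoids `name`."""
    # Fault context: which component failed, the error, and a state snapshot.
    context = {"component": name, "error": str(fault), "snapshot": dict(state)}
    stable = dict(context["snapshot"])
    # Mark the faulting component so that the resumed run skips it.
    stable["avoid"] = sorted(set(stable.get("avoid", ())) | {name})
    stable["fault_log"] = list(stable.get("fault_log", [])) + [context["error"]]
    return stable


def run(components: dict, state: dict) -> dict:
    """Run each enabled component; escalate to shadow mode on an unrecoverable fault.

    `components` maps a name to a callable that takes the mutable `state` dict.
    """
    for name, action in components.items():
        if name in state.get("avoid", ()):
            continue  # skip anything a previous recovery marked as faulty
        try:
            action(state)
        except UnrecoverableFault as fault:
            # Shadow mode is entered only here, never on the normal path.
            return run(components, shadow_mode(name, fault, state))
    return state
```

Because every recovery adds one more name to `avoid`, the recursion ends even if further faults appear during recovery. As a simplification, this sketch restarts the whole component sequence from the stable state rather than resuming mid-sequence.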
31 Claims
1. A method of handling a fault in a computer system comprising:
recognizing that an initial fault has occurred by operating in a first sequence in the computer system;
invoking an alternative mode of operation for the computer system upon recognizing the initial fault;
using the alternative mode to track performance of the system after the initial fault thereby gathering post-fault state information for fault diagnosis and system recovery; and
preventing a subsequent fault from reoccurring as a result of recovery from the initial fault by using a dynamic state of the computer system to cause the computer system to operate in a second sequence such that the initial fault and the subsequent fault are potentially avoided.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 18, 19
9. A method as recited in claim 7, further comprising mutating the post-fault stable state by performing one of progressing the post-fault stable state and regressing the post-fault stable state, and observing a response of the computer system to the mutated post-fault stable state.
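One way to picture progressing and regressing a stable state is as moving a pointer through an ordered list of state snapshots and probing the result. The sketch below rests on that assumption; the source does not say how states are stored or mutated, and `mutate`, `observe`, `history`, and `probe` are hypothetical names introduced only for illustration.

```python
from typing import Callable, List


def mutate(history: List[dict], index: int, direction: str) -> int:
    """Move the stable-state pointer one step forward or backward in `history`.

    `history` is a hypothetical ordered list of state snapshots.
    """
    if direction == "progress":
        return min(index + 1, len(history) - 1)
    if direction == "regress":
        return max(index - 1, 0)
    raise ValueError(f"unknown direction: {direction!r}")


def observe(state: dict, probe: Callable[[dict], None]) -> str:
    """Run `probe` against a copy of the state and report the response."""
    try:
        probe(dict(state))
        return "ok"
    except Exception as exc:  # any failure counts as a fault response here
        return f"fault: {exc}"
```

A caller would mutate the pointer, pass `history[new_index]` to `observe`, and compare the responses of the progressed and regressed states to narrow down where the fault begins.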
20. A fault handling virtual machine installed on a computer system upon detection of an unrecoverable fault, the fault handling virtual machine comprising:
a post fault stable state constructor for constructing a normal operating state for the computer system after a fault occurs;
a fault data collector for collecting specific information on the state of the computer system at the time of the fault; and
a fault data examination component for examining the specific information on the state of the computer system after a fault occurs.
Dependent claims: 21, 22, 23, 24, 25, 26, 27, 28
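The three components named in claim 20 could be composed roughly as follows. This is a sketch under stated assumptions, not the claimed design: the class names, the `recover` method, and the `running` and `excluded` state keys are hypothetical, and the examination step is reduced to reading which component was active when the fault occurred.

```python
from dataclasses import dataclass, field
import time


@dataclass
class FaultDataCollector:
    """Records what the system looked like at the time of the fault."""
    records: list = field(default_factory=list)

    def collect(self, state: dict, error: str) -> dict:
        record = {"time": time.time(), "error": error, "state": dict(state)}
        self.records.append(record)
        return record


class FaultDataExaminer:
    """Examines a collected record; here it only names the suspect component."""

    def examine(self, record: dict) -> str:
        # Assumption: the state carries a "running" key naming the active component.
        return record["state"].get("running", "unknown")


class StableStateConstructor:
    """Builds a normal operating state that leaves out the suspect component."""

    def construct(self, record: dict, suspect: str) -> dict:
        state = dict(record["state"])
        state["running"] = None
        state["excluded"] = sorted(set(state.get("excluded", [])) | {suspect})
        return state


@dataclass
class FaultHandlingVM:
    """Hypothetical container wiring the three components together."""
    collector: FaultDataCollector = field(default_factory=FaultDataCollector)
    examiner: FaultDataExaminer = field(default_factory=FaultDataExaminer)
    constructor: StableStateConstructor = field(default_factory=StableStateConstructor)

    def recover(self, state: dict, error: str) -> dict:
        record = self.collector.collect(state, error)      # collect
        suspect = self.examiner.examine(record)            # examine
        return self.constructor.construct(record, suspect) # construct
```

Keeping the collector, examiner, and constructor as separate objects mirrors the claim's separation of roles and lets each be replaced independently, for example with a more detailed examiner.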
29. A fault handling component in a computer system for handling a severe fault comprising:
a means for recognizing that an initial fault has occurred by operating in a first sequence in the computer system;
a means for invoking an alternative mode of operation for the computer system upon recognizing the initial fault;
a means for tracking performance of the system after the initial fault using the alternative mode thereby gathering post-fault state information for fault diagnosis and system recovery; and
a means for preventing a subsequent fault from reoccurring as a result of recovery from the initial fault by using a dynamic state of the computer system to cause the computer system to operate in a second sequence such that the initial fault and the subsequent fault are potentially avoided.
30. A computer-readable medium containing programmed instructions arranged to handle a fault in a computer system, the computer-readable medium including programmed instructions for:
recognizing that an initial fault has occurred by operating in a first sequence in the computer system;
invoking an alternative mode of operation for the computer system upon recognizing the initial fault;
using the alternative mode to track performance of the system after the initial fault thereby gathering post-fault state information for fault diagnosis and system recovery; and
preventing a subsequent fault from reoccurring as a result of recovery from the initial fault by using a dynamic state of the computer system to cause the computer system to operate in a second sequence such that the initial fault and the subsequent fault are potentially avoided.
31. A component in a computer system for handling a fault in a computer system, the component comprising:
a memory; and
a processor coupled to the memory, wherein the processor is programmed to perform the steps of:
recognizing that an initial fault has occurred by operating in a first sequence in the computer system;
invoking an alternative mode of operation for the computer system upon recognizing the initial fault;
using the alternative mode to track performance of the system after the initial fault thereby gathering post-fault state information for fault diagnosis and system recovery; and
preventing a subsequent fault from reoccurring as a result of recovery from the initial fault by using a dynamic state of the computer system to cause the computer system to operate in a second sequence such that the initial fault and the subsequent fault are potentially avoided.
Specification