System and method for fault detection and recovery
Abstract
An apparatus and method for automatically detecting and recovering from a fault in a microprocessor-based system. The apparatus and method use a leaky bucket routine and an event handler procedure. The method may further use object-oriented techniques that abstract the differences between hardware and software faults, allowing a common framework to be developed.
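The abstract names a leaky bucket routine without giving its details. A minimal sketch of one common form is shown here, assuming the bucket gains one count per fault event and loses one count per fixed interval. The class name `LeakyBucket`, the `leak_interval` parameter, and the timestamps are illustrative assumptions, not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class LeakyBucket:
    """Counts fault events and drains ("leaks") the count at a fixed rate."""

    leak_interval: float    # assumed: seconds needed for one count to leak away
    count: int = 0
    last_leak: float = 0.0  # time up to which leaking has been accounted for

    def _leak(self, now: float) -> None:
        # Remove one count for every full leak interval that has elapsed.
        leaked = int((now - self.last_leak) // self.leak_interval)
        if leaked > 0:
            self.count = max(0, self.count - leaked)
            self.last_leak += leaked * self.leak_interval

    def add_event(self, now: float) -> int:
        """Record one fault event at time `now` and return the new count."""
        self._leak(now)
        self.count += 1
        return self.count


bucket = LeakyBucket(leak_interval=10.0)
burst = [bucket.add_event(t) for t in (0.0, 1.0, 2.0)]  # faults in quick succession
later = bucket.add_event(35.0)                           # isolated fault much later
```

With these assumed values, a burst of faults drives the count up, while an isolated fault after a quiet period finds the bucket drained. This is the property that lets thresholds react to fault rate rather than to a lifetime total.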
56 Citations
20 Claims
1. A method for automatically detecting and recovering from a fault in a microprocessor-based system, comprising:
reporting the fault as an event;
processing the event including thresholding the event and co-relating the event to a cause;
determining a recovery action as a function of the thresholding, the co-relating, and an elapsed time the system has been running, the recovery action being used to perform one or more of a restart of the system, cleanup of memory and data prior to restart, hardware resets to hardware modules or sub-assemblies, or releasing of resources that are marked unavailable due to the faulty behavior, wherein the recovery action is more aggressive in initial stages and less aggressive in later stages; and
performing the recovery action.
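The determining step of claim 1 can be pictured as a function of three inputs: the thresholding result, the correlated cause, and the elapsed running time. The sketch below is one possible reading, not the patented implementation. The `Recovery` names, the 300-second stage boundary, the cause strings, and the "escalate one step during the initial stage" rule are all assumptions made for illustration.

```python
from enum import IntEnum


class Recovery(IntEnum):
    # Hypothetical ordering from least to most aggressive.
    RELEASE_RESOURCES = 1
    HARDWARE_RESET = 2
    CLEANUP_AND_RESTART = 3


# Assumed boundary between "initial" and "later" stages, in seconds of uptime.
INITIAL_STAGE_SECONDS = 300.0

# Assumed mapping from a correlated cause to a baseline action.
BASELINE = {
    "resource_leak": Recovery.RELEASE_RESOURCES,
    "module_hang": Recovery.HARDWARE_RESET,
}


def determine_recovery(threshold_exceeded: bool, cause: str, uptime: float):
    """Return a Recovery action, or None if the event stays below threshold."""
    if not threshold_exceeded:
        return None
    action = BASELINE.get(cause, Recovery.HARDWARE_RESET)
    if uptime < INITIAL_STAGE_SECONDS:
        # Initial stage: escalate one step, capped at the most aggressive action.
        action = Recovery(min(action + 1, Recovery.CLEANUP_AND_RESTART))
    return action


early = determine_recovery(True, "module_hang", uptime=10.0)
late = determine_recovery(True, "module_hang", uptime=1000.0)
```

Under these assumptions the same fault yields a restart shortly after start-up but only a hardware reset once the system has been running for a long time.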
2. A method for detecting and recovering from faults in a system represented by an object hierarchy, comprising:
in response to an event indicative of a fault condition, creating a list of object-cause pairs relevant to the event;
incrementing a count in a leaky bucket associated with each object-cause pair; and
if the count of the leaky bucket exceeds a plurality of thresholds, performing an action that has been associated with a highest threshold.
Dependent claims: 3, 4, 5, 6, 8
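The steps of claim 2 can be sketched as follows. The claim does not say how the list of object-cause pairs is built. This sketch assumes the reporting object and each of its ancestors in the hierarchy are paired with the reported cause. The object names, the threshold values, and the action names are hypothetical, and the leaking of the buckets is omitted to keep the example short.

```python
from collections import defaultdict

# Hypothetical object hierarchy: child -> parent.
PARENT = {"port3": "card1", "card1": "chassis"}

# Hypothetical thresholds per bucket, in ascending order:
# (count that must be exceeded, action name).
THRESHOLDS = [(1, "alarm"), (3, "reset_hardware"), (5, "switch_to_spare")]

counts = defaultdict(int)  # one bucket count per (object, cause) pair


def object_cause_pairs(obj, cause):
    """List the reporting object and each ancestor, paired with the cause."""
    pairs = []
    while obj is not None:
        pairs.append((obj, cause))
        obj = PARENT.get(obj)
    return pairs


def handle_event(obj, cause):
    """Increment each relevant bucket; return {pair: action} for buckets over a threshold."""
    actions = {}
    for pair in object_cause_pairs(obj, cause):
        counts[pair] += 1
        exceeded = [action for limit, action in THRESHOLDS if counts[pair] > limit]
        if exceeded:
            actions[pair] = exceeded[-1]  # action of the highest threshold exceeded
    return actions
```

Because the thresholds are kept in ascending order, the last entry of `exceeded` is the action tied to the highest threshold the count has passed, which matches the final step of the claim.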
7. The method as recited in claim 6, wherein the actions are selected from a list consisting of:
do nothing;
generate an alarm to the user of the system about the event;
reset the hardware and re-initialize it and its driver to a clean state;
switch hardware to a spare that is ready to take over in case the current hardware fails;
reload a driver for a hardware object;
kill process and restart it;
kill process and start on a different processor;
restart the system software;
reboot the system; and
power down the system.
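The action list in claim 7 maps naturally onto an enumeration plus a dispatch table. The sketch below is illustrative only. The member names, the numbering (which simply follows the order in which the claim lists the actions), and the two sample handlers are assumptions. Real handlers would be platform-specific.

```python
from enum import IntEnum


class Action(IntEnum):
    """The ten actions named in the claim, numbered in the order listed."""
    DO_NOTHING = 0
    GENERATE_ALARM = 1
    RESET_HARDWARE = 2
    SWITCH_TO_SPARE = 3
    RELOAD_DRIVER = 4
    RESTART_PROCESS = 5
    MOVE_PROCESS = 6
    RESTART_SYSTEM_SOFTWARE = 7
    REBOOT_SYSTEM = 8
    POWER_DOWN = 9


log = []  # stands in for real side effects in this sketch

# Hypothetical handlers; only two actions are wired up here.
HANDLERS = {
    Action.DO_NOTHING: lambda: None,
    Action.GENERATE_ALARM: lambda: log.append("alarm raised"),
}


def perform(action):
    """Run the handler for `action`; unregistered actions are only logged."""
    handler = HANDLERS.get(action)
    if handler is None:
        log.append(f"unhandled: {action.name}")
    else:
        handler()


perform(Action.GENERATE_ALARM)
perform(Action.REBOOT_SYSTEM)
```

Keeping the actions in one enumeration lets a threshold table refer to them by name, while the dispatch table isolates the platform-specific code that actually carries them out.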
9. A computer-readable media having instructions for detecting and recovering from faults in a system represented by an object hierarchy, the instructions performing steps comprising:
in response to an event indicative of a fault condition, creating a list of object-cause pairs relevant to the event;
incrementing a count in a leaky bucket associated with each object-cause pair; and
if the count of the leaky bucket exceeds a plurality of thresholds, performing an action that has been associated with a highest threshold.
Dependent claims: 10, 11, 12, 13, 15
14. The media as recited in claim 13, wherein the actions are selected from a list consisting of:
do nothing;
generate an alarm to the user of the system about the event;
reset the hardware and re-initialize it and its driver to a clean state;
switch hardware to a spare that is ready to take over in case the current hardware fails;
reload a driver for a hardware object;
kill process and restart it;
kill process and start on a different processor;
restart the system software;
reboot the system; and
power down the system.
16. A method for automatically detecting and recovering from a fault in a microprocessor-based system, comprising:
reporting the fault as an event;
processing the event including thresholding the event and co-relating the event to a cause;
determining a recovery action as a function of the thresholding, the co-relating, and an elapsed time the system has been running, wherein recovery is more aggressive in initial stages and less aggressive in later stages, the recovery actions being used to perform one or more of a restart of the system, cleanup of memory and data prior to restart, hardware resets to hardware modules or sub-assemblies, or releasing of resources that are marked unavailable due to the faulty behavior; and
performing the recovery action.
17. A method for detecting and recovering from faults in a system represented by an object hierarchy, comprising:
in response to an event indicative of a fault condition, creating a list of object-cause pairs relevant to the event;
incrementing a count in a leaky bucket associated with each object-cause pair; and
if the count of the leaky bucket exceeds a plurality of thresholds, performing an action that has been associated with a highest threshold,
wherein the plurality of thresholds are defined for at least one leaky bucket, each threshold having an associated action to be performed.
18. A method for detecting and recovering from faults in a system represented by an object hierarchy, comprising:
in response to an event indicative of a fault condition, creating a list of object-cause pairs relevant to the event;
incrementing a count in a leaky bucket associated with each object-cause pair; and
if the count of the leaky bucket exceeds a plurality of thresholds, performing an action selected from a list consisting of:
do nothing;
generate an alarm to the user of the system about the event;
reset the hardware and re-initialize it and its driver to a clean state;
switch hardware to a spare that is ready to take over in case the current hardware fails;
reload a driver for a hardware object;
kill process and restart it;
kill process and start on a different processor;
restart the system software;
reboot the system; and
power down the system,
wherein the plurality of thresholds are defined for at least one leaky bucket, each threshold having an associated action to be performed.
19. A computer-readable media having instructions for detecting and recovering from faults in a system represented by an object hierarchy, the instructions performing steps comprising:
in response to an event indicative of a fault condition, creating a list of object-cause pairs relevant to the event;
incrementing a count in a leaky bucket associated with each object-cause pair; and
if the count of the leaky bucket exceeds a plurality of thresholds, performing an action that has been associated with a highest threshold,
wherein the plurality of thresholds are defined for at least one leaky bucket, each threshold having an associated action to be performed.
20. A computer-readable media having instructions for detecting and recovering from faults in a system represented by an object hierarchy, the instructions performing steps comprising:
in response to an event indicative of a fault condition, creating a list of object-cause pairs relevant to the event;
incrementing a count in a leaky bucket associated with each object-cause pair; and
if the count of the leaky bucket exceeds a plurality of thresholds, performing an action selected from a list consisting of:
do nothing;
generate an alarm to the user of the system about the event;
reset the hardware and re-initialize it and its driver to a clean state;
switch hardware to a spare that is ready to take over in case the current hardware fails;
reload a driver for a hardware object;
kill process and restart it;
kill process and start on a different processor;
restart the system software;
reboot the system; and
power down the system,
wherein the plurality of thresholds are defined for at least one leaky bucket, each threshold having an associated action to be performed.
Specification