System and method for fault detection and recovery

US 7,000,154 B1
Filed: 05/13/2002
Issued: 02/14/2006
Est. Priority Date: 11/28/2001
Status: Expired due to Fees

Abstract

An apparatus and method for automatically detecting and recovering from a fault in a microprocessor-based system. The apparatus and method utilize a leaky bucket routine and an event handler procedure. The method may further use object-oriented techniques that abstract differences between hardware and software faults to allow for the development of a common framework.
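
The abstract names a leaky bucket routine but does not describe its mechanics on this page. The following is a minimal sketch of one common form of leaky bucket counter, assuming a constant drain rate; the class name `LeakyBucket`, the `leak_per_second` parameter, and the injectable clock are illustrative choices, not taken from the patent.

```python
import time


class LeakyBucket:
    """Fault counter whose count drains at a fixed rate (assumed behaviour)."""

    def __init__(self, leak_per_second, now=time.monotonic):
        self.leak_per_second = leak_per_second  # hypothetical drain rate
        self._now = now                         # injectable clock, eases testing
        self._count = 0.0
        self._last = now()

    def _drain(self):
        # Remove whatever has leaked out since the last update, never below zero.
        t = self._now()
        leaked = (t - self._last) * self.leak_per_second
        self._count = max(0.0, self._count - leaked)
        self._last = t

    def add(self, amount=1.0):
        """Record a fault event and return the updated count."""
        self._drain()
        self._count += amount
        return self._count

    @property
    def count(self):
        self._drain()
        return self._count
```

With this shape, a burst of events raises the count quickly, while isolated events leak away before they reach any threshold.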

56 Citations

20 Claims

1. A method for automatically detecting and recovering from a fault in a microprocessor-based system, comprising:
- reporting the fault as an event;
  
  processing the event including thresholding the event and co-relating the event to a cause;
  
  determining a recovery action as a function of the thresholding, the co-relating, and an elapsed time the system has been running, the recovery action being used to perform one or more of a restart of the system, cleanup of memory and data prior to restart, hardware resets to hardware modules or sub-assemblies, or releasing of resources that are marked unavailable due to the faulty behavior, wherein the recovery action is more aggressive in initial stages and less aggressive in later stages; and
  
  performing the recovery action.
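
Claim 1 makes the recovery action a function of three inputs: the thresholding result, the correlated cause, and the elapsed running time. One possible reading is sketched below. The cause names, the `BASELINE` table, the 300-second startup window, and the "escalate one step while the system is young" rule are all assumptions made for illustration; the claim only requires that recovery be more aggressive in initial stages than in later ones.

```python
from enum import IntEnum


class Recovery(IntEnum):
    # Actions named in claim 1, ordered here from least to most aggressive
    # (the ordering itself is an assumption).
    NONE = 0
    RELEASE_RESOURCES = 1
    HARDWARE_RESET = 2
    CLEANUP_AND_RESTART = 3


# Hypothetical correlation table: cause -> action once the threshold is crossed.
BASELINE = {
    "resource_leak": Recovery.RELEASE_RESOURCES,
    "module_hang": Recovery.HARDWARE_RESET,
    "memory_corruption": Recovery.CLEANUP_AND_RESTART,
}

STARTUP_WINDOW_S = 300.0  # assumed length of the "initial stage"


def determine_recovery(threshold_exceeded, cause, uptime_s):
    """Pick a recovery action from thresholding, cause, and elapsed time."""
    if not threshold_exceeded:
        return Recovery.NONE
    baseline = BASELINE.get(cause, Recovery.HARDWARE_RESET)
    if uptime_s < STARTUP_WINDOW_S:
        # Early in the run: escalate one step, capped at the strongest action.
        return Recovery(min(baseline + 1, Recovery.CLEANUP_AND_RESTART))
    return baseline
```

For example, `determine_recovery(True, "resource_leak", 10.0)` escalates to a hardware reset, while the same fault after the startup window only releases the affected resources.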

2. A method for detecting and recovering from faults in a system represented by an object hierarchy, comprising:
- in response to an event indicative of a fault condition, creating a list of object-cause pairs relevant to the event;
  
  incrementing a count in a leaky bucket associated with each object-cause pair; and
  
  if the count of the leaky bucket exceeds a plurality of thresholds, performing an action that has been associated with a highest threshold.
- Dependent claims (3, 4, 5, 6, 8):
  - 3. The method as recited in claim 2, wherein creating the list comprises identifying a first object from which the event originated and including on the list the first object and ancestral objects of the first object.
  - 4. The method as recited in claim 3, wherein creating the list comprises identifying a second object having an associate relationship with the first object in connection with a task that relates to the event and including on the list the second object and ancestral objects of the second object.
  - 5. The method as recited in claim 4, further comprising ordering the object-cause list based on a predetermined priority.
  - 6. The method as recited in claim 2, wherein the plurality of thresholds are defined for at least one leaky bucket, each threshold having an associated action to be performed.
  - 8. The method as recited in claim 2, further comprising determining if a condition within the system is true before performing an action if the threshold is exceeded.
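
Claims 3 to 5 describe how the object-cause list is assembled: the originating object and its ancestors, then an associated object and its ancestors, optionally ordered by priority. A minimal sketch of that construction follows. The `Node` type, the function names, and the priority dictionary are hypothetical, and the de-duplication of shared ancestors is an assumption the claims do not address.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """One object in the hierarchy (hypothetical representation)."""
    name: str
    parent: Optional["Node"] = None


def with_ancestors(node):
    """Return the node followed by its ancestors up to the root."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def object_cause_pairs(origin, cause, associate=None, priority=None):
    """Build the (object, cause) list for one event."""
    objects = with_ancestors(origin)                 # claim 3
    if associate is not None:                        # claim 4
        for obj in with_ancestors(associate):
            if obj not in objects:                   # skip shared ancestors (assumption)
                objects.append(obj)
    pairs = [(obj.name, cause) for obj in objects]
    if priority is not None:                         # claim 5
        # Higher priority first; the sort is stable, so ties keep list order.
        pairs.sort(key=lambda p: priority.get(p[0], 0), reverse=True)
    return pairs
```

Each pair in the returned list would then have its own leaky bucket incremented, per claim 2.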

7. The method as recited in claim 6, wherein the actions are selected from a list consisting of:
- do nothing;
  
  generate an alarm to the user of the system about the event;
  
  reset the hardware and re-initialize it and its driver to a clean state;
  
  switch hardware to a spare that is ready to take over in case the current hardware fails;
  
  reload a driver for a hardware object;
  
  kill process and restart it;
  
  kill process and start on a different processor;
  
  restart the system software;
  
  reboot the system; and
  
  power down the system.
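
Claims 2, 6, and 7 together describe several thresholds on one leaky bucket, each tied to an action, with the action of the highest exceeded threshold being performed. The sketch below shows one way to express that selection. The `Action` enum mirrors the list in claim 7; the function name `action_for`, the `(limit, action)` tuple layout, and the strict `>` comparison are assumptions.

```python
from enum import Enum


class Action(Enum):
    # Actions listed in claim 7.
    DO_NOTHING = "do nothing"
    ALARM = "generate an alarm"
    RESET_HARDWARE = "reset hardware and re-initialize driver"
    SWITCH_TO_SPARE = "switch hardware to a spare"
    RELOAD_DRIVER = "reload driver"
    RESTART_PROCESS = "kill process and restart it"
    MOVE_PROCESS = "kill process and start on a different processor"
    RESTART_SOFTWARE = "restart the system software"
    REBOOT_SYSTEM = "reboot the system"
    POWER_DOWN = "power down the system"


def action_for(count, thresholds):
    """Return the action tied to the highest threshold that count exceeds.

    thresholds is a list of (limit, Action) tuples in any order.
    """
    exceeded = [(limit, action) for limit, action in thresholds if count > limit]
    if not exceeded:
        return Action.DO_NOTHING
    return max(exceeded, key=lambda pair: pair[0])[1]
```

A bucket configured with `[(3, Action.ALARM), (6, Action.RESET_HARDWARE), (10, Action.REBOOT_SYSTEM)]` would raise an alarm at a count of 4 and reset the hardware at a count of 7.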

9. A computer-readable media having instructions for detecting and recovering from faults in a system represented by an object hierarchy, the instructions performing steps comprising:
- in response to an event indicative of a fault condition, creating a list of object-cause pairs relevant to the event;
  
  incrementing a count in a leaky bucket associated with each object-cause pair; and
  
  if the count of the leaky bucket exceeds a plurality of thresholds, performing an action that has been associated with a highest threshold.
- Dependent claims (10, 11, 12, 13, 15):
  - 10. The media as recited in claim 9, wherein creating the list comprises identifying a first object from which the event originated and including on the list the first object and ancestral objects of the first object.
  - 11. The media as recited in claim 10, wherein creating the list comprises identifying a second object having an associate relationship with the first object in connection with a task that relates to the event and including on the list the second object and ancestral objects of the second object.
  - 12. The media as recited in claim 11, wherein the instructions further order the object-cause list based on a predetermined priority.
  - 13. The media as recited in claim 9, wherein the plurality of thresholds are defined for at least one leaky bucket, each threshold having an associated action to be performed.
  - 15. The media as recited in claim 9, wherein the instructions further determine if a condition within the system is true before performing an action if the threshold is exceeded.

14. The media as recited in claim 13, wherein the actions are selected from a list consisting of:
- do nothing;
  
  generate an alarm to the user of the system about the event;
  
  reset the hardware and re-initialize it and its driver to a clean state;
  
  switch hardware to a spare that is ready to take over in case the current hardware fails;
  
  reload a driver for a hardware object;
  
  kill process and restart it;
  
  kill process and start on a different processor;
  
  restart the system software;
  
  reboot the system; and
  
  power down the system.

16. A method for automatically detecting and recovering from a fault in a microprocessor-based system, comprising:
- reporting the fault as an event;
  
  processing the event including thresholding the event and co-relating the event to a cause;
  
  determining a recovery action as a function of the thresholding, the co-relating, and an elapsed time the system has been running, wherein recovery is more aggressive in initial stages and less aggressive in later stages, the recovery actions being used to perform one or more of a restart of the system, cleanup of memory and data prior to restart, hardware resets to hardware modules or sub-assemblies, or releasing of resources that are marked unavailable due to the faulty behavior; and
  
  performing the recovery action.

17. A method for detecting and recovering from faults in a system represented by an object hierarchy, comprising:
- in response to an event indicative of a fault condition, creating a list of object-cause pairs relevant to the event;
  
  incrementing a count in a leaky bucket associated with each object-cause pair; and
  
  if the count of the leaky bucket exceeds a plurality of thresholds, performing an action that has been associated with a highest threshold, wherein the plurality of thresholds are defined for at least one leaky bucket, each threshold having an associated action to be performed.

18. A method for detecting and recovering from faults in a system represented by an object hierarchy, comprising:
- in response to an event indicative of a fault condition, creating a list of object-cause pairs relevant to the event;
  
  incrementing a count in a leaky bucket associated with each object-cause pair; and
  
  if the count of the leaky bucket exceeds a plurality of thresholds, performing an action selected from a list consisting of do nothing;
  
  generate an alarm to the user of the system about the event;
  
  reset the hardware and re-initialize it and its driver to a clean state;
  
  switch hardware to a spare that is ready to take over in case the current hardware fails;
  
  reload a driver for a hardware object;
  
  kill process and restart it;
  
  kill process and start on a different processor;
  
  restart the system software;
  
  reboot the system; and
  
  power down the system, wherein the plurality of thresholds are defined for at least one leaky bucket, each threshold having an associated action to be performed.

19. A computer-readable media having instructions for detecting and recovering from faults in a system represented by an object hierarchy, the instructions performing steps comprising:
- in response to an event indicative of a fault condition, creating a list of object-cause pairs relevant to the event;
  
  incrementing a count in a leaky bucket associated with each object-cause pair; and
  
  if the count of the leaky bucket exceeds a plurality of thresholds, performing an action that has been associated with a highest threshold, wherein the plurality of thresholds are defined for at least one leaky bucket, each threshold having an associated action to be performed.

20. A computer-readable media having instructions for detecting and recovering from faults in a system represented by an object hierarchy, the instructions performing steps comprising:
- in response to an event indicative of a fault condition, creating a list of object-cause pairs relevant to the event;
  
  incrementing a count in a leaky bucket associated with each object-cause pair; and
  
  if the count of the leaky bucket exceeds a plurality of thresholds, performing an action selected from a list consisting of do nothing;
  
  generate an alarm to the user of the system about the event;
  
  reset the hardware and re-initialize it and its driver to a clean state;
  
  switch hardware to a spare that is ready to take over in case the current hardware fails;
  
  reload a driver for a hardware object;
  
  kill process and restart it;
  
  kill process and start on a different processor;
  
  restart the system software;
  
  reboot the system; and
  
  power down the system, wherein the plurality of thresholds are defined for at least one leaky bucket, each threshold having an associated action to be performed.

Current Assignee: Intel Corporation
Original Assignee: Intel Corporation
Inventors: Kolluru, Nagendra V.; Lash, John K.; LeDuc, Douglas E.
Primary Examiner(s): Beausoliel, Robert
Assistant Examiner(s): Maskulinski, Michael C
Application Number: US10/145,449
Time in Patent Office: 1,373 Days
Field of Search: 714/47, 714/48
US Class Current: 714/47.2
CPC Class Codes:
G06F 11/0715   in a system implementing mu...
G06F 11/076   by exceeding a count or rat...
G06F 11/0793   Remedial or corrective acti...
