INTER-PROCESSOR FAILURE DETECTION AND RECOVERY

US 20120089861A1
Filed: 10/12/2010
Published: 04/12/2012
Est. Priority Date: 10/12/2010
Status: Active Grant

Abstract

An approach to detecting processor failure in a multi-processor environment is disclosed. The approach may include making each CPU in the system responsible for monitoring another CPU in the system. A CPU_n reads, from a shared memory location, a timestamp_n+1 created by the CPU_n+1 that CPU_n is monitoring. The CPU_n reads its own timestamp_n and compares the two timestamps to calculate a delta value. If the delta value is above a threshold, the CPU_n determines that CPU_n+1 has failed and initiates error handling for the CPUs in the system. One CPU may be designated a master CPU and be responsible for beginning the error handling process. In such embodiments, the CPU_n may initiate error handling by notifying the master CPU that CPU_n+1 has failed. If CPU_n+1 is the master CPU, the CPU_n may take additional steps to initiate error handling, and may broadcast a non-critical interrupt to all CPUs, triggering error handling.
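
The timestamp comparison described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: timestamps are modeled as integer ticks of a shared monotonic counter, the delta is taken as timestamp_n minus timestamp_n+1, and `check_neighbor`, `CheckResult`, and the threshold of 500 ticks are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    delta: int     # how far the monitored CPU's timestamp lags the monitor's
    failed: bool   # True when the lag exceeds the threshold


def check_neighbor(timestamp_n: int, timestamp_n_plus_1: int, threshold: int) -> CheckResult:
    """Compare CPU_n's own timestamp with the one last written by CPU_n+1.

    Assumption: both values are ticks of the same monotonic counter, so a
    large positive delta means CPU_n+1 has stopped updating its timestamp.
    """
    delta = timestamp_n - timestamp_n_plus_1
    return CheckResult(delta=delta, failed=delta > threshold)


# Shared memory modeled as a list indexed by CPU number (hypothetical layout).
shared_timestamps = [1000, 400]  # CPU_0 is current; CPU_1 last wrote at tick 400
result = check_neighbor(shared_timestamps[0], shared_timestamps[1], threshold=500)
print(result)
```

In this sketch the delta is 600 ticks, which exceeds the assumed threshold, so CPU_0 would treat CPU_1 as failed and go on to initiate error handling.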

20 Claims

1. An apparatus for detecting processor failure in a multi-processor device, the apparatus comprising:
- a retrieval module configured to retrieve a timestamp_n+1 from a shared memory that is shared by a plurality of central processing units (CPUs), wherein the timestamp_n+1 is written to the shared memory by a CPU_n+1;
  
  a comparison module configured to compare the timestamp_n+1 to a timestamp_n generated by a CPU_n checking the CPU_n+1 for failure and determine a delta value;
  
  the comparison module further configured to compare the delta value with a threshold value and determine whether the CPU_n+1 has failed; and
  
  a detection module configured to, in response to the comparison module determining that the CPU_n+1 has failed, initiate error handling for the plurality of CPUs.
- - 2. The apparatus of claim 1, further comprising a timestamp module to read the timestamp_n from hardware and write the timestamp_n to the shared memory.
  - 3. The apparatus of claim 1, wherein the threshold value is set lower than a system threshold value for the system in which the multi-processor device operates.
  - 4. The apparatus of claim 1, wherein the CPU_n+1 is not a master CPU and the CPU_n is not the master CPU, initiating error handling comprising the CPU_n notifying the master CPU of the failure on CPU_n+1, and wherein the master CPU causes the plurality of CPUs to perform error handling.
  - 5. The apparatus of claim 1, wherein the CPU_n+1 is a master CPU, the detection module further configured to:
    - send a non-critical interrupt to CPU_n+1;
      
      send a critical interrupt to CPU_n+1 in response to the CPU_n+1 failing to respond to the non-critical interrupt; and
      
      broadcast a group non-critical interrupt to all CPUs in response to the CPU_n+1 failing to respond to the critical interrupt, wherein the group non-critical interrupt causes the CPUs to perform error handling.
  - 6. The apparatus of claim 1, wherein each of the plurality of CPUs has its own cache line in the shared memory for writing timestamps.
  - 7. The apparatus of claim 1, the comparison module further configured to add additional time to the timestamp_n prior to comparing the timestamp_n+1 to the timestamp_n.
  - 8. The apparatus of claim 7, wherein the additional time accounts for time to move the timestamp_n+1 from CPU_n+1 to CPU_n.
  - 9. The apparatus of claim 1, wherein the shared memory stores one or more timestamps generated by the plurality of CPUs in a global array.
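
The two error-handling paths in claims 4 and 5 can be sketched as a small decision routine. This is an illustration under stated assumptions, not the claimed implementation: `initiate_error_handling` and the `send_interrupt` callback (which returns True when the target CPU responds) are hypothetical names. The claims do not say what follows when the master does respond, so the sketch simply stops escalating in that case. It also does not model the case where the monitoring CPU_n is itself the master.

```python
from typing import Callable, List


def initiate_error_handling(
    failed_cpu: int,
    master_cpu: int,
    send_interrupt: Callable[[int, str], bool],
) -> List[str]:
    """Return the ordered actions a monitoring CPU takes after detecting a failure.

    send_interrupt(cpu, kind) is a hypothetical hook that delivers an
    interrupt and reports whether the target CPU responded.
    """
    actions: List[str] = []

    # Claim 4 path: the failed CPU is not the master, so notify the master.
    if failed_cpu != master_cpu:
        actions.append(f"notify master CPU_{master_cpu} of failure on CPU_{failed_cpu}")
        return actions

    # Claim 5 path: the failed CPU is the master, so escalate step by step.
    for kind in ("non-critical", "critical"):
        actions.append(f"send {kind} interrupt to CPU_{failed_cpu}")
        if send_interrupt(failed_cpu, kind):
            return actions  # master responded; stop escalating (assumption)

    actions.append("broadcast group non-critical interrupt to all CPUs")
    return actions


# Example: the master (CPU_0) never responds, so the full escalation runs.
unresponsive = initiate_error_handling(0, 0, lambda cpu, kind: False)
for step in unresponsive:
    print(step)
```

Returning the actions as a list keeps the sketch free of hardware dependencies. A real device would issue the interrupts through its interrupt controller.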

10. A system for detecting processor failure in a multi-processor device, the system comprising:
- a shared memory that is shared by a plurality of central processing units (CPUs);
  
  a CPU_n+1 of the plurality of CPUs configured to write a timestamp_n+1 to a global array in the shared memory;
  
  a CPU_n of the plurality of CPUs configured to detect a failure in CPU_n+1, detecting a failure comprising the steps of:
  
  retrieving the timestamp_n+1 from the shared memory;
  
  comparing the timestamp_n+1 to a timestamp_n generated by the CPU_n and determining a delta value;
  
  comparing the delta value with a threshold value and determining whether the CPU_n+1 has failed; and
  
  in response to determining that the CPU_n+1 has failed, initiating error handling for the plurality of CPUs.
- - 11. The system of claim 10, wherein the shared memory and the plurality of CPUs are components of a Fibre Channel Storage Host Adapter.
  - 13. The system of claim 10, wherein the CPU_n is a first core in a multi-core processor and the CPU_n+1 is a second core in a multi-core processor, the multi-core processor comprising a plurality of cores.
  - 14. The system of claim 10, wherein the CPU_n is configured to read the timestamp_n from hardware and write the timestamp_n to the global array.
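
The global array of claims 10 and 14 can be sketched as one timestamp slot per CPU, with each CPU checking the slot of the CPU it monitors. The class name `HeartbeatArray`, its methods, and the wrap-around rule (the last CPU monitors CPU_0) are assumptions introduced for illustration. The source only states that each CPU may be responsible for monitoring another CPU.

```python
from typing import List, Optional


class HeartbeatArray:
    """Hypothetical model of the global timestamp array in shared memory."""

    def __init__(self, num_cpus: int) -> None:
        self.slots: List[int] = [0] * num_cpus  # one timestamp slot per CPU

    def write(self, cpu: int, now: int) -> None:
        """CPU writes its current timestamp into its own slot."""
        self.slots[cpu] = now

    def monitored_by(self, cpu: int) -> int:
        """CPU_n monitors CPU_n+1; wrapping to CPU_0 is an assumption."""
        return (cpu + 1) % len(self.slots)

    def find_failure(self, cpu: int, now: int, threshold: int) -> Optional[int]:
        """Refresh this CPU's slot, then check the monitored CPU's slot.

        Returns the index of the failed CPU, or None if the delta is within
        the threshold.
        """
        self.write(cpu, now)
        target = self.monitored_by(cpu)
        delta = self.slots[cpu] - self.slots[target]
        return target if delta > threshold else None


array = HeartbeatArray(4)
for cpu in range(4):
    array.write(cpu, 100)  # all four CPUs are alive at tick 100

# CPU_2 stalls; CPU_1, CPU_0 and CPU_3 each run their check at tick 700.
detected = [array.find_failure(cpu, now=700, threshold=500) for cpu in (1, 0, 3)]
print(detected)
```

Only CPU_1, which monitors CPU_2, reports a failure. The other two checks see fresh timestamps from the CPUs they monitor.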

12. The system of claim 11, wherein the threshold value is set lower than a system threshold value for the Fibre Channel Storage Host Adapter.

15. A method for detecting processor failure in a multi-processor device, the method comprising:
- retrieving a timestamp_n+1 from a shared memory that is shared by a plurality of central processing units (CPUs), wherein the timestamp_n+1 is written to the shared memory by a CPU_n+1;
  
  comparing the timestamp_n+1 to a timestamp_n generated by a CPU_n checking the CPU_n+1 for failure; and
  
  in response to the difference between timestamp_n+1 and timestamp_n being larger than a threshold value, determining that there is a failure on CPU_n+1 and initiating error handling for the plurality of CPUs.
- - 16. The method of claim 15, further comprising reading the timestamp_n from hardware and writing the timestamp_n to the shared memory.
  - 17. The method of claim 15, wherein the CPU_n+1 is not a master CPU and the CPU_n is not the master CPU, and wherein initiating error handling comprises the CPU_n notifying the master CPU of the failure on CPU_n+1, and wherein the master CPU causes the plurality of CPUs to perform error handling.
  - 18. The method of claim 15, wherein the CPU_n+1 is a master CPU, the method further comprising:
    - sending a non-critical interrupt to CPU_n+1;
      
      sending a critical interrupt to CPU_n+1 in response to the CPU_n+1 failing to respond to the non-critical interrupt; and
      
      broadcasting a group non-critical interrupt to all CPUs in response to the CPU_n+1 failing to respond to the critical interrupt, wherein the group non-critical interrupt causes the CPUs to perform error handling.
  - 19. The method of claim 15, wherein each of the plurality of CPUs has its own cache line in the shared memory for writing timestamps.
  - 20. The method of claim 15, further comprising adding additional time to the timestamp_n prior to comparing the timestamp_n+1 to the timestamp_n.
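
The adjustment recited in claim 20 (and in claims 7 and 8) can be sketched as an allowance added to timestamp_n before the subtraction. The function names, the allowance of 5 ticks, and the threshold of 500 ticks are hypothetical. The stated rationale is one plausible reading, not something the claims spell out: if CPU_n+1 writes a fresh timestamp just after CPU_n samples its own, the raw difference could be negative, and the allowance keeps the adjusted delta non-negative in that case.

```python
def adjusted_delta(timestamp_n: int, timestamp_n_plus_1: int, transfer_allowance: int) -> int:
    """Add the allowance to timestamp_n, then subtract timestamp_n+1.

    transfer_allowance is a hypothetical tick count standing in for the time
    needed to move timestamp_n+1 from CPU_n+1 to CPU_n.
    """
    return (timestamp_n + transfer_allowance) - timestamp_n_plus_1


def has_failed(
    timestamp_n: int,
    timestamp_n_plus_1: int,
    threshold: int,
    transfer_allowance: int = 0,
) -> bool:
    """Declare a failure when the adjusted delta exceeds the threshold."""
    return adjusted_delta(timestamp_n, timestamp_n_plus_1, transfer_allowance) > threshold


# CPU_n+1 wrote at tick 1003, slightly after CPU_n sampled tick 1000.
raw = adjusted_delta(1000, 1003, transfer_allowance=0)
compensated = adjusted_delta(1000, 1003, transfer_allowance=5)
print(raw, compensated)

# A stalled CPU_n+1 (last write at tick 400) is still detected.
print(has_failed(1000, 400, threshold=500, transfer_allowance=5))
```

With a fixed-width unsigned counter, a negative raw difference would wrap to a very large value and could trigger a false failure. Python integers do not wrap, so the sketch only shows the sign change.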

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Cardinell, Charles S.; Hathorn, Roger G.; Laubli, Bernhard; Van Patten, Timothy J.
Granted Patent: US 8,850,262 B2
US Class (Current): 714/2
CPC Class Codes:
G06F 11/0724 in a multiprocessor or a mu...
G06F 11/0757 by exceeding a time limit, ...
