Inter-processor failure detection and recovery

US 8,850,262 B2
Filed: 10/12/2010
Issued: 09/30/2014
Est. Priority Date: 10/12/2010
Status: Active Grant

Abstract

An approach to detecting processor failure in a multi-processor environment is disclosed. The approach may include making each CPU in the system responsible for monitoring another CPU in the system. A CPU_n reads, from a shared memory location, a timestamp_n+1 created by the CPU_n+1 that the CPU_n is monitoring. The CPU_n reads its own timestamp_n and compares the two timestamps to calculate a delta value. If the delta value is above a threshold, the CPU_n determines that CPU_n+1 has failed and initiates error handling for the CPUs in the system. One CPU may be designated a master CPU and be responsible for beginning the error handling process. In such embodiments, the CPU_n may initiate error handling by notifying the master CPU that CPU_n+1 has failed. If CPU_n+1 is the master CPU, the CPU_n may take additional steps to initiate error handling, and may broadcast a non-critical interrupt to all CPUs, triggering error handling.
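
The monitoring scheme described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the class `Ring`, its method names, the integer tick clock, and the threshold of 5 ticks are all assumptions made for illustration, and a plain Python list stands in for the shared memory.

```python
class Ring:
    """Hypothetical model: N CPUs in a logical ring, one timestamp slot each."""

    def __init__(self, size, threshold):
        self.size = size
        self.threshold = threshold          # assumed unit: clock ticks
        self.timestamps = [0] * size        # stand-in for the shared memory

    def heartbeat(self, cpu, now):
        """CPU `cpu` publishes its current timestamp."""
        self.timestamps[cpu] = now

    def neighbor(self, cpu):
        """Index of CPU_n+1, wrapping at the end of the ring."""
        return (cpu + 1) % self.size

    def neighbor_failed(self, cpu, now):
        """CPU_n compares its own timestamp (`now`) with CPU_n+1's."""
        delta = now - self.timestamps[self.neighbor(cpu)]
        return delta > self.threshold


ring = Ring(size=4, threshold=5)
for cpu in range(4):
    ring.heartbeat(cpu, now=10)

# CPU 2 stalls; the others keep publishing.
for cpu in (0, 1, 3):
    ring.heartbeat(cpu, now=20)

suspects = [ring.neighbor(c) for c in (0, 1, 3) if ring.neighbor_failed(c, now=20)]
print(suspects)  # prints [2]
```

In this sketch only CPU 1 flags its neighbor, because each CPU checks exactly one other CPU and the stalled CPU 2 is the successor of CPU 1.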

43 Citations

19 Claims

1. An apparatus comprising:
- non-transitory computer readable storage medium storing computer readable program code executable by a plurality of central processing units (CPU), wherein the plurality of CPUs are configured in a ring and each CPU_n determines whether a CPU_n+1 that is logically adjacent to the CPU_n in the ring has failed, the computer readable program code comprising:
  
  a retrieval module of the CPU_n configured to retrieve a timestamp_n+1 from a shared memory that is shared by the plurality of CPUs, wherein the timestamp_n+1 is written to the shared memory by the CPU_n+1, wherein the CPU_n is a first core in a multi-core processor and the CPU_n+1 is a second core in a multi-core processor, the multi-core processor comprising a plurality of cores;
  
  a comparison module of the CPU_n configured to compare the timestamp_n+1 to a timestamp_n generated by a CPU_n checking the CPU_n+1 for failure and determine a delta value;
  
  the comparison module of the CPU_n further configured to compare the delta value with a threshold value and determine whether the CPU_n+1 has failed; and
  
  a detection module of the CPU_n configured to, in response to the comparison module determining that the CPU_n+1 has failed, initiate error handling for the plurality of CPUs.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The apparatus of claim 1, the computer readable program code further comprising a timestamp module to read the timestamp_n from hardware and write the timestamp_n to the shared memory.
  - 3. The apparatus of claim 1, wherein the threshold value is set lower than a system threshold value for the system in which the multi-processor device operates.
  - 4. The apparatus of claim 1, wherein the CPU_n+1 is not a master CPU and the CPU_n is not the master CPU, initiating error handling comprising the CPU_n notifying the master CPU of the failure on CPU_n+1, and wherein the master CPU causes the plurality of CPUs to perform error handling.
  - 5. The apparatus of claim 1, wherein the CPU_n+1 is a master CPU, the detection module further configured to:
    - send a non-critical interrupt to CPU_n+1;
      
      send a critical interrupt to CPU_n+1 in response to the CPU_n+1 failing to respond to the non-critical interrupt; and
      
      broadcast a group non-critical interrupt to all CPUs in response to the CPU_n+1 failing to respond to the critical interrupt, wherein the group non-critical interrupt causes the CPUs to perform error handling.
  - 6. The apparatus of claim 1, wherein each of the plurality of CPUs has a dedicated cache line in the shared memory for writing timestamps.
  - 7. The apparatus of claim 1, the comparison module further configured to add additional time to the timestamp_n prior to comparing the timestamp_n+1 to the timestamp_n.
  - 8. The apparatus of claim 7, wherein the additional time accounts for time to move the timestamp_n+1 from CPU_n+1 to CPU_n.
  - 9. The apparatus of claim 1, wherein the shared memory stores one or more timestamps generated by the plurality of CPUs in a global array.
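
The escalation order recited in claim 5, for the case where the suspected CPU is the master, can be sketched as follows. The function name `escalate_master_failure` and the `responds_to` probe are hypothetical; the claim does not say how a response is detected, so the probe is simply passed in as a callable.

```python
def escalate_master_failure(responds_to):
    """Return the ordered steps CPU_n takes when the master CPU_n+1 appears failed.

    `responds_to(kind)` is a hypothetical probe that returns True if the
    master answered an interrupt of that kind ("non-critical" or "critical").
    """
    steps = ["non-critical interrupt to master"]
    if responds_to("non-critical"):
        return steps
    steps.append("critical interrupt to master")
    if responds_to("critical"):
        return steps
    # The master ignored both interrupts, so every CPU is told to recover.
    steps.append("group non-critical interrupt to all CPUs")
    return steps


dead_master = escalate_master_failure(lambda kind: False)
print(len(dead_master))  # prints 3
```

Each step is attempted only after the previous one has gone unanswered, so a master that is merely slow can still recover before the group interrupt is broadcast.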

10. A system comprising:
- a shared memory that is shared by a plurality of central processing units (CPUs), wherein the plurality of CPUs are configured in a ring and each CPU_n determines whether a CPU_n+1 that is logically adjacent to the CPU_n in the ring has failed;
  
  the CPU_n+1 of the plurality of CPUs configured to write a timestamp_n+1 to a global array in the shared memory, wherein the CPU_n is a first core in a multi-core processor and the CPU_n+1 is a second core in the multi-core processor, the multi-core processor comprising a plurality of cores;
  
  the CPU_n of the plurality of CPUs configured to detect a failure in CPU_n+1, detecting a failure comprising the steps of:
  
  retrieving the timestamp_n+1 from the shared memory;
  
  comparing the timestamp_n+1 to a timestamp_n generated by the CPU_n and determining a delta value;
  
  comparing the delta value with a threshold value and determining whether the CPU_n+1 has failed; and
  
  in response to determining that the CPU_n+1 has failed, initiating error handling for the plurality of CPUs.
- Dependent claims (11, 12, 13)
  - 11. The system of claim 10, wherein the shared memory and the plurality of CPUs are components of a Fibre Channel Storage Host Adapter.
  - 12. The system of claim 11, wherein the threshold value is set lower than a system threshold value for the Fibre Channel Storage Host Adapter.
  - 13. The system of claim 10, wherein the CPU_n is configured to read the timestamp_n from hardware and write the timestamp_n to the global array.
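
One possible layout for the global array of claims 10 and 13, combined with the dedicated cache line per CPU of claims 6 and 18, is sketched below. The 64-byte cache-line size, the ring size of 4, the little-endian 64-bit encoding, and the helper names `write_timestamp` and `read_timestamp` are all assumptions; a `bytearray` stands in for the shared memory and `time.monotonic_ns()` stands in for a hardware clock read.

```python
import struct
import time

CACHE_LINE = 64   # assumed cache-line size in bytes
NUM_CPUS = 4      # assumed ring size

# One cache line per CPU; a bytearray stands in for the shared memory.
global_array = bytearray(CACHE_LINE * NUM_CPUS)


def write_timestamp(cpu, ts=None):
    """Store a 64-bit timestamp at the start of this CPU's own cache line."""
    if ts is None:
        ts = time.monotonic_ns()   # stand-in for reading a hardware clock
    struct.pack_into("<Q", global_array, cpu * CACHE_LINE, ts)


def read_timestamp(cpu):
    """Load the 64-bit timestamp published by `cpu`."""
    return struct.unpack_from("<Q", global_array, cpu * CACHE_LINE)[0]


write_timestamp(1, ts=1234)
write_timestamp(2)            # uses the stand-in clock
print(read_timestamp(1))      # prints 1234
```

Giving each CPU its own cache line means a write by one CPU does not invalidate the line that holds another CPU's timestamp, which is the usual motivation for this kind of padding.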

14. A method for detecting processor failure, the method comprising:
- retrieving a timestamp_n+1 from a shared memory that is shared by a plurality of central processing units (CPUs), wherein the plurality of CPUs are configured in a ring and each CPU_n determines whether a CPU_n+1 that is logically adjacent to the CPU_n in the ring has failed, the timestamp_n+1 is written to the shared memory by the CPU_n+1, wherein the CPU_n is a first core in a multi-core processor and the CPU_n+1 is a second core in a multi-core processor, the multi-core processor comprising a plurality of cores;
  
  comparing, by the CPU_n, the timestamp_n+1 to a timestamp_n generated by the CPU_n checking the CPU_n+1 for failure; and
  
  in response to the difference between timestamp_n+1 and timestamp_n being larger than a threshold value, the CPU_n determining that there is a failure on CPU_n+1 and initiating error handling for the plurality of CPUs.
- Dependent claims (15, 16, 17, 18, 19)
  - 15. The method of claim 14, further comprising reading the timestamp_n from hardware and writing the timestamp_n to the shared memory.
  - 16. The method of claim 14, wherein the CPU_n+1 is not a master CPU and the CPU_n is not the master CPU, and wherein initiating error handling comprises the CPU_n notifying the master CPU of the failure on CPU_n+1, and wherein the master CPU causes the plurality of CPUs to perform error handling.
  - 17. The method of claim 14, wherein the CPU_n+1 is a master CPU, the method further comprising:
    - sending a non-critical interrupt to CPU_n+1;
      
      sending a critical interrupt to CPU_n+1 in response to the CPU_n+1 failing to respond to the non-critical interrupt; and
      
      broadcasting a group non-critical interrupt to all CPUs in response to the CPU_n+1 failing to respond to the critical interrupt, wherein the group non-critical interrupt causes the CPUs to perform error handling.
  - 18. The method of claim 14, wherein each of the plurality of CPUs has a dedicated cache line in the shared memory for writing timestamps.
  - 19. The method of claim 14, further comprising adding additional time to the timestamp_n prior to comparing the timestamp_n+1 to the timestamp_n.
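
Claims 16 and 17 distinguish two cases once CPU_n has decided that CPU_n+1 has failed: the failed CPU is the master, or neither CPU is the master. A minimal sketch of that dispatch is shown here. The function name `initiate_error_handling` and the returned strings are hypothetical, and the middle branch (CPU_n is itself the master) is an assumption, since the claims do not recite that case.

```python
def initiate_error_handling(n, failed, master):
    """Pick CPU_n's first action after it decides CPU `failed` is down."""
    if failed == master:
        # Claim 17 case: the master cannot coordinate its own recovery.
        return "escalate: interrupt master, then broadcast"
    if n == master:
        # Assumption: a master that detects the failure acts directly.
        return "master starts error handling on all CPUs"
    # Claim 16 case: neither CPU is the master, so CPU_n reports upward.
    return f"notify master CPU {master} that CPU {failed} failed"


print(initiate_error_handling(n=1, failed=2, master=0))
# prints: notify master CPU 0 that CPU 2 failed
```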

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Cardinell, Charles S.; Hathorn, Roger G.; Laubli, Bernhard; Van Patten, Timothy J.
Primary Examiner(s): TRUONG, LOAN
Application Number: US12/902,501
Publication Number: US 20120089861A1
Time in Patent Office: 1,449 Days
Field of Search: 714/2, 714/10, 714/4.2, 714/11, 714/47.1, 714/47.2
US Class Current: 714/10
CPC Class Codes:
G06F 11/0724 in a multiprocessor or a mu...
G06F 11/0757 by exceeding a time limit, ...
