INTER-PROCESSOR FAILURE DETECTION AND RECOVERY
Abstract
An approach to detecting processor failure in a multi-processor environment is disclosed. Each CPU in the system may be made responsible for monitoring another CPU in the system. A CPUₙ reads, from a shared memory location, a timestampₙ₊₁ written by the CPUₙ₊₁ that it is monitoring. The CPUₙ then reads its own timestampₙ and compares the two timestamps to calculate a delta value. If the delta value is above a threshold, the CPUₙ determines that CPUₙ₊₁ has failed and initiates error handling for the CPUs in the system. One CPU may be designated a master CPU and be responsible for beginning the error handling process. In such embodiments, the CPUₙ may initiate error handling by notifying the master CPU that CPUₙ₊₁ has failed. If CPUₙ₊₁ is itself the master CPU, the CPUₙ may take additional steps to initiate error handling, and may broadcast a non-critical interrupt to all CPUs, triggering error handling.
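The escalation rule described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the specification: the names `Action` and `plan_error_handling` and the two action labels are hypothetical, and the sketch assumes CPUs are identified by integer indices. It only decides who should be signalled once a monitoring CPU has concluded that its neighbor failed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical description of how error handling is initiated."""
    kind: str       # "notify_master" or "broadcast_interrupt"
    targets: tuple  # indices of the CPUs that receive the signal


def plan_error_handling(failed_cpu: int, master_cpu: int, num_cpus: int) -> Action:
    """Choose the escalation path after failed_cpu has been declared failed."""
    if failed_cpu != master_cpu:
        # Usual case: the master is alive, so it is told about the failure
        # and begins the error handling process.
        return Action("notify_master", (master_cpu,))
    # The failed CPU is the master itself, so nobody can be notified.
    # The abstract describes broadcasting a non-critical interrupt to all
    # CPUs; whether the failed CPU reacts to it is outside this sketch.
    return Action("broadcast_interrupt", tuple(range(num_cpus)))


# Example: four CPUs, CPU 0 is the master.
print(plan_error_handling(failed_cpu=2, master_cpu=0, num_cpus=4))
print(plan_error_handling(failed_cpu=0, master_cpu=0, num_cpus=4))
```

Returning a description of the action, rather than sending signals directly, keeps the sketch independent of any particular interrupt mechanism.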
20 Claims
1. An apparatus for detecting processor failure in a multi-processor device, the apparatus comprising:
- a retrieval module configured to retrieve a timestampₙ₊₁ from a shared memory that is shared by a plurality of central processing units (CPUs), wherein the timestampₙ₊₁ is written to the shared memory by a CPUₙ₊₁;
- a comparison module configured to compare the timestampₙ₊₁ to a timestampₙ generated by a CPUₙ checking the CPUₙ₊₁ for failure and determine a delta value;
- the comparison module further configured to compare the delta value with a threshold value and determine whether the CPUₙ₊₁ has failed; and
- a detection module configured to, in response to the comparison module determining that the CPUₙ₊₁ has failed, initiate error handling for the plurality of CPUs.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A system for detecting processor failure in a multi-processor device, the system comprising:
- a shared memory that is shared by a plurality of central processing units (CPUs);
- a CPUₙ₊₁ of the plurality of CPUs configured to write a timestampₙ₊₁ to a global array in the shared memory;
- a CPUₙ of the plurality of CPUs configured to detect a failure in CPUₙ₊₁, detecting a failure comprising the steps of:
  - retrieving the timestampₙ₊₁ from the shared memory;
  - comparing the timestampₙ₊₁ to a timestampₙ generated by the CPUₙ and determining a delta value;
  - comparing the delta value with a threshold value and determining whether the CPUₙ₊₁ has failed; and
  - in response to determining that the CPUₙ₊₁ has failed, initiating error handling for the plurality of CPUs.
Dependent claims: 11, 13, 14
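The global array of timestamps recited in claim 10 can be pictured with a short Python sketch. It rests on several assumptions that are not in the claim: the class name `HeartbeatArray` and its methods are hypothetical, a plain list stands in for the shared memory, and the last CPU is assumed to monitor CPU 0 so that every CPU has a monitor.

```python
import time


class HeartbeatArray:
    """Hypothetical stand-in for a global timestamp array in shared memory."""

    def __init__(self, num_cpus: int) -> None:
        # One slot per CPU; each CPU writes only to its own slot.
        self.slots = [0.0] * num_cpus

    def write(self, cpu: int, now=None) -> None:
        """Record a fresh timestamp for cpu (defaults to a monotonic clock)."""
        self.slots[cpu] = time.monotonic() if now is None else now

    def monitored_by(self, cpu: int) -> int:
        """Index of the CPU that cpu checks; wraparound is an assumption."""
        return (cpu + 1) % len(self.slots)

    def read_neighbor(self, cpu: int) -> float:
        """Timestamp last written by the CPU that cpu is monitoring."""
        return self.slots[self.monitored_by(cpu)]


# Example: CPU 0 writes a timestamp, and CPU 3 (its assumed monitor) reads it.
table = HeartbeatArray(num_cpus=4)
table.write(0, now=10.0)
print(table.monitored_by(3), table.read_neighbor(3))
```

A monotonic clock is used as the default because wall-clock adjustments could otherwise distort the delta between two timestamps.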
12. The system of claim 11, wherein the threshold value is set lower than a system threshold value for the Fibre Channel Storage Host Adapter.
15. A method for detecting processor failure in a multi-processor device, the method comprising:
- retrieving a timestampₙ₊₁ from a shared memory that is shared by a plurality of central processing units (CPUs), wherein the timestampₙ₊₁ is written to the shared memory by a CPUₙ₊₁;
- comparing the timestampₙ₊₁ to a timestampₙ generated by a CPUₙ checking the CPUₙ₊₁ for failure; and
- in response to the difference between timestampₙ₊₁ and timestampₙ being larger than a threshold value, determining that there is a failure on CPUₙ₊₁ and initiating error handling for the plurality of CPUs.
Dependent claims: 16, 17, 18, 19, 20
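The comparison step of the method claim reduces to simple arithmetic, shown here as a minimal Python sketch. The function name `has_failed`, the parameter names, and the sample values are hypothetical. The sketch assumes both timestamps come from a common clock and treats a delta exactly equal to the threshold as healthy, following the claim's wording "larger than".

```python
def has_failed(own_timestamp: float, neighbor_timestamp: float,
               threshold: float) -> bool:
    """Return True when the monitored CPU's timestamp is too stale.

    own_timestamp is generated by the monitoring CPU; neighbor_timestamp
    is the value the monitored CPU last wrote to shared memory.
    """
    delta = own_timestamp - neighbor_timestamp
    return delta > threshold


# Worked example with a threshold of 50 time units:
# 1000 - 990 = 10, which is within the threshold.
print(has_failed(1000, 990, 50))
# 1000 - 900 = 100, which exceeds the threshold.
print(has_failed(1000, 900, 50))
```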
Specification