Inter-processor failure detection and recovery
First Claim
1. An apparatus comprising:
- non-transitory computer readable storage medium storing computer readable program code executable by a plurality of central processing units (CPU), wherein the plurality of CPUs are configured in a ring and each CPUn determines whether a CPUn+1 that is logically adjacent to the CPUn in the ring has failed, the computer readable program code comprising:
a retrieval module of the CPUn configured to retrieve a timestampn+1 from a shared memory that is shared by the plurality of CPUs, wherein the timestampn+1 is written to the shared memory by the CPUn+1, wherein the CPUn is a first core in a multi-core processor and the CPUn+1 is a second core in a multi-core processor, the multi-core processor comprising a plurality of cores;
a comparison module of the CPUn configured to compare the timestampn+1 to a timestampn generated by a CPUn checking the CPUn+1 for failure and determine a delta value;
the comparison module of the CPUn further configured to compare the delta value with a threshold value and determine whether the CPUn+1 has failed; and
a detection module of the CPUn configured to, in response to the comparison module determining that the CPUn+1 has failed, initiate error handling for the plurality of CPUs.
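The three modules recited in the claim can be illustrated with a minimal Python sketch. This is not the patented implementation: the shared memory is modeled as a plain list indexed by CPU number, and the names `neighbor`, `retrieve_timestamp`, `has_failed`, `check_neighbor`, and the `on_failure` callback are hypothetical, chosen only for illustration.

```python
from typing import Callable, List


def neighbor(n: int, num_cpus: int) -> int:
    """Index of the CPU that CPU n monitors: the next one around the ring."""
    return (n + 1) % num_cpus


def retrieve_timestamp(shared: List[float], n: int) -> float:
    """Retrieval step: read the timestamp last written by CPU n+1.

    `shared` stands in for the shared memory (an assumption of this sketch).
    """
    return shared[neighbor(n, len(shared))]


def has_failed(ts_neighbor: float, ts_self: float, threshold: float) -> bool:
    """Comparison step: compute the delta and compare it with the threshold."""
    delta = ts_self - ts_neighbor
    return delta > threshold


def check_neighbor(
    shared: List[float],
    n: int,
    now: float,
    threshold: float,
    on_failure: Callable[[int], None],
) -> bool:
    """Detection step: if CPU n+1 looks failed, initiate error handling.

    `now` is the timestamp generated by CPU n at the time of the check.
    `on_failure` is a hypothetical hook standing in for error handling.
    """
    failed = has_failed(retrieve_timestamp(shared, n), now, threshold)
    if failed:
        on_failure(neighbor(n, len(shared)))
    return failed
```

In this sketch each CPU would call `check_neighbor` periodically with its own index, so that every CPU in the ring is watched by exactly one other CPU.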
Abstract
An approach to detecting processor failure in a multi-processor environment is disclosed. The approach may make each CPU in the system responsible for monitoring another CPU in the system. A CPUn reads, from a shared memory location, a timestampn+1 created by the CPUn+1 that it is monitoring. The CPUn reads its own timestampn and compares the two timestamps to calculate a delta value. If the delta value is above a threshold, the CPUn determines that CPUn+1 has failed and initiates error handling for the CPUs in the system. One CPU may be designated a master CPU and be responsible for beginning the error handling process. In such embodiments, the CPUn may initiate error handling by notifying the master CPU that CPUn+1 has failed. If CPUn+1 is itself the master CPU, the CPUn may take additional steps to initiate error handling, and may broadcast a non-critical interrupt to all CPUs, triggering error handling.
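The master-CPU escalation path described in the abstract can be sketched as follows. This is a hedged illustration only: the function `initiate_error_handling`, its return strings, and the two callbacks `notify_master` and `broadcast_interrupt` are hypothetical names, and the abstract does not specify how notification or the interrupt is actually delivered.

```python
from typing import Callable


def initiate_error_handling(
    failed_cpu: int,
    master_cpu: int,
    notify_master: Callable[[int], None],
    broadcast_interrupt: Callable[[], None],
) -> str:
    """Choose how the detecting CPU starts error handling.

    If the failed CPU is the master, the master cannot be asked to start
    error handling, so this sketch broadcasts an interrupt to all CPUs.
    Otherwise it reports the failed CPU to the master.
    """
    if failed_cpu == master_cpu:
        broadcast_interrupt()  # hypothetical hook: interrupt every CPU
        return "broadcast"
    notify_master(failed_cpu)  # hypothetical hook: tell the master who failed
    return "notified_master"
```

Passing the two actions in as callbacks keeps the decision logic separate from any platform-specific interrupt or messaging mechanism, which the abstract leaves open.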
43 Citations
19 Claims
1. An apparatus comprising:
non-transitory computer readable storage medium storing computer readable program code executable by a plurality of central processing units (CPU), wherein the plurality of CPUs are configured in a ring and each CPUn determines whether a CPUn+1 that is logically adjacent to the CPUn in the ring has failed, the computer readable program code comprising:
a retrieval module of the CPUn configured to retrieve a timestampn+1 from a shared memory that is shared by the plurality of CPUs, wherein the timestampn+1 is written to the shared memory by the CPUn+1, wherein the CPUn is a first core in a multi-core processor and the CPUn+1 is a second core in a multi-core processor, the multi-core processor comprising a plurality of cores;
a comparison module of the CPUn configured to compare the timestampn+1 to a timestampn generated by a CPUn checking the CPUn+1 for failure and determine a delta value;
the comparison module of the CPUn further configured to compare the delta value with a threshold value and determine whether the CPUn+1 has failed; and
a detection module of the CPUn configured to, in response to the comparison module determining that the CPUn+1 has failed, initiate error handling for the plurality of CPUs.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A system comprising:
a shared memory that is shared by a plurality of central processing units (CPUs), wherein the plurality of CPUs are configured in a ring and each CPUn determines whether a CPUn+1 that is logically adjacent to the CPUn in the ring has failed;
the CPUn+1 of the plurality of CPUs configured to write a timestampn+1 to a global array in the shared memory, wherein the CPUn is a first core in a multi-core processor and the CPUn+1 is a second core in the multi-core processor, the multi-core processor comprising a plurality of cores;
the CPUn of the plurality of CPUs configured to detect a failure in CPUn+1, detecting a failure comprising the steps of:
retrieving the timestampn+1 from the shared memory;
comparing the timestampn+1 to a timestampn generated by the CPUn and determining a delta value;
comparing the delta value with a threshold value and determining whether the CPUn+1 has failed; and
in response to determining that the CPUn+1 has failed, initiating error handling for the plurality of CPUs.
Dependent claims: 11, 12, 13
14. A method for detecting processor failure, the method comprising:
retrieving a timestampn+1 from a shared memory that is shared by a plurality of central processing units (CPUs), wherein the plurality of CPUs are configured in a ring and each CPUn determines whether a CPUn+1 that is logically adjacent to the CPUn in the ring has failed, the timestampn+1 is written to the shared memory by the CPUn+1, wherein the CPUn is a first core in a multi-core processor and the CPUn+1 is a second core in a multi-core processor, the multi-core processor comprising a plurality of cores;
comparing by the CPUn, the timestampn+1 to a timestampn generated by the CPUn checking the CPUn+1 for failure; and
in response to the difference between timestampn+1 and timestampn being larger than a threshold value, the CPUn determining that there is a failure on CPUn+1 and initiating error handling for the plurality of CPUs.
Dependent claims: 15, 16, 17, 18, 19
Specification