Error self-checking and recovery using lock-step processor pair architecture
Abstract
A logical processor is formed from a pair of processor units operating in close synchrony to perform self-check operations. Outputs of one of the processor units are compared to those of the other processor unit. When one of the processor units experiences an error, creating a divergence, that error and/or divergence is made known to the Master processor, which then determines whether recovery from the error is possible. If so, the Master processor saves its processing state to memory and causes a reset of both processor units to an initial state, from which both begin executing reinitialization code using the previously saved state.
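The recovery flow summarized above can be sketched roughly as follows. This is only an illustrative model, not the patented implementation: the `ProcessorUnit` class, the `UNRECOVERABLE` set, the error names, and the `handle_divergence` function are all hypothetical, and the halt path for unrecoverable errors is taken from the claims rather than from the abstract.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProcessorUnit:
    """Hypothetical stand-in for one processor unit of the lock-step pair."""
    name: str
    state: dict = field(default_factory=dict)
    error: Optional[str] = None  # assumed: error code latched by the unit
    halted: bool = False

    def reset(self) -> None:
        # Return the unit to an initial state and clear any latched error.
        self.state = {}
        self.error = None


# Assumed classification of errors from which recovery is not possible.
UNRECOVERABLE = {"hard_fault"}


def handle_divergence(master: ProcessorUnit, shadow: ProcessorUnit, memory: dict) -> bool:
    """Run by the Master once a divergence is reported.

    Returns True if both units were reset and restored, False if halted.
    """
    error = master.error or shadow.error
    if error in UNRECOVERABLE:
        master.halted = shadow.halted = True
        return False
    # Save the Master's processing state to memory.
    memory["saved_state"] = dict(master.state)
    # Reset both units, then reinitialize each from the saved state.
    for unit in (master, shadow):
        unit.reset()
        unit.state = dict(memory["saved_state"])
    return True
```

After a successful call, both units hold a copy of the state the Master saved, so they can resume executing the same instruction stream from a common point.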
204 Citations
13 Claims
1. In a processing system having a Master processor unit, a Shadow processor unit, and a memory, the Master and Shadow processor units each executing an instruction stream that is identical to the other, a method for fault tolerant operation of the processing system that includes the steps of:
sending Master processor address and data signals to the memory;
sending the Master processor address and data signals to the memory checker;
after the Master processor unit has sent the Master processor address and data signals to the memory, checking the Master processor address and data signals against Shadow processor address and data signals communicated by the Shadow processor unit in order to assert a diverge signal if a mismatch is detected;
the Master processor unit checking to see if the Master processor unit or the Shadow processor unit experienced an error when the diverge signal is asserted;
halting processor operation if the Master processor determines that the error causing the mismatch is one from which recovery is not possible;
otherwise, saving processor state and data of the Master processor unit to the memory; and
restoring the saved state to the Master and Shadow processor units.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
providing a timer;
starting the timer when a mismatch is detected; and
asserting, by the timer, a time-out signal when the predetermined time has elapsed.
8. The method of claim 7, further including the steps of:
presetting the timer with the predetermined time value when a mismatch is detected; and
the timer asserting a time-out signal when the predetermined time has elapsed.
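The timer steps recited above might be modeled as a simple countdown, as in the sketch below. The `DivergeTimer` class, its tick-based clock, and its attribute names are assumptions made for illustration. The claims shown here do not say what the time-out signal triggers, so the sketch only raises a flag.

```python
class DivergeTimer:
    """Hypothetical countdown timer started when a mismatch is detected."""

    def __init__(self, predetermined_ticks: int) -> None:
        self.predetermined_ticks = predetermined_ticks
        self.remaining = None   # None means the timer is not running
        self.time_out = False   # models the time-out signal

    def on_mismatch(self) -> None:
        # Preset the timer with the predetermined value and start it.
        self.remaining = self.predetermined_ticks
        self.time_out = False

    def tick(self) -> None:
        # Advance one clock tick (assumed time base); do nothing if idle.
        if self.remaining is None:
            return
        if self.remaining > 0:
            self.remaining -= 1
        if self.remaining == 0:
            # The predetermined time has elapsed: assert the time-out signal.
            self.time_out = True
```

In this model the timer is inert until `on_mismatch()` is called, so ticks that occur while the processor pair is in agreement never produce a time-out.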
9. In a processing system having a Master processor unit, a Shadow processor unit, and a memory, the Master and Shadow processor units each executing an instruction stream that is identical to the other, a method for fault tolerant operation of the processing system that includes the steps of:
sending Master processor address and data signals to the memory;
sending the Master processor address and data signals to the memory checker;
after the Master processor unit has sent the Master processor address and data signals to the memory, checking the Master processor address and data signals against Shadow processor address and data signals communicated by the Shadow processor unit in order to assert a diverge signal if a mismatch is detected;
the Master processor unit checking to see if the Master processor unit or the Shadow processor unit experienced an error when the diverge signal is asserted;
saving processor state and data of the Master processor unit to the memory if available, or alternatively, saving the processor state and data of the Shadow processor unit to the memory; and
resetting the Master and Shadow processor units by restoring the saved state to the Master and Shadow processor units.
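The choice between Master and Shadow state in the saving step of claim 9 can be illustrated with a short sketch. It assumes that an unavailable state is represented by `None`; the function names and the `memory` dictionary are hypothetical, and the behavior when neither state is available is not addressed by the claim, so raising an exception there is purely an assumption.

```python
from typing import Optional, Tuple


def select_state_to_save(master_state: Optional[dict],
                         shadow_state: Optional[dict]) -> dict:
    """Prefer the Master's state; fall back to the Shadow's if it is unavailable."""
    if master_state is not None:
        return dict(master_state)
    if shadow_state is not None:
        return dict(shadow_state)
    # Not covered by the claim text: assumed failure mode.
    raise RuntimeError("no processor state available to save")


def recover(master_state: Optional[dict],
            shadow_state: Optional[dict],
            memory: dict) -> Tuple[dict, dict]:
    """Save the selected state to memory and restore it to both units."""
    memory["saved_state"] = select_state_to_save(master_state, shadow_state)
    # Both units are reset by restoring the same saved state.
    return dict(memory["saved_state"]), dict(memory["saved_state"])
```

For example, `recover(None, {"pc": 8}, {})` returns two copies of the Shadow's state, one for each unit.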
10. An article of manufacture comprising a Master processor, a Shadow processor, a memory and a checker to compare address and data signals of the Master processor to address and data signals of the Shadow processor in order to assert a diverge signal if a mismatch is detected, the memory including a computer program for causing the Master and the Shadow processor units to each execute an identical instruction stream and to cause the Master processor to tolerate faults by,
determining if the Master processor or the Shadow processor experienced an error when the diverge signal is asserted;
halting operation of the Master and the Shadow processors if the error is determined to be one from which recovery is not possible;
otherwise, saving processor state and data of the Master processor to the memory; and
resetting the Master and the Shadow processor units to resume operation after restoring the saved processor state and data to the Master and the Shadow processors;
wherein the checker conducts the comparison between the address and data signals of the Master processor and the address and data signals of the Shadow processor after the Master processor address and data signals have been sent to the memory.
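The ordering in the final clause of claim 10, where the Master's signals reach the memory before the checker compares them, can be sketched as below. The `BusSignals`, `Checker`, and `write_cycle` names are hypothetical, memory is modeled as a plain dictionary, and keeping the diverge signal latched once set is an assumption rather than something the claim states.

```python
from typing import NamedTuple


class BusSignals(NamedTuple):
    """Address and data signals driven by one processor in a write cycle."""
    address: int
    data: int


class Checker:
    """Hypothetical checker that asserts a diverge signal on mismatch."""

    def __init__(self) -> None:
        self.diverge = False

    def compare(self, master: BusSignals, shadow: BusSignals) -> bool:
        if master != shadow:
            self.diverge = True  # assumed to stay latched until recovery
        return self.diverge


def write_cycle(memory: dict, checker: Checker,
                master: BusSignals, shadow: BusSignals) -> bool:
    # The Master's address and data signals are sent to the memory first.
    memory[master.address] = master.data
    # Only afterwards does the checker compare them with the Shadow's signals.
    return checker.compare(master, shadow)
```

One consequence visible in this model is that a mismatching write has already reached memory by the time the diverge signal is asserted.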
11. A computer system, comprising:
a memory;
a Master processor unit and a Shadow processor unit for generating address and data signals and each coupled to the memory for receiving from the memory an instruction stream for execution;
a checker element coupled to receive and compare the address signals of the Master processor unit with those of the Shadow processor unit in order to assert a diverge signal if a mismatch is detected, the checker element conducting the comparison after the Master processor address signals have been sent to the memory;
the instruction stream including a computer program for causing the Master processor unit to tolerate faults by, determining if the Master processor unit or the Shadow processor unit experienced an error when the diverge signal is asserted;
halting operation of the Master and the Shadow processor units if the error is determined to be one from which recovery is not possible;
otherwise, saving the processor state and data of the Master processor unit to the memory; and
resetting the Master and the Shadow processor units to resume operation after restoring the saved processor state and data to the Master and the Shadow processor units.
12. A computer system, comprising:
a memory;
a Master and a Shadow processor units each for communicating address and data signals and for receiving from the memory an instruction stream for execution;
a checker element coupled to receive and compare the address signals of the Master processor unit with those of the Shadow processor unit to assert a diverge signal if a mismatch is detected, the checker element conducting the comparison after the Master processor address signals have been sent to the memory;
the instruction stream including a computer program for causing the Master processor unit to tolerate faults by, determining if the Master processor unit or the Shadow processor unit experienced an error when the diverge signal is asserted;
saving processor state and data of the Master processor unit to the memory if available, or alternatively, saving the processor state and data of the Shadow processor unit to the memory; and
resetting the Master and the Shadow processor units to resume operation after restoring the saved processor state and data to the Master and the Shadow processor units.
Dependent claims: 13
Specification