Fault-tolerant computer system with online recovery and reintegration of redundant components
Abstract
A computer system in a fault-tolerant configuration employs multiple identical CPUs executing the same instruction stream, with multiple, identical memory modules in the address space of the CPUs storing duplicates of the same data. The system detects faults in the CPUs and memory modules, and places a faulty unit offline while continuing to operate using the good units. The faulty unit can be replaced and reintegrated into the system without shutdown. The multiple CPUs are loosely synchronized, as by detecting events such as memory references and stalling any CPU ahead of others until all execute the function simultaneously; interrupts can be synchronized by ensuring that all CPUs implement the interrupt at the same point in their instruction stream. Memory references via the separate CPU-to-memory busses are voted at the three separate ports of each of the memory modules. I/O functions are implemented using two identical I/O busses, each of which is separately coupled to only one of the memory modules. A number of I/O processors are coupled to both I/O busses. I/O devices are accessed through a pair of identical (redundant) I/O processors, but only one is designated to actively control a given device; in case of failure of one I/O processor, however, an I/O device can be accessed by the other one without system shutdown.
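The loose synchronization described in the abstract can be illustrated with a small timing model. This is a minimal sketch rather than the patented mechanism: it assumes each CPU needs a known number of cycles to reach each memory reference, and that a CPU arriving early simply stalls until the slowest CPU arrives. The function name `loosely_synchronize` and the cycle counts are hypothetical.

```python
from typing import List, Tuple


def loosely_synchronize(
    segment_cycles: List[List[int]],
) -> Tuple[List[int], List[List[int]]]:
    """Model CPUs that stall at each memory reference until all have arrived.

    segment_cycles[cpu][k] is the (assumed) number of cycles that CPU needs
    to get from the previous sync point to its k-th memory reference.
    Returns the time at which each reference is executed by all CPUs, and
    the stall cycles each CPU spent waiting at each reference.
    """
    n_refs = len(segment_cycles[0])
    if any(len(seg) != n_refs for seg in segment_cycles):
        raise ValueError("all CPUs must execute the same number of references")

    release_times: List[int] = []
    stalls: List[List[int]] = [[] for _ in segment_cycles]
    now = 0  # time at which all CPUs left the previous sync point
    for k in range(n_refs):
        arrivals = [now + seg[k] for seg in segment_cycles]
        release = max(arrivals)  # the slowest CPU sets the pace
        for cpu, arrival in enumerate(arrivals):
            stalls[cpu].append(release - arrival)
        release_times.append(release)
        now = release
    return release_times, stalls
```

In this model a CPU that runs ahead never executes a memory reference before the others; it only accumulates stall cycles, which is the sense in which the CPUs are "loosely" rather than clock-for-clock synchronized.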
29 Claims
1. A method of operating a computer system having multiple CPUs executing the same instruction stream, the CPUs each having local memory and also each accessing multiple global memory units storing identical data, comprising the steps of:
a) detecting an error in one of said CPUs;
b) isolating said one CPU from the system and continuing to execute said instruction stream and accessing said global memory units by the other ones of said CPUs;
c) reintegrating said one CPU after rendering said CPU operative by first bringing said one CPU into sync with said other ones of said CPUs by soft-resetting all of said multiple CPUs prior to continuing normal operation of said multiple CPUs, said soft-resetting non-destructively preserving the current state and the local memory of each said multiple CPU, then restoring the state and the local memory of said one CPU to be identical to the state and the local memory of the said other ones of the CPUs.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
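The isolate-and-reintegrate sequence of claim 1 can be sketched as a toy model. Everything here is an illustrative assumption: the `Cpu` class, the `isolate` and `reintegrate` helpers, and the representation of CPU state as a dictionary are hypothetical, and a real soft reset acts on hardware rather than on a counter.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cpu:
    name: str
    state: Dict[str, int] = field(default_factory=dict)
    local_memory: List[int] = field(default_factory=list)
    online: bool = True
    soft_resets: int = 0

    def soft_reset(self) -> None:
        # Non-destructive in this model: state and local memory are left
        # untouched; only the reset itself is recorded.
        self.soft_resets += 1


def isolate(cpu: Cpu) -> None:
    """Step b): take the faulty CPU out of the system."""
    cpu.online = False


def reintegrate(cpus: List[Cpu], repaired: Cpu) -> None:
    """Step c): soft-reset every CPU, then copy a good CPU's state across."""
    donors = [c for c in cpus if c.online and c is not repaired]
    if not donors:
        raise RuntimeError("no operative CPU to copy state from")
    for cpu in cpus:
        cpu.soft_reset()  # all CPUs, including the repaired one
    donor = donors[0]
    repaired.state = copy.deepcopy(donor.state)
    repaired.local_memory = list(donor.local_memory)
    repaired.online = True
```

The point of the sketch is the ordering: the good CPUs keep their state through the soft reset, and only afterwards is the repaired CPU's state and local memory made identical to theirs.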
11. A fault-tolerant computer system, comprising:
a) first, second and third CPUs of substantially identical configuration each having local memory, said first, second and third CPUs executing substantially the same instruction stream;
b) first and second global memory modules of substantially identical configuration, said first and second memory modules storing substantially the same data;
c) busses coupling each of the first, second and third CPUs individually to each of said first and second global memory modules whereby said first, second and third CPUs access said first and second global memory modules separately and in duplicate;
d) said CPUs continuing to execute said instruction stream even though one of said first, second and third CPUs is inoperative and continuing to access one of said first and second global memory modules even though the other is inoperative;
e) said one of said first, second and third CPUs which is inoperative being replaceable into the system without shutdown of the system while the other ones of said CPUs continue execution of said instruction stream;
f) said one of said first, second and third CPUs which is inoperative being rendered operative and restored to normal function in the system without shutdown of the system while the other ones of said CPUs continue execution of said instruction stream, all of said first, second and third CPUs being soft-reset prior to restoration of said inoperative CPU, said soft-reset non-destructively preserving the current state and local memory of said first, second and third CPUs;
g) said other of the global memory modules which is inoperative being replaceable into the system without shutdown of the system while said first, second and third CPUs continue to access the global memory module which is operative;
h) said other of the global memory modules which is inoperative being rendered operative and restored to normal function in the system without shutdown of the system while said first, second and third CPUs continue to access the global memory module which is operative.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25
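Elements b), d), g) and h) of claim 11 describe duplicated global memory that keeps serving accesses while one module is offline and is later restored. A minimal sketch of that behavior follows; the `MirroredMemory` class and its method names are hypothetical, and restoring a module by copying its peer's contents is an assumption made for illustration, not a detail taken from the claim.

```python
from typing import List


class MirroredMemory:
    """Two memory modules holding the same data (illustrative model only)."""

    def __init__(self, size: int) -> None:
        self.modules: List[List[int]] = [[0] * size, [0] * size]
        self.operative: List[bool] = [True, True]

    def write(self, addr: int, value: int) -> None:
        # Writes go to every operative module so the copies stay identical.
        for module, ok in zip(self.modules, self.operative):
            if ok:
                module[addr] = value

    def read(self, addr: int) -> int:
        # Any operative module can serve the read.
        for module, ok in zip(self.modules, self.operative):
            if ok:
                return module[addr]
        raise RuntimeError("no operative memory module")

    def take_offline(self, index: int) -> None:
        if not self.operative[1 - index]:
            raise RuntimeError("cannot take the last operative module offline")
        self.operative[index] = False

    def restore(self, index: int) -> None:
        # Assumption: the returning module is refreshed from its peer
        # before it rejoins, so both hold the same data again.
        peer = 1 - index
        if not self.operative[peer]:
            raise RuntimeError("peer module is not operative")
        self.modules[index] = list(self.modules[peer])
        self.operative[index] = True
```

Writes issued while one module is offline reach only the surviving module, which is why the restore step has to bring the returning module up to date before it counts as operative.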
26. A method of operating a computer system including the steps of:
a) executing the same instruction stream in first, second and third CPUs;
b) generating global memory accesses in each of said first, second and third CPUs at separate first, second and third global memory access busses;
c) storing duplicative data in first and second global memory modules having substantially identical address spaces within the address range of said CPUs, including executing accesses to each one of said first and second global memory modules via said first, second and third global memory access busses;
d) voting each one of said accesses in said first and second global memory modules when received from said first, second and third global memory access busses, said voting including comparing information representing said accesses;
e) allowing said accesses to be completed only where at least two of said global memory access busses present the same such information;
f) placing offline one of said first, second and third CPUs when a global memory access from said one is different from the other two upon said voting, then placing said one CPU back online without shutdown of the system after said one of the CPUs is rendered operative, said first, second and third CPUs being soft-reset such that the current state and local memory of each of said first, second and third CPUs are non-destructively preserved prior to continuing normal operation of said first, second and third CPUs.
Dependent claims: 27, 28, 29
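Steps d) to f) of claim 26 amount to a two-out-of-three vote on each memory access. The sketch below shows one way such a vote could be expressed; the function name `vote_access` and the encoding of an access as an `(operation, address, data)` tuple are assumptions for illustration and do not come from the patent text.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def vote_access(
    requests: Sequence[Hashable],
) -> Tuple[Optional[Hashable], Optional[int]]:
    """Vote on one access presented by three CPU-to-memory busses.

    Returns (agreed_access, dissenting_bus_index).
    agreed_access is None when no two busses agree, in which case the
    access is not allowed to complete. dissenting_bus_index is None when
    all three agree or when there is no majority at all.
    """
    if len(requests) != 3:
        raise ValueError("expected exactly three requests, one per bus")
    value, count = Counter(requests).most_common(1)[0]
    if count < 2:
        return None, None  # no majority: the access is blocked
    dissenters = [i for i, req in enumerate(requests) if req != value]
    return value, (dissenters[0] if dissenters else None)
```

A caller would complete the agreed access and, if a dissenting bus index is returned, place the CPU on that bus offline until it has been rendered operative again.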
Specification