Fault-tolerant computer system with online recovery and reintegration of redundant components
Abstract
A computer system in a fault-tolerant configuration employs multiple identical CPUs executing the same instruction stream, with multiple, identical memory modules in the address space of the CPUs storing duplicates of the same data. The system detects faults in the CPUs and memory modules, and places a faulty unit offline while continuing to operate using the good units. The faulty unit can be replaced and reintegrated into the system without shutdown. The multiple CPUs are loosely synchronized, as by detecting events such as memory references and stalling any CPU ahead of others until all execute the function simultaneously; interrupts can be synchronized by ensuring that all CPUs implement the interrupt at the same point in their instruction stream. Memory references via the separate CPU-to-memory busses are voted at the three separate ports of each of the memory modules. I/O functions are implemented using two identical I/O busses, each of which is separately coupled to only one of the memory modules. A number of I/O processors are coupled to both I/O busses. I/O devices are accessed through a pair of identical (redundant) processors, but only one is designated to actively control a given device; in case of failure of one I/O processor, however, an I/O device can be accessed by the other one without system shutdown.
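The voting of memory references at a memory module's three ports can be illustrated with a minimal sketch. It assumes three CPUs and a simple strict-majority rule; the function name `vote` and the `(operation, address, data)` tuple layout are hypothetical and not taken from the patent.

```python
from collections import Counter

def vote(requests):
    """Return (winning_request, dissenting_cpu_indices) for a list of per-CPU requests.

    Raises ValueError when no request is shared by a strict majority of CPUs.
    """
    counts = Counter(requests)
    winner, n = counts.most_common(1)[0]
    if n * 2 <= len(requests):
        raise ValueError("no majority among CPU requests")
    # CPUs whose request differs from the majority are candidates for being taken offline.
    dissenters = [i for i, r in enumerate(requests) if r != winner]
    return winner, dissenters

# Three CPUs each present a request at their port; CPU 2 disagrees on the data.
winner, faulty = vote([
    ("write", 0x1000, 7),
    ("write", 0x1000, 7),
    ("write", 0x1000, 9),
])
print(winner, faulty)
```

In this sketch the memory module would act on the majority request and report the dissenting CPU index so the fault can be handled while the good units keep running.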
14 Claims
1. In a computer system having a plurality of Central Processor Units (CPUs) and a plurality of components coupled to the CPUs, a method comprising:
detecting an interrupt indicating a fault in a system component always designated as a primary system component, the primary system component having at least one matching redundant component always designated as a backup component of the primary system component, where a redundant component receives, during normal operation, the same input data as the primary system component, but, during normal operation, does not output the data output by the primary system component;
isolating the fault in the primary system component;
taking the faulty primary system component off-line while maintaining system operations and, without affecting systems operation, using the redundant system component matching the faulty primary system component, where the redundant system component receives and outputs data that the faulty primary system component would have received and output, respectively;
upon repair or replacement of the faulty primary system component, reinitializing the repaired or replacement primary system component;
initiating a test procedure in the repaired or replacement primary system component;
reintegrating the repaired or replacement primary system component if the primary system component passes the test procedure; and
placing the repaired or replacement primary system component online if the reintegration step is successfully completed.
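The sequence of steps in claim 1 can be read as a small state machine. The following is a minimal sketch under that reading, not the patented implementation: the `State` names, the `Component` class, and the `self_test` and `reintegrate` callbacks are all hypothetical stand-ins for hardware and firmware procedures the claim does not detail.

```python
from enum import Enum, auto

class State(Enum):
    ONLINE = auto()
    OFFLINE = auto()
    REINITIALIZED = auto()
    TESTED = auto()

class Component:
    def __init__(self, name):
        self.name = name
        self.state = State.ONLINE

def recover(primary, self_test, reintegrate):
    """Walk a repaired or replacement primary through the claimed steps.

    self_test and reintegrate are caller-supplied callables returning bool.
    Returns True if the component ends up online, False if it stays offline.
    """
    # Fault detected and isolated: the primary goes offline, the backup carries on.
    primary.state = State.OFFLINE
    # After repair or replacement, reinitialize before anything else.
    primary.state = State.REINITIALIZED
    if not self_test(primary):
        primary.state = State.OFFLINE
        return False
    primary.state = State.TESTED
    # Reintegration is attempted only after the test procedure passes.
    if not reintegrate(primary):
        primary.state = State.OFFLINE
        return False
    # Online only if reintegration completed successfully.
    primary.state = State.ONLINE
    return True

module = Component("memory-module-0")
print(recover(module, lambda c: True, lambda c: True), module.state)
```

Each gate mirrors a conditional in the claim: reintegration depends on passing the test, and going online depends on successful reintegration.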
5. A fault-tolerant computing system, comprising:
a plurality of Central Processor Units (CPUs);
a first system component always designated as a primary component and coupled to the plurality of CPUs for performing a function in the computing system and for inputting and outputting data;
a second system component always designated as a backup component of the primary component and coupled to the plurality of CPUs, that receives, during normal operation, the same input data that the first system component receives, but that does not, during normal operation, output data when the first system component outputs data, the second system component therefore functioning as a redundant component;
means for detecting an interrupt indicating a fault in the first system component;
means for isolating the fault in the first system component; and
means for taking the first system component off-line while maintaining system operations and without affecting systems operation and, while continuing to use the primary component, using the second system component to perform the function of the first system component, where the second system component receives and outputs data that the first system component would have received and output, respectively, in response to data requests.
8. A fault-tolerant computing system, comprising:
a plurality of Central Processor Units (CPUs);
a first memory module always designated as a primary memory module and coupled to the plurality of CPUs for performing a function in the computing system and for inputting and outputting data;
a second memory module always designated as a backup memory module of the primary memory module and coupled to the plurality of CPUs, that receives the same input data that the first memory module receives, but that does not, during normal operation, output data when the first memory module outputs data, the second memory module therefore functioning as a redundant component;
means for detecting an interrupt indicating a fault in the first memory module;
means for isolating the fault in the first memory module; and
means for taking the first memory module off-line while maintaining system operations and without affecting systems operation and, while continuing to use the primary component, using the second memory module to perform the function of the first memory module, where the second memory module receives and outputs data that the first memory module would have received and output, respectively, in response to data requests.
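The primary and backup memory arrangement of claim 8 can be sketched as a mirrored store in which both modules receive every write but only one supplies read data. This is an illustrative model only: the classes `MemoryModule` and `MirroredMemory` and the method `on_fault_interrupt` are hypothetical, and the sketch assumes the reading in which the backup takes over output once the primary is taken off-line.

```python
class MemoryModule:
    def __init__(self, name):
        self.name = name
        self.cells = {}
        self.online = True

class MirroredMemory:
    def __init__(self):
        self.primary = MemoryModule("primary")
        self.backup = MemoryModule("backup")

    def write(self, addr, value):
        # Both modules receive the same input data while they are online.
        for module in (self.primary, self.backup):
            if module.online:
                module.cells[addr] = value

    def read(self, addr):
        # Only one module outputs data: the primary normally, the backup after a fault.
        source = self.primary if self.primary.online else self.backup
        return source.cells[addr], source.name

    def on_fault_interrupt(self):
        # Take the faulty primary off-line; reads and writes continue via the backup.
        self.primary.online = False

mem = MirroredMemory()
mem.write(0x10, 42)
print(mem.read(0x10))      # served by the primary
mem.on_fault_interrupt()
print(mem.read(0x10))      # same data, now served by the backup
```

Because the backup has been receiving every write all along, it already holds a current copy when the primary fails, which is what allows the switch without a shutdown.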
10. A computer program product, comprising:
a computer usable medium having computer readable code embodied therein for performing online recovery and reintegration of system components in a computer system having a plurality of Central Processor Units (CPUs) and a plurality of components coupled to the CPUs, the computer program product comprising;
computer readable code configured to cause a computer to effect detecting an interrupt indicating a fault in a system component always designated as a primary system component, the primary system component having at least one matching redundant component always designated as a backup component of the primary system component, where a redundant component receives, during normal operation, the same input data as the primary system component, but, during normal operation, does not output the data output by the primary system component;
computer readable code configured to cause a computer to effect isolating the fault in the primary system component;
computer readable code configured to cause a computer to effect taking the faulty primary system component off-line while maintaining system operations and without affecting systems operation and, while continuing to use the primary component, using the redundant system component matching the faulty primary system component, where the redundant system component receives and outputs data that the faulty primary system component would have received and output, respectively;
computer readable code configured to cause a computer to effect, upon repair or replacement of the faulty primary system component, reinitializing the repaired or replacement primary system component;
computer readable code configured to cause a computer to effect initiating a test procedure in the repaired or replacement primary system component;
computer readable code configured to cause a computer to effect reintegrating the repaired or replacement primary system component if the primary system component passes the test procedure; and
computer readable code configured to cause a computer to effect placing the repaired or replacement primary system component online if the reintegration step is successfully completed.
14. In a computer system having a plurality of Central Processor Units (CPUs) and a plurality of components coupled to the CPUs, a method comprising:
detecting an interrupt indicating a fault in one of a primary system component and a redundant system component, where the redundant component receives, during normal operation, the same input data as the primary system component, but, during normal operation, does not output the data output by the primary system component;
reading, by each of the plurality of CPUs, a respective interrupt cause register;
voting their interrupt cause registers, by each of the plurality of CPUs;
in accordance with the voting step, taking the faulty system component off-line without affecting system operations, the remaining, non-faulty system component handling both the read and write operation;
upon repair or replacement of the faulty system component, reinitializing the repaired or replacement system component;
initiating a test procedure in the repaired or replacement component;
reintegrating the repaired or replacement system component if the system component passes the test procedure; and
placing the repaired or replacement system component online if the reintegration step is successfully completed.
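The read-and-vote steps of claim 14 can be sketched as follows. This is a hedged illustration only: the constants `FAULT_PRIMARY` and `FAULT_BACKUP` are a hypothetical encoding (the source does not define a register layout), and `vote_cause` and `handle_fault_interrupt` are illustrative names. The sketch assumes each CPU contributes one cause-register value and that a strict majority decides.

```python
from collections import Counter

# Hypothetical cause-register values; the source does not define a bit layout.
FAULT_PRIMARY = 0x1
FAULT_BACKUP = 0x2

def vote_cause(cause_registers):
    """Return the cause value reported by a strict majority of CPUs, or None."""
    value, count = Counter(cause_registers).most_common(1)[0]
    return value if count * 2 > len(cause_registers) else None

def handle_fault_interrupt(cause_registers, online):
    """Take the component named by the voted cause off-line.

    online maps "primary" and "backup" to booleans and is updated in place.
    Returns the name of the component taken off-line, or None if the vote
    produced no majority or an unrecognized cause.
    """
    cause = vote_cause(cause_registers)
    if cause == FAULT_PRIMARY:
        faulty = "primary"
    elif cause == FAULT_BACKUP:
        faulty = "backup"
    else:
        return None
    # The remaining, non-faulty component now handles both reads and writes.
    online[faulty] = False
    return faulty

online = {"primary": True, "backup": True}
# Two of three CPUs read a backup fault; the third disagrees and is outvoted.
print(handle_fault_interrupt([FAULT_BACKUP, FAULT_BACKUP, FAULT_PRIMARY], online), online)
```

Voting before acting means a single CPU with a corrupted view of the cause register cannot by itself take a healthy component off-line.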
Specification