Method and apparatus for managing redundant computer-based systems for fault tolerant computing
Abstract
A stand-alone Redundancy Management System (RMS) provides a cost-effective solution for managing redundant computer-based systems in order to achieve ultra-high system reliability, safety, fault tolerance, and mission success rate. The RMS includes a Cross Channel Data Link (CCDL) module and a Fault Tolerant Executive (FTE) module. The CCDL module provides data communication for all channels, while the FTE module performs system functions such as synchronization, data voting, fault and error detection, isolation and recovery. System fault tolerance is achieved by detecting and masking erroneous data through data voting, and system integrity is ensured by a dynamically reconfigurable architecture that is capable of excluding faulty nodes from the system and re-admitting healthy nodes back into the system.
9 Claims
-
1. A method for managing redundancy computer-based systems having multiple hardware computing nodes comprising the steps of:
-
providing a corresponding redundancy management system (RMS) to each computing node;
establishing a communication link between each RMS;
implementing a fault tolerant executive (FTE) module in each RMS for managing faults and a plurality of system functions;
defining each computing node as a fault containment region;
detecting faults/errors in data generated in a computing node, said detecting comprising the step of voting on data generated by each node to determine whether data generated by one node is different from a voted majority; and
isolating a detected fault within the fault containment region to prevent propagation into another computing node, said isolating comprising using the step of voting on data as an output to mask a fault when data generated by a particular node is different from the voted majority.
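The voting and masking steps of claim 1 can be illustrated with a minimal sketch. It assumes exact-match majority voting over one data item per node; the function name `vote` and the node labels are hypothetical and do not come from the patent.

```python
from collections import Counter

def vote(node_data):
    """Return (voted_value, deviating_nodes) for one data item.

    node_data maps a node id to the value that node generated.
    Assumption: values are compared for exact equality.
    """
    counts = Counter(node_data.values())
    value, n = counts.most_common(1)[0]
    if n * 2 <= len(node_data):
        # No strict majority, so there is no voted value to mask with.
        return None, set()
    # Nodes whose data differs from the voted majority.
    deviating = {node for node, v in node_data.items() if v != value}
    return value, deviating

# The voted value is used as the output, masking node C's data.
voted, deviating = vote({"A": 42, "B": 42, "C": 17})
```

Using the voted value as the output means the deviating node's data never leaves its own fault containment region.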
-
-
2. A method for managing redundancy computer-based systems having multiple hardware computing nodes comprising the steps of:
-
providing a corresponding redundancy management system (RMS) to each computing node;
establishing a communication link between each RMS;
implementing a fault tolerant executive (FTE) module in each RMS for managing faults and a plurality of system functions;
detecting faults/errors in data generated in a node and preventing propagation of a detected fault/error in data generated in a node;
said steps of detecting and preventing comprising the steps of voting on data generated by each node to determine whether data generated by one node is different from a majority; and
using the voted data as an output to mask a fault when data generated by a particular node is different from the voted majority;
identifying a faulty node in response to the result of data voting;
penalizing the identified faulty node by a global penalty system; and
excluding the identified faulty node from an operating set of nodes when the faulty node's penalties exceed a user specified fault tolerance range.
3. The method as claimed in claim 2, further comprising the steps of:
monitoring data on the excluded node to determine whether the excluded node qualifies for re-admission into an operating set; and
re-admitting the excluded node into the operating set when the monitoring indicates acceptable performance of the node within a predetermined threshold.
-
-
4. The method as claimed in claim 3, wherein the predetermined threshold is defined by a system operator.
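One way to picture the penalty, exclusion, and re-admission steps of claims 2 to 4 is the sketch below. It is an illustration under stated assumptions, not the patented implementation: the class `PenaltyTracker`, the one-penalty-per-disagreement rule, and the "consecutive clean rounds" re-admission criterion are all hypothetical, and the sketch runs in one place rather than being agreed globally across nodes.

```python
class PenaltyTracker:
    """Track penalties per node and maintain the operating set."""

    def __init__(self, nodes, exclude_limit, readmit_clean_rounds):
        self.ops = set(nodes)                  # current operating set
        self.penalty = {n: 0 for n in nodes}   # accumulated penalties
        self.clean = {n: 0 for n in nodes}     # consecutive clean rounds
        self.exclude_limit = exclude_limit                 # user-specified range
        self.readmit_clean_rounds = readmit_clean_rounds   # operator threshold

    def record(self, deviating):
        """Update state after one voting round.

        deviating is the set of nodes that disagreed with the majority.
        """
        for n in self.penalty:
            if n in deviating:
                self.penalty[n] += 1
                self.clean[n] = 0
            else:
                self.clean[n] += 1
            if n in self.ops and self.penalty[n] > self.exclude_limit:
                # Penalties exceed the allowed range: exclude the node.
                self.ops.discard(n)
            elif n not in self.ops and self.clean[n] >= self.readmit_clean_rounds:
                # Excluded node has behaved acceptably: re-admit it.
                self.penalty[n] = 0
                self.ops.add(n)

tracker = PenaltyTracker(["A", "B", "C"], exclude_limit=2, readmit_clean_rounds=3)
for _ in range(3):
    tracker.record({"C"})      # C disagrees three rounds in a row
excluded_after_faults = "C" not in tracker.ops
for _ in range(3):
    tracker.record(set())      # C's monitored data is now clean
```

The excluded node keeps being monitored through the same voting rounds, which is what lets it qualify for re-admission.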
-
5. A method for fault tolerant computing in computing environments having a plurality of computing nodes, comprising the steps of:
-
implementing a corresponding redundancy management system (RMS) for each computing node independent from applications;
communicating between each RMS; and
maintaining an operating set (OPS) of nodes for increasing fault tolerance of the computing environment, said step of maintaining being performed in a fault tolerant executive (FTE) resident in the RMS and further comprising the steps of:
receiving data at each RMS from every node connected in the computing environment;
determining at each RMS whether data received from any one node contains faults;
excluding a node which generated data that is faulty with respect to other received data; and
re-configuring the operating set to not include the faulty node;
said step of determining further comprising the steps of:
setting a tolerance range for faulty data;
voting on all received data from each node; and
identifying a node having faulty data that exceeds the set tolerance range.
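The determining steps of claim 5 (set a tolerance range, vote on all received data, identify the node outside the range) might look like the following for numeric data. The claim does not name a voting algorithm; taking the median as the voted value is an assumption made here for illustration, and the function names are hypothetical.

```python
from statistics import median

def vote_with_tolerance(node_data, tolerance):
    """Vote on numeric data and flag nodes outside the tolerance range.

    Assumption: the voted value is the median of all received values.
    """
    voted = median(node_data.values())
    faulty = {n for n, v in node_data.items() if abs(v - voted) > tolerance}
    return voted, faulty

def reconfigure(ops, faulty):
    """Return a new operating set that does not include the faulty nodes."""
    return ops - faulty

received = {"A": 10.0, "B": 10.2, "C": 13.5}
voted, faulty = vote_with_tolerance(received, tolerance=0.5)
ops = reconfigure({"A", "B", "C"}, faulty)
```

A tolerance range matters because redundant nodes sampling the same sensor rarely produce bit-identical values, so small differences should not count as faults.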
-
-
6. A method for fault tolerant computing in computing environments having a plurality of computing nodes, comprising the steps of:
-
implementing a corresponding redundancy management system (RMS) for each computing node independent from applications;
communicating between each RMS;
maintaining an operating set (OPS) of nodes for increasing fault tolerance of the computing environment, said step of maintaining being performed in a fault tolerant executive (FTE) resident in the RMS and further comprising the steps of:
receiving data at each RMS from every node connected in the computing environment;
determining at each RMS whether data received from any one node contains faults; and
reconfiguring the operating set to not include the faulty node;
monitoring data on the excluded node; and
re-admitting the excluded node into the operating set when the monitored data indicates the correction of the faulty data on the excluded node.
-
-
8. A method for fault tolerant computing in computing environments having a plurality of computing nodes, comprising the steps of:
-
implementing a corresponding redundancy management system (RMS) for each computing node independent from applications;
communicating between each RMS; and
maintaining an operating set (OPS) of nodes for increasing fault tolerance of the computing environment, said step of maintaining being performed in a fault tolerant executive (FTE) resident in the RMS and comprising the steps of:
receiving data at each RMS from every node connected in the computing environment;
determining at each RMS whether data received from any one node contains faults;
excluding a node which generated data that is faulty with respect to other received data; and
reconfiguring the operating set to not include the faulty node, said step of reconfiguring being performed at every major frame boundary in the data transmission.
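Claim 8 restricts reconfiguration to major frame boundaries. A minimal sketch of that timing rule follows, assuming a major frame is made up of a fixed number of shorter frames; the class `OperatingSet`, the term "minor frame", and the count of four are hypothetical and not taken from the patent.

```python
class OperatingSet:
    """Defer operating-set changes until a major frame boundary."""

    def __init__(self, nodes, minor_frames_per_major):
        self.members = set(nodes)
        self.pending = set()          # nodes flagged faulty, not yet removed
        self.minor_frames_per_major = minor_frames_per_major
        self.frame = 0                # minor frames elapsed so far

    def flag_faulty(self, node):
        """Mark a node for exclusion at the next major frame boundary."""
        self.pending.add(node)

    def end_minor_frame(self):
        """Advance one minor frame; reconfigure only on a major boundary.

        Returns True when a major frame boundary was reached.
        """
        self.frame += 1
        if self.frame % self.minor_frames_per_major == 0:
            self.members -= self.pending
            self.pending.clear()
            return True
        return False

ops = OperatingSet({"A", "B", "C"}, minor_frames_per_major=4)
ops.flag_faulty("C")
mid_frame = [ops.end_minor_frame() for _ in range(3)]   # no boundary yet
still_member = "C" in ops.members
at_boundary = ops.end_minor_frame()                     # fourth frame: boundary
```

Applying membership changes only at a fixed boundary gives every node the same point in the data transmission at which the operating set can change.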
-
-
9. An apparatus for managing redundancy computer-based systems having multiple hardware computing nodes comprising:
-
means for providing a corresponding redundancy management system (RMS) to each computing node;
means for establishing a communication link between each RMS comprising a cross channel data link connected to each redundancy management system in each computing node;
means for implementing a fault tolerant executive (FTE) module in each RMS for managing faults and a plurality of system functions;
means for detecting faults/errors in data generated in any one node, said detecting means comprising means for voting on data generated by each node for determining whether data generated by one node is different from a voted majority;
means for isolating a detected fault/error within the node from which the fault/error was generated, said isolating means comprising means for using the voted data to mask a fault generated by one node that is different from the voted majority;
means for penalizing an identified faulty node by a global penalty system; and
means for excluding the identified faulty node from an operating set of nodes when the faulty node's penalties exceed a user specified fault tolerance range.
-
Specification