Method and apparatus for managing redundant computer-based systems for fault tolerant computing

US 6,178,522 B1
Filed: 08/25/1998
Issued: 01/23/2001
Est. Priority Date: 06/02/1998
Status: Expired due to Term

Abstract

A stand-alone Redundancy Management System (RMS) provides a cost-effective solution for managing redundant computer-based systems in order to achieve ultra-high system reliability, safety, fault tolerance, and mission success rate. The RMS includes a Cross Channel Data Link (CCDL) module and a Fault Tolerant Executive (FTE) module. The CCDL module provides data communication for all channels, while the FTE module performs system functions such as synchronization, data voting, fault and error detection, isolation and recovery. System fault tolerance is achieved by detecting and masking erroneous data through data voting, and system integrity is ensured by a dynamically reconfigurable architecture that is capable of excluding faulty nodes from the system and re-admitting healthy nodes back into the system.

9 Claims

1. A method for managing redundancy computer based systems having multiple hardware computing nodes comprising the steps of:
- providing a corresponding redundancy management system (RMS) to each computing node;
  
  establishing a communication link between each RMS;
  
  implementing a fault tolerant executive (FTE) module in each RMS for managing faults and a plurality of system functions;
  
  defining each computing node as a fault containment region;
  
  detecting faults/errors in data generated in a computing node, said detecting comprising the step of voting on data generated by each node to determine whether data generated by one node is different from a voted majority; and
  
  isolating a detected fault within the fault containment region to prevent propagation into another computing node, said isolating comprising using the step of voting on data as an output to mask a fault when data generated by a particular node is different from the voted majority.
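
The voting and masking steps of claim 1 can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `vote_and_mask`, the node labels, and the strict-majority rule are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, Hashable, Optional, Set, Tuple


def vote_and_mask(values: Dict[str, Hashable]) -> Tuple[Optional[Hashable], Set[str]]:
    """Vote on per-node data and return (voted value, dissenting node ids).

    Hypothetical helper: the voted value is the one reported by a strict
    majority of nodes. If no strict majority exists, None is returned and
    no node is blamed.
    """
    if not values:
        return None, set()
    winner, count = Counter(values.values()).most_common(1)[0]
    if count * 2 <= len(values):
        # No strict majority, so the fault cannot be attributed to any node.
        return None, set()
    # Nodes that disagree with the majority are the suspected faulty ones.
    dissenters = {node for node, value in values.items() if value != winner}
    return winner, dissenters


# Node C reports a different value; the voted output masks its fault.
voted, faulty = vote_and_mask({"A": 42, "B": 42, "C": 17})
```

Every consumer uses `voted` rather than its own node's raw value, which is how the erroneous value from node C is kept from propagating in this sketch.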

2. A method for managing redundancy computer-based systems having multiple hardware computing nodes comprising the steps of:
- providing a corresponding redundancy management system (RMS) to each computing node;
  
  establishing a communication link between each RMS;
  
  implementing a fault tolerant executive (FTE) module in each RMS for managing faults and a plurality of system functions;
  
  detecting faults/errors in data generated in a node and preventing propagation of a detected fault/error in data generated in a node;
  
  said steps of detecting and preventing comprising the steps of voting on data generated by each node to determine whether data generated by one node is different from a majority; and
  
  using the voted data as an output to mask a fault when data generated by a particular node is different from the voted majority;
  
  identifying a faulty node in response to the result of data voting;
  
  penalizing the identified faulty node by a global penalty system; and
  
  excluding the identified faulty node from an operating set of nodes when the faulty node's penalties exceed a user specified fault tolerance range.

3. The method as claimed in claim 2, further comprising the steps of:

4. The method as claimed in claim 3, wherein the predetermined threshold is defined by a system operator.
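
The penalty and exclusion steps of claim 2 might be sketched as follows. This is an illustrative assumption, not the patented mechanism: the class name `PenaltyTracker`, the integer penalty weights, and the single threshold are hypothetical, and the cross-node agreement that would make the penalty count "global" is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class PenaltyTracker:
    """Hypothetical per-RMS penalty table (names and weights are assumptions)."""

    threshold: int                      # stand-in for the user-specified limit
    penalty_per_fault: int = 1          # assumed weight of one voting disagreement
    operating_set: Set[str] = field(default_factory=set)
    penalties: Dict[str, int] = field(default_factory=dict)

    def record_vote(self, dissenters: Iterable[str]) -> Set[str]:
        """Penalize dissenting nodes and return the nodes excluded by this call."""
        excluded: Set[str] = set()
        for node in dissenters:
            if node not in self.operating_set:
                continue  # already excluded, nothing to penalize
            self.penalties[node] = self.penalties.get(node, 0) + self.penalty_per_fault
            if self.penalties[node] > self.threshold:
                self.operating_set.discard(node)
                excluded.add(node)
        return excluded


tracker = PenaltyTracker(threshold=2, operating_set={"A", "B", "C"})
first = tracker.record_vote({"C"})    # penalty 1, still within the limit
second = tracker.record_vote({"C"})   # penalty 2, still within the limit
third = tracker.record_vote({"C"})    # penalty 3 exceeds the limit, C is excluded
```

Accumulating penalties before excluding a node means a single transient disagreement does not shrink the operating set; only repeated faults do.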

5. A method for fault tolerant computing in computing environments having a plurality of computing nodes, comprising the steps of:
- implementing a corresponding redundancy management system (RMS) for each computing node independent from applications;
  
  communicating between each RMS; and
  
  maintaining an operating set (OPS) of nodes for increasing fault tolerance of the computing environment, said step of maintaining being performed in a fault tolerant executive (FTE) resident in the RMS and further comprising the steps of;
  
  receiving data at each RMS from every node connected in the computing environment;
  
  determining at each RMS whether data received from any one node contains faults;
  
  excluding a node which generated data that is faulty with respect to other received data; and
  
  re-configuring the operating set to not include the faulty node;
  
  said step of determining further comprising the steps of;
  
  setting a tolerance range for faulty data;
  
  voting on all received data from each node; and
  
  identifying a node having faulty data that exceeds the set tolerance range.

7. The method as claimed in claim 5, wherein said step of voting is performed at every minor frame boundary in the data transmission.
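
The tolerance-range voting recited in claim 5 can be illustrated for numeric data where an exact match between nodes is not expected. The sketch below is an assumption, not the claimed algorithm: it uses the median as the voted value, and the names `vote_with_tolerance` and `tolerance` are hypothetical.

```python
from statistics import median
from typing import Dict, Set, Tuple


def vote_with_tolerance(readings: Dict[str, float], tolerance: float) -> Tuple[float, Set[str]]:
    """Return (voted value, nodes whose data deviates beyond the tolerance).

    Assumption: the voted value is the median of all received readings.
    An empty `readings` dict raises statistics.StatisticsError.
    """
    voted = median(readings.values())
    outliers = {
        node for node, value in readings.items()
        if abs(value - voted) > tolerance
    }
    return voted, outliers


# A and B agree within 0.5; C deviates by far more and is flagged.
voted_value, outliers = vote_with_tolerance(
    {"A": 10.0, "B": 10.2, "C": 14.9}, tolerance=0.5
)
```

With an odd number of nodes the median is always a value that some node actually reported, so a single wild reading cannot pull the voted output outside the range of the healthy nodes.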

6. A method for fault tolerant computing in computing environments having a plurality of computing nodes, comprising the steps of:
- implementing a corresponding redundancy management system (RMS) for each computing node independent from applications;
  
  communicating between each RMS;
  
  maintaining an operating set (OPS) of nodes for increasing fault tolerance of the computing environment, said step of maintaining being performed in a fault tolerant executive (FTE) resident in the RMS and further comprising the steps of;
  
  receiving data at each RMS from every node connected in the computing environment;
  
  determining at each RMS whether data received from any one node contains faults; and
  
  reconfiguring the operating set to not include the faulty node;
  
  monitoring data on the excluded node; and
  
  re-admitting the excluded node into the operating set when the monitored data indicates the correction of the faulty data on the excluded node.
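
The monitoring and re-admission steps of claim 6 might look like the following sketch. The rule used here, re-admitting a node after a fixed number of consecutive frames in which its data matches the voted value, is an assumption; the claim only requires that monitored data indicate the fault has been corrected. `ReadmissionMonitor` and `required_clean_frames` are hypothetical names.

```python
from typing import Dict, Hashable, Set


class ReadmissionMonitor:
    """Hypothetical monitor that tracks clean frames for excluded nodes."""

    def __init__(self, required_clean_frames: int) -> None:
        self.required = required_clean_frames
        self.clean_streak: Dict[str, int] = {}

    def observe(self, operating_set: Set[str], excluded: Set[str],
                values: Dict[str, Hashable], voted: Hashable) -> Set[str]:
        """Update streaks for excluded nodes and return those re-admitted."""
        readmitted: Set[str] = set()
        for node in sorted(excluded):
            if node in values and values[node] == voted:
                self.clean_streak[node] = self.clean_streak.get(node, 0) + 1
            else:
                self.clean_streak[node] = 0  # any disagreement resets the streak
            if self.clean_streak[node] >= self.required:
                readmitted.add(node)
        for node in readmitted:
            excluded.discard(node)
            operating_set.add(node)
            del self.clean_streak[node]
        return readmitted


ops, out = {"A", "B"}, {"C"}
monitor = ReadmissionMonitor(required_clean_frames=2)
frame1 = monitor.observe(ops, out, {"A": 1, "B": 1, "C": 1}, voted=1)
frame2 = monitor.observe(ops, out, {"A": 1, "B": 1, "C": 1}, voted=1)
```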

8. A method for fault tolerant computing in computing environments having a plurality of computing nodes, comprising the steps of:
- implementing a corresponding redundancy management system (RMS) for each computing node independent from applications;
  
  communicating between each RMS; and
  
  maintaining an operating set (OPS) of nodes for increasing fault tolerance of the computing environment, said step of maintaining being performed in a fault tolerant executive (FTE) resident in the RMS and comprising the steps of;
  
  receiving data at each RMS from every node connected in the computing environment;
  
  determining at each RMS whether data received from any one node contains faults;
  
  excluding a node which generated data that is faulty with respect to other received data; and
  
  reconfiguring the operating set to not include the faulty node, said step of reconfiguring being performed at every major frame boundary in the data transmission.
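
The timing recited in claims 7 and 8 (voting at every minor frame boundary, reconfiguring at every major frame boundary) can be pictured with a small scheduling sketch. It assumes that a major frame consists of a fixed number of minor frames; that ratio, the function name `run_frames`, and the callback parameters are hypothetical and do not come from the claims.

```python
from typing import Callable, List


def run_frames(total_minor_frames: int, minor_per_major: int,
               on_minor: Callable[[int], None],
               on_major: Callable[[int], None]) -> None:
    """Call on_minor at each minor frame boundary and on_major at each
    major frame boundary (assumed to fall every `minor_per_major` frames)."""
    for frame in range(1, total_minor_frames + 1):
        on_minor(frame)                          # e.g. exchange and vote on data
        if frame % minor_per_major == 0:
            on_major(frame // minor_per_major)   # e.g. reconfigure the operating set


log: List[str] = []
run_frames(
    total_minor_frames=4,
    minor_per_major=2,
    on_minor=lambda f: log.append(f"vote@{f}"),
    on_major=lambda m: log.append(f"reconfig@{m}"),
)
```

Deferring reconfiguration to the slower major frame gives every node the same set of voting results before the operating set changes, which is one plausible reason for separating the two rates.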

9. An apparatus for managing redundancy computer-based systems having multiple hardware computing nodes comprising:
- means for providing a corresponding redundancy management system (RMS) to each computing node;
  
  means for establishing a communication link between each RMS comprising a cross channel data link connected to each redundancy management system in each computing node;
  
  means for implementing a fault tolerant executive (FTE) module in each RMS for managing faults and a plurality of system functions;
  
  means for detecting faults/errors in data generated in any one node, said detecting means comprising means for voting on data generated by each node for determining whether data generated by one node is different from a voted majority;
  
  means for isolating a detected fault/error when the node from which the fault/error was generated, said isolating means comprising means for using the voted data to mask a fault generated by one node that is different from the voted majority;
  
  means for penalizing an identified faulty node by a global penalty system; and
  
  means for excluding the identified faulty node from an operating set of nodes when the faulty node's penalties exceed a user specified fault tolerance range.

Current Assignee: Alliedsignal Inc. (Honeywell International Inc.)
Original Assignee: Alliedsignal Inc. (Honeywell International Inc.)
Inventors: Ernst, James W.; Bolduc, Louis P.; Peng, Dar-Tzen; Zhou, Jeffrey Xiaofeng; Roden, Thomas Gilbert III; Younis, Mohamed
Primary Examiner(s): Le, Dieu Minh T
Application Number: US09/140,174
Time in Patent Office: 882 Days
Field of Search: 714/12, 714/10, 714/11, 714/4, 714/7, 714/797, 714/798
US Class Current: 714/12
CPC Class Codes:
G06F 11/181   Eliminating the failing red...
G06F 11/182   based on mutual exchange of...
G06F 11/188   where exact match is not re...
