Operations controller for a fault tolerant multiple node processing system

US 4,933,940 A
Filed: 05/12/1989
Issued: 06/12/1990
Est. Priority Date: 04/15/1987
Status: Expired due to Term

Abstract

A fault tolerator for an operations controller of a multiple node fault tolerant processing system, having a data memory for storing the content of all received error free messages, an error file for storing the content of all received inner node error reports, and an error handler for generating a base penalty count for each node based on the content of the errors recorded in the error file and for excluding from the operation of the multiple node processing system each node whose base penalty count exceeds an exclusion threshold. The fault tolerator also includes a synchronizer interface for passing the selected fields of the received messages to a synchronizer, a scheduler interface for passing selected information to a scheduler, and a message interface which stores the error free messages in the data memory and passes the selected fields of the messages to the synchronizer.
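
As a rough illustration of the penalty scheme the abstract describes, the sketch below accumulates a per-node base penalty count from recorded error reports and flags any node whose count exceeds an exclusion threshold. The names `ErrorReport`, `base_penalty_counts` and `nodes_to_exclude`, the per-report `penalty` field, and the threshold value are assumptions made for illustration; none of them comes from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorReport:
    node_id: int   # node the error is attributed to
    penalty: int   # hypothetical weight assigned to this error


def base_penalty_counts(error_file):
    """Sum the penalties recorded in the error file for each node."""
    counts = defaultdict(int)
    for report in error_file:
        counts[report.node_id] += report.penalty
    return dict(counts)


def nodes_to_exclude(counts, exclusion_threshold):
    """Return the nodes whose base penalty count exceeds the threshold."""
    return {node for node, count in counts.items() if count > exclusion_threshold}


# Illustrative data only: node 2 accumulates two errors, node 0 one.
error_file = [ErrorReport(2, 3), ErrorReport(2, 4), ErrorReport(0, 1)]
counts = base_penalty_counts(error_file)
excluded = nodes_to_exclude(counts, exclusion_threshold=5)
```

The comparison is strict (`>`) because the abstract speaks of a count that "exceeds" the threshold.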

74 Citations

18 Claims

1. In a multiple node fault tolerant processing system having a plurality of nodes wherein each node has an applications processor for executing a predetermined set of tasks and an operations controller for controlling its own node in coordination with all of the other nodes of said plurality of nodes through the exchange of inter-node messages and wherein said operations controller selects the tasks to be executed by the applications processor from said predetermined set of tasks, each operations controller having a plurality of subsystems including a message checker, a scheduler, a synchronizer and a voter, each of which is capable of detecting errors and generating internal error reports identifying each error detected, each operations controller further having at least two operating system states and operative to switch from one operating system state to another in response to the exclusion of a faulty node or the readmittance of a healthy node which changes the number of nodes operating in the processing system, a fault tolerator for said operations controller comprising:
- a message memory storing the content of all inter-node messages received by said operations controller;
  
  an error file storing the content of said internal error reports generated by said message checker, said scheduler, said synchronizer and said voter;
  
  error handler means for storing said error reports in said error file and for generating a base penalty count for each node of said plurality of nodes from the content of said error file, said base penalty count being indicative of the operational status of the associated node, said error handler means further having means for determining which nodes are faulty and for excluding such faulty nodes from participating in the operation of said multiple node processing system, in coordination with all of the other nodes in the system, through the exchange of inter-node messages, said inter-node messages including error messages containing the content of said error file for a particular node and a base penalty count message containing said base penalty count of each node; and
  
  interface means for storing all of the messages passed by the message checker in said message memory, for passing the identities of the faulty nodes to the scheduler and the synchronizer, and for passing all error reports to said error handler.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
- - 2. The fault tolerator of claim 1 wherein said interface means comprises:
    - synchronizer interface means for passing to the synchronizer selected fields of predetermined inter-node messages and the identity of the nodes excluded from the processing system generated by the error handler, and for passing to the error handler the error reports generated by the synchronizer;
      
      scheduler interface means for passing to the scheduler selected fields of predetermined inter-node messages and the identity of the nodes excluded from the processing system generated by the error handler and for passing to said error handler the error reports generated by the scheduler; and
      
      message checker interface means for storing in said message memory the messages passed by the message checker, for passing to said error handler the error report generated by the message checker and for passing to said scheduler interface means and said synchronizer interface means said selected fields of predetermined inter-node messages.
  - 3. The fault tolerator of claim 2 wherein said inter-node messages further include data value messages containing the data values generated by the applications processor, task interactive consistency messages containing a task completed vector identifying each node which completed a task during a predetermined time interval, a branch condition bit identifying which of two groups of successor tasks to the completed task are to be executed, and a task completed/started message containing a task identification code for the completed task, the branch condition bit of the completed task, and a task identification code for the started task, said message checker interface means comprising:
    - a context store storing for each node at least a flag bit indicating the receipt of a task interactive consistency message, the message type code of the received message, the branch condition bit from the task completed/started message and the task identification code of the task identified as completed and the task identified as started in the task completed/started message;
      
      task started detector means responsive to a task completed/started message for recording the task identification code of the completed task in said context store;
      
      branch condition detector means for recording the branch condition bit contained in a task completed/started message in said context store;
      
      task interactive consistency message detector means for setting the task interactive consistency flag bit in said context store for each node from which a task interactive consistency message was received;
      
      an error status buffer temporarily storing said error report received from the message checker;
      
      a branch condition register storing the branch condition bit contained in each task completed/started message;
      
      a task completed register storing the node identification codes of the nodes identified in each error free task completed/started message;
      
      a scheduler buffer storing the task identification code of each task started and the node identification code of the node that started the task contained in each received task completed/started message; and
      
      a store message module having means for appending to each received message an address identifying where it is to be stored in the message memory, said address being derived in part from said node identification code for that message stored in said context store.
  - 4. The fault tolerator of claim 3, wherein said message memory has at least two partitions, each of said at least two partitions being capable of storing a different message received from the same node, and wherein said at least two partitions are individually identified by a context bit;
    - said context store further storing a context bit for each node identifying in which of said two partitions the message from that node is to be stored;
      
      said store message module further having means for appending said context bit to said address; and
      
      wherein said error report generated by the message checker includes an error status byte identifying all the errors detected by the message checker, said message checker interface means includes an error status byte detector to complement said context bit in response to said error status byte identifying the received message as error free.
  - 5. The fault tolerator of claim 4, wherein the complementing of said context bit by said error status byte detector activates said context store to load into said scheduler buffer the task identification code of the task started and the node identification code of the node that started the task, to load into the task completed register the node identification code of the node reported as having completed a task, and to load into the branch condition register the branch condition bit of the task reported as completed.
  - 6. The fault tolerator of claim 5, wherein the synchronization of the nodes and the scheduling of tasks to be executed by the applications processor are based on a fundamental timing interval, and wherein predetermined inter-node messages are to be sent by each node at the beginning of each of said fundamental timing intervals, said store message module further including means for disabling said store message module from storing messages received from each node which did not send said predetermined inter-node message at the beginning of said fundamental timing interval.
  - 7. The fault tolerator of claim 6, wherein said inter-node messages further include an inter-node base penalty count message containing a current base penalty count for each node of said plurality of nodes in the processing system, and an inter-node error message containing the identity of the node which sent a message having errors, the identity of each error detected by the node that originated the error message, the current base penalty count stored for the node which sent the message having detected errors, and an incremental base penalty count having a value corresponding to the number and severity of the errors detected, said error handler comprising:
    - an error handling context store temporarily storing predetermined information used by the error handler in the processing of reported errors;
      
      error filer means for assembling said error reports received from all the subsystems of the operations controller to generate a set of formatted error codes and for storing said set of formatted error codes in said error file, said error filer means further having means for storing the number of errors reported in said error handler context store;
      
      a voter responsive to the base penalty count and error messages currently stored in said message memory, for generating a voted base penalty count for each node at the beginning of each of said fundamental timing intervals and for generating a voted incremental base penalty count for each node which sent a message having at least one reported error;
      
      a base penalty count store storing a current base penalty count for each node;
      
      error consistency checker means for recording said voted base penalty counts and said incremental base penalty counts in said base penalty count store to generate said current base penalty count, said error consistency checker means further having means for identifying as faulty each node whose current base penalty count exceeds an exclusion limit, and for generating a next system state vector identifying each faulty node which is to be excluded from further participation in the operation of the processing system; and
      
      error message generator means for generating said inter-node error message containing said formatted error codes, said current base penalty count, and said incremental base penalty count for each node having at least one error recorded in said error handler context store and for generating said inter-node base penalty count message containing the current base penalty count stored for each node in said base penalty count store.
  - 8. The fault tolerator of claim 7, wherein said error consistency checker means comprises:
    - means for storing said voted base penalty counts in said base penalty count store to generate said current base penalty count;
      
      adder means for adding said voted incremental base penalty count to said current base penalty count stored in said means for storing;
      
      a current system state register storing a current system state vector, said current system state vector having an exclusion bit for each node in the system, said exclusion bits identifying each of the nodes currently excluded from participating in the operation of the processing system;
      
      a next system state register storing a next system state vector, said next system state vector having an exclusion bit for each node identifying each of the nodes to be excluded from participating in the operation of the processing system in its next system state;
      
      threshold comparator means for comparing the current base penalty counts in said base penalty count store with an exclusion threshold value to set the exclusion bit in said next system state register for each node whose current base penalty count exceeds said exclusion threshold value; and
      
      means for passing said current system state and next system state vectors to said scheduler and synchronizer through said synchronizer and scheduler interface means.
  - 9. The fault tolerator of claim 8, wherein said error consistency checker means includes means for decrementing all of said current base penalty counts by a fixed number at predetermined intervals and wherein said threshold comparator means compares said decremented base penalty counts against a readmittance threshold value to reset the exclusion bit for each node whose decremented base penalty count is less than said readmittance threshold value.
  - 10. The fault tolerator of claim 9, wherein said error consistency checker means further includes validity checker means for generating a signal inhibiting said voted base penalty count and voted incremental base penalty count from being stored in said base penalty count store in response to said voted base penalty counts and voted incremental base penalty counts being generated by less than a majority of said non-faulty nodes.
  - 11. The fault tolerator of claim 10, wherein the reported error is an asymmetrical type of error which may be detected differently by each node in the processing system and each node may generate a different incremental base penalty count, said error filer means has means responsive to an error code identifying the error as an asymmetrical error for setting an asymmetric error flag in said error handler context store to signify the error is an asymmetric error, said validity checker having means responsive to said asymmetric error flag being set for preventing the generation of said signal inhibiting the writing of the voted incremental base penalty count into the base penalty count store.
  - 12. The fault tolerator of claim 11, wherein said voter has means for generating a missing message vector identifying each node which did not send a base penalty count or error message, and means for generating a deviance vector identifying each node which sent a base penalty count message whose base penalty count differed from said voted base penalty count by more than a first predetermined amount and identifying each node which sent an error message whose incremental base penalty count differed from said voted incremental count by more than a second predetermined amount, said validity checker comprising:
    - majority agree detector means for generating said write inhibit signal in response to said current system state vector, said missing message vector and said deviance vector indicating that the majority of nodes do not agree with the voted base penalty count or the voted incremental base penalty count;
      
      asymmetric error detector means for generating an asymmetric error signal in response to said asymmetric flag being set;
      
      an AND gate for passing said write inhibit signal to said base penalty count store in the absence of said asymmetric error signal; and
      
      error reporter means for generating an internal error report passed to said error filer, said internal error report identifying each node identified by the missing message vector as not having sent the required message and each node whose error and base penalty count messages had deviance errors as identified by said deviance vector, said error reporter means further being responsive to said asymmetric error signal to delete said deviance errors from said internal error report.
  - 13. The fault tolerator of claim 10, wherein said base penalty count store, said next system state register, and said current system state register are part of said error handler context store.
  - 14. The fault tolerator of claim 9, wherein the operations controller has a timing signal generator which generates a fundamental timing period signal defining a fundamental timing interval, a master period signal identifying a master timing period which is a multiple of said fundamental timing period, said error message generator means comprises:
    - means responsive to said master period signal for generating a base penalty count message containing the current base penalty count stored in said base penalty store for each node at the beginning of each master period; and
      
      means responsive to said fundamental timing period signal for generating an error message for each node which sent a message containing an error, said error message containing the set of formatted error codes stored for that node in said error file, the current base penalty count stored for that node in the base penalty count store, and an incremental base penalty count having a value determined by the number and severity of the reported errors.
  - 15. The fault tolerator of claim 14, wherein said error message generator means includes:
    - a penalty weight table storing predetermined penalty weight counts at a plurality of storage locations, each of said storage locations being addressable by a penalty weight pointer, the penalty weight count stored at each location of said plurality of storage locations being indicative of the severity of the errors in an associated formatted error code; and
      
      a group mapping table storing said penalty weight pointers, one pointer associated with a respective one of said plurality of storage locations of said penalty weight table, each of said penalty weight pointers being addressed by a respective one of said formatted error codes stored in said error file;
      
      wherein said error message generator means has means responsive to the recording of each of said formatted error codes in said error file for each node for addressing said group mapping table with said formatted error code to obtain said penalty weight pointer, and means for addressing said penalty weight table with said penalty weight pointer to obtain said penalty count, and means for summing said penalty counts obtained from said penalty weight table to generate said incremental base penalty count for each node.
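
The two-level lookup recited in claim 15 can be sketched as follows: a formatted error code addresses a group mapping table to obtain a penalty weight pointer, the pointer addresses a penalty weight table to obtain a penalty count, and the counts are summed per node. This is only a minimal sketch; the table contents, the error code values, and the function name `incremental_base_penalty_count` are hypothetical, since the claim gives the structure but not the values.

```python
# Hypothetical table contents; the claim describes the structure, not the values.
GROUP_MAPPING_TABLE = {   # formatted error code -> penalty weight pointer
    0x01: 0,
    0x02: 0,
    0x10: 1,
    0x20: 2,
}
PENALTY_WEIGHT_TABLE = [1, 4, 8]  # penalty weight pointer -> penalty weight count


def incremental_base_penalty_count(formatted_error_codes):
    """Sum the penalty weights for one node's formatted error codes."""
    total = 0
    for code in formatted_error_codes:
        pointer = GROUP_MAPPING_TABLE[code]      # first lookup: code -> pointer
        total += PENALTY_WEIGHT_TABLE[pointer]   # second lookup: pointer -> weight
    return total


# Illustrative error file: node id -> formatted error codes recorded for that node.
error_file = {3: [0x01, 0x10], 5: [0x20, 0x20, 0x02]}
increments = {
    node: incremental_base_penalty_count(codes)
    for node, codes in error_file.items()
}
```

One plausible reading of the indirection is that several error codes can share a single weight entry, so retuning the severity of a whole group of errors means changing one table location.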

16. A fault tolerator for a multiple node processing system having a plurality of nodes wherein each node has an applications processor for processing a predetermined set of tasks and an operations controller for controlling the operation of its own node in coordination with all of the other nodes through the exchange of inter-node messages, and wherein each operations controller selects the tasks to be processed by its own node's application processor, has means for detecting errors in said inter-node messages and its own operation, and has means for generating internal error reports identifying each detected error, said fault tolerator comprising:
- message memory means for storing the content of all error free inter-node messages;
  
  error file means for storing the content of said internal error reports; and
  
  error handler means having:
  
  means responsive to said content of said error reports filed in said error file means to generate a base penalty count for each node of said plurality of nodes;
  
  means for generating inter-node error messages containing the content of said error file, a current base penalty count and an incremental penalty count for each node which sent a message having an error;
  
  means for generating inter-node base penalty count messages containing the base penalty count generated for each node in the processing system;
  
  means for transmitting said inter-node error and inter-node base penalty count messages to every node in the processing system;
  
  voter means responsive to said inter-node base penalty count message and said inter-node error messages generated by said fault tolerator in each node in the processing system, for generating a voted base penalty count and a voted incremental base penalty count for each node which sent a message having at least one reported error;
  
  a base penalty count store storing said base penalty count for each node in the system;
  
  means for storing said voted base penalty count for each node in said base penalty count store;
  
  means for summing said voted incremental base penalty count for each node to said voted base penalty count stored in said base penalty count store to generate a new current base penalty count for each node;
  
  means for identifying as a faulty node, each node whose new current base penalty count exceeds a predetermined exclusion limit;
  
  means for generating a next state system vector containing the identity of each of said faulty nodes; and
  
  means for excluding from participation in the system each of said faulty nodes identified in said state system vector.
- Dependent Claims (17, 18)
- - 17. The fault tolerator of claim 16, further having means for periodically decrementing said current base penalty count stored in said base penalty count store to generate a decremented current base penalty count for each node in the system, and means for identifying as no longer being a faulty node, each faulty node whose decremented current base penalty count is less than a predetermined readmittance threshold value;
    - and wherein said means for generating a next state vector removes from said next state vector, each node previously identified as being faulty in response to its decremented current base penalty count being less than said predetermined threshold value.
  - 18. The fault tolerator of claim 17, wherein said multiple node processing system has at least two operating system states and wherein the operating system state of the multiple node processing system is controlled by said next system state vector.
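
The exclusion and readmittance behaviour of claims 16 and 17 might be sketched as below: a voted base penalty count plus a voted increment gives the new current count, a count above the exclusion limit marks the node faulty, and periodic decrementing eventually brings a faulty node below the readmittance threshold. The class name `PenaltyTracker`, the numeric limits, the clamp at zero, and the encoding of the next system state vector as an integer bit mask are all assumptions for illustration, and the voting step itself is taken as already done.

```python
from dataclasses import dataclass, field


@dataclass
class PenaltyTracker:
    exclusion_limit: int
    readmittance_threshold: int
    decrement: int
    counts: dict = field(default_factory=dict)  # node id -> current base penalty count
    faulty: set = field(default_factory=set)    # nodes currently marked for exclusion

    def apply_voted(self, node, voted_base, voted_increment):
        """Store the voted base count, add the voted increment, test the limit."""
        self.counts[node] = voted_base + voted_increment
        if self.counts[node] > self.exclusion_limit:
            self.faulty.add(node)

    def decay(self):
        """Decrement every count; readmit faulty nodes that fall below the threshold."""
        for node in self.counts:
            # Clamping at zero is an assumption, not something the claims state.
            self.counts[node] = max(0, self.counts[node] - self.decrement)
            if node in self.faulty and self.counts[node] < self.readmittance_threshold:
                self.faulty.discard(node)

    def next_state_vector(self, num_nodes):
        """One exclusion bit per node; bit i set means node i is to be excluded."""
        return sum(1 << node for node in self.faulty if node < num_nodes)


tracker = PenaltyTracker(exclusion_limit=10, readmittance_threshold=4, decrement=3)
tracker.apply_voted(node=1, voted_base=8, voted_increment=5)  # 13 exceeds the limit
tracker.apply_voted(node=0, voted_base=2, voted_increment=0)
vector_after_fault = tracker.next_state_vector(num_nodes=4)

for _ in range(4):  # node 1 decays 13 -> 10 -> 7 -> 4 -> 1
    tracker.decay()
vector_after_decay = tracker.next_state_vector(num_nodes=4)
```

Because the readmittance threshold sits below the exclusion limit in this sketch, a node has to stay error free for several decrement intervals before it is readmitted, which keeps a marginal node from flapping in and out of the system.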

Current Assignee: Alliedsignal Inc.
Original Assignee: Alliedsignal Inc.
Inventors: Kiekhafer, Roger M.; Finn, Alan M.; Walter, Chris J.
Primary Examiner(s): ATKINSON, CHARLES
Application Number: US07/351,876
Time in Patent Office: 396 Days
Field of Search: 371/9, 371/11, 371/16, 371/29, 371/9.1, 371/11.3, 371/16.1, 371/29.1, 371/16.5
US Class Current: 714/10
CPC Class Codes:
- G06F 11/0724: in a multiprocessor or a mu...
- G06F 11/076: by exceeding a count or rat...
- G06F 11/10: Adding special bits or symb...
- G06F 11/1425: by reconfiguration of node ...
- G06F 11/1482: by means of middleware or O...
- G06F 11/1658: Data re-synchronization of ...
- G06F 11/181: Eliminating the failing red...
- G06F 11/182: based on mutual exchange of...
- G06F 11/187: Voting techniques
- G06F 11/188: where exact match is not re...
- G06F 11/202: where processing functional...
- G06F 15/161: Computing infrastructure, e...
- G06F 9/4881: Scheduling strategies for d...
