Operations controller for a fault tolerant multiple node processing system
Abstract
A fault tolerator for an operations controller of a multiple node fault tolerant processing system, having a data memory for storing the content of all received error-free messages, an error file for storing the content of all received inner node error reports, and an error handler for generating a base penalty count for each node based on the content of the errors recorded in the error file and for excluding from the operation of the multiple node processing system each node whose base penalty count exceeds an exclusion threshold. The fault tolerator also includes a synchronizer interface for passing selected fields of the received messages to a synchronizer, a scheduler interface for passing selected information to a scheduler, and a message interface which stores the error-free messages in the data memory and passes the selected fields of the messages to the synchronizer.
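The penalty scheme summarized in the abstract can be illustrated with a minimal sketch. The per-subsystem weights, the threshold value, and the function names are all assumptions made for illustration; the abstract states only that a base penalty count is derived from the error file and compared against an exclusion threshold.

```python
from collections import defaultdict

# Hypothetical penalty weight per reporting subsystem (not given in the source).
PENALTY_WEIGHTS = {"message_checker": 1, "scheduler": 2, "synchronizer": 2, "voter": 3}
# Hypothetical exclusion threshold (not given in the source).
EXCLUSION_THRESHOLD = 4


def base_penalty_counts(error_file):
    """Sum assumed penalty weights per node from (node_id, subsystem) entries."""
    counts = defaultdict(int)
    for node_id, subsystem in error_file:
        counts[node_id] += PENALTY_WEIGHTS[subsystem]
    return dict(counts)


def nodes_to_exclude(counts, threshold=EXCLUSION_THRESHOLD):
    """Return the nodes whose base penalty count exceeds the threshold."""
    return sorted(node for node, count in counts.items() if count > threshold)


# Example error file: node 2 was reported twice, node 1 once.
error_file = [(2, "voter"), (2, "synchronizer"), (1, "message_checker")]
counts = base_penalty_counts(error_file)
excluded = nodes_to_exclude(counts)
```

A count equal to the threshold does not trigger exclusion in this sketch, following the abstract's wording that the count must exceed the threshold.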
74 Citations
18 Claims
1. In a multiple node fault tolerant processing system having a plurality of nodes wherein each node has an applications processor for executing a predetermined set of tasks and an operations controller for controlling its own node in coordination with all of the other nodes of said plurality of nodes through the exchange of inter-node messages and wherein said operations controller selects the tasks to be executed by the applications processor from said predetermined set of tasks, each operations controller having a plurality of subsystems including a message checker, a scheduler, a synchronizer and a voter, each of which is capable of detecting errors and generating internal error reports identifying each error detected, each operations controller further having at least two operating system states and operative to switch from one operating system state to another in response to the exclusion of a faulty node or the readmittance of a healthy node which changes the number of nodes operating in the processing system, a fault tolerator for said operations controller comprising:
- a message memory storing the content of all inter-node messages received by said operations controller;
- an error file storing the content of said internal error reports generated by said message checker, said scheduler, said synchronizer and said voter;
- error handler means for storing said error reports in said error file and for generating a base penalty count for each node of said plurality of nodes from the content of said error file, said base penalty count being indicative of the operational status of the associated node, said error handler means further having means for determining which nodes are faulty and for excluding such faulty nodes from participating in the operation of said multiple node processing system, in coordination with all of the other nodes in the system, through the exchange of inter-node messages, said inter-node messages including error messages containing the content of said error file for a particular node and a base penalty count message containing said base penalty count of each node; and
- interface means for storing all of the messages passed by the message checker in said message memory, for passing the identities of the faulty nodes to the scheduler and the synchronizer, and for passing all error reports to said error handler.

Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
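As a rough illustration of the three routing duties claim 1 assigns to the interface means, the following sketch models the message memory, the error file, and the hand-off of faulty-node identities as plain in-memory lists. The class name, method names, and data shapes are hypothetical; the claim does not specify any of them.

```python
from dataclasses import dataclass, field


@dataclass
class FaultToleratorSketch:
    """Hypothetical model of the claim 1 data paths; not the patented design."""
    message_memory: list = field(default_factory=list)
    error_file: list = field(default_factory=list)
    scheduler_inbox: list = field(default_factory=list)
    synchronizer_inbox: list = field(default_factory=list)

    def on_checked_message(self, message):
        # Messages passed by the message checker are stored in message memory.
        self.message_memory.append(message)

    def on_error_report(self, report):
        # Every internal error report is handed to the error handler,
        # which files it in the error file.
        self.error_file.append(report)

    def announce_faulty_nodes(self, faulty_nodes):
        # Faulty-node identities go to both the scheduler and the synchronizer.
        ids = frozenset(faulty_nodes)
        self.scheduler_inbox.append(ids)
        self.synchronizer_inbox.append(ids)


ft = FaultToleratorSketch()
ft.on_checked_message({"sender": 1, "payload": "task-complete"})
ft.on_error_report({"node": 2, "source": "voter"})
ft.announce_faulty_nodes([2])
```

Using the same immutable set for both inboxes keeps the scheduler and the synchronizer working from an identical view of which nodes are excluded.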
16. A fault tolerator for a multiple node processing system having a plurality of nodes wherein each node has an applications processor for processing a predetermined set of tasks and an operations controller for controlling the operation of its own node in coordination with all of the other nodes through the exchange of inter-node messages, and wherein each operations controller selects the tasks to be processed by its own node's application processor, has means for detecting errors in said inter-node messages and its own operation, and has means for generating internal error reports identifying each detected error, said fault tolerator comprising:
- message memory means for storing the content of all error free inter-node messages;
- error file means for storing the content of said internal error reports; and
- error handler means having:
  - means responsive to said content of said error reports filed in said error file means to generate a base penalty count for each node of said plurality of nodes;
  - means for generating inter-node error messages containing the content of said error file, a current base penalty count and an incremental penalty count for each node which sent a message having an error;
  - means for generating inter-node base penalty count messages containing the base penalty count generated for each node in the processing system;
  - means for transmitting said inter-node error and inter-node base penalty count messages to every node in the processing system;
  - voter means responsive to said inter-node base penalty count message and said inter-node error messages generated by said fault tolerator in each node in the processing system, for generating a voted base penalty count and a voted incremental base penalty count for each node which sent a message having at least one reported error;
  - a base penalty count store storing said base penalty count for each node in the system;
  - means for storing said voted base penalty count for each node in said base penalty count store;
  - means for summing said voted incremental base penalty count for each node to said voted base penalty count stored in said base penalty count store to generate a new current base penalty count for each node;
  - means for identifying as a faulty node, each node whose new current base penalty count exceeds a predetermined exclusion limit;
  - means for generating a next state system vector containing the identity of each of said faulty nodes; and
  - means for excluding from participation in the system each of said faulty nodes identified in said state system vector.

Dependent claims: 17, 18
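The vote, sum, and compare sequence recited in claim 16 can be sketched as follows. The claim does not say how the voter combines the values reported by each node, so the low median used here is an assumption, as are the exclusion limit, the message layout, the treatment of a missing incremental count as zero, and every name in the code.

```python
from statistics import median_low

# Hypothetical exclusion limit; the claim says only "predetermined".
EXCLUSION_LIMIT = 10


def vote(values):
    """Assumed voting rule: low median of the values reported by all nodes."""
    return median_low(values)


def next_state(base_msgs, incr_msgs, limit=EXCLUSION_LIMIT):
    """Compute new base penalty counts and the next state system vector.

    base_msgs and incr_msgs map sender node -> {subject node: count}.
    A sender that reports no incremental count for a subject is taken as 0.
    """
    subjects = sorted({s for msg in base_msgs.values() for s in msg})
    store = {}
    for subject in subjects:
        voted_base = vote([m[subject] for m in base_msgs.values() if subject in m])
        voted_incr = vote([m.get(subject, 0) for m in incr_msgs.values()])
        # New current base penalty count = voted base + voted increment.
        store[subject] = voted_base + voted_incr
    # The next state system vector lists every node over the exclusion limit.
    vector = sorted(s for s, count in store.items() if count > limit)
    return store, vector


# Three nodes report on each other; node 1 disagrees about its own count.
base_msgs = {
    0: {0: 0, 1: 8, 2: 1},
    1: {0: 0, 1: 2, 2: 1},
    2: {0: 0, 1: 8, 2: 1},
}
incr_msgs = {0: {1: 4}, 1: {}, 2: {1: 4}}
store, vector = next_state(base_msgs, incr_msgs)
```

In this example the two healthy nodes outvote node 1's own lower report, so node 1 ends with a count of 12 and is the only entry in the next state system vector.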
Specification