Fault tolerant multi-node computing system using periodically fetched configuration status data to detect an abnormal node
Abstract
A fault tolerant computing system comprises a plurality of processing nodes interconnected by a communication medium for parallel-running identical application programs. A fault detector is connected to the processing nodes via the communication medium for periodically collecting configuration status data from the processing nodes and mutually verifying the collected configuration status data for detecting an abnormal node. In one preferred embodiment of this invention, the system operates in a version diversity mode in which the processing nodes are configured in a substantially equal configuration and the application programs are identical programs of uniquely different software versions. In a second preferred embodiment, the system operates in a configuration diversity mode in which the processing nodes are respectively configured in uniquely different configurations. The configurations of the processing nodes are sufficiently different from each other that a software fault is not simultaneously activated by the processing nodes.
33 Claims
1. A fault tolerant computing system comprising:
- a plurality of processing nodes interconnected by a communication medium, wherein said processing nodes are configured either in a substantially equal configurations for parallel-running a plurality of uniquely different versions of identical application programs or in a plurality of uniquely different configurations for parallel-running a plurality of substantially equal versions of identical application programs; and
- a fault detector connected to said processing nodes via said communication medium for periodically collecting configuration status data from said processing nodes and mutually verifying the configuration status data of said processing nodes with each other for detecting an abnormal node whose operating state is beyond a range of normal deviations of the uniquely different configuration of the node,
- wherein said fault detector comprises means for detecting configuration status data whose value differs significantly from a data set, based on a statistical test, and formed by the configuration status data of all of said processing nodes and identifying one of said processing nodes which provides the detected configuration status data, and
- wherein the configuration status data includes, at least, information regarding the processing node's operating state and memory usage.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
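The claim does not name a particular statistical test. As an illustration only, the following minimal sketch assumes a modified z-score (median and median absolute deviation) applied to the memory usage reported by each node. The function name `find_abnormal_nodes`, the node identifiers, and the threshold of 3.5 are hypothetical and do not come from the patent.

```python
from statistics import median


def find_abnormal_nodes(memory_usage, threshold=3.5):
    """Return ids of nodes whose reported value is an outlier.

    memory_usage: dict mapping node id -> memory usage (any numeric unit).
    The data set is formed from the values of all nodes, and each value is
    tested against it with a modified z-score (an assumed choice of test).
    """
    values = list(memory_usage.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that one outlier cannot inflate.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the nodes agree exactly; flag any node that differs.
        return [n for n, v in memory_usage.items() if v != med]
    return [
        n for n, v in memory_usage.items()
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical status data collected in one polling period (values in MB).
sample = {"node-a": 512, "node-b": 530, "node-c": 498, "node-d": 521, "node-e": 1900}
abnormal = find_abnormal_nodes(sample)
```

A median-based test is used here because, with only a handful of nodes, a single faulty node can distort a mean and standard deviation enough to hide itself. Other tests would fit the claim language equally well.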
18. The fault tolerant computing system comprising:
- a plurality of processing nodes interconnected by a communication medium, wherein said processing nodes are configured either in a substantially equal configurations for parallel-running a plurality of uniquely different versions of identical application programs or in a plurality of uniquely different configurations for parallel-running a plurality of substantially equal versions of identical application programs; and
- a fault detector connected to said processing nodes via said communication medium for periodically collecting configuration status data from said processing nodes and mutually verifying the configuration status data of said processing nodes with each other for detecting an abnormal node whose operating state is beyond a range of normal deviations of the uniquely different configuration of the node,
- a management node including a memory for storing distribution data, means for distributing request messages respectively to said processing nodes according to the stored distribution data, means for receiving response messages as said configuration status data from said processing nodes,
- wherein said fault detector is provided in said management node for verifying the received response messages with each other for detecting an abnormal node, and updating the stored distribution data if an abnormal node is detected,
- wherein said management node further comprises decision means for making a determination as to whether said request messages are destined for said plurality of processing nodes or destined for an external processing node, and means for transmitting the request messages to said processing nodes or said external processing node depending on said determination,
- wherein said decision means calculates an average value from said plurality of response messages and formulates a final response message with said average value, and
- wherein the configuration status data includes, at least, information regarding the processing node's operating state and memory usage.
Dependent claims: 19, 20, 21
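The management-node behavior recited in this claim (verify the responses, update the distribution data, average the responses into a final response) might be sketched as follows. This is a simplified illustration, not the patented implementation: the function `handle_cycle`, its parameters, and the decision to exclude abnormal nodes from the average are all assumptions, since the claim does not say which responses enter the average.

```python
from statistics import mean


def handle_cycle(distribution, responses, is_abnormal):
    """Process one round of response messages at a (hypothetical) management node.

    distribution: list of node ids that currently receive request messages.
    responses:    dict mapping node id -> numeric value from its response message.
    is_abnormal:  callable taking the responses dict and returning the ids of
                  abnormal nodes (stands in for the fault detector).
    Returns (final_value, updated_distribution).
    """
    abnormal = set(is_abnormal(responses))
    # Update the stored distribution data so abnormal nodes get no further requests.
    updated = [n for n in distribution if n not in abnormal]
    # Assumption: only responses from nodes judged normal are averaged.
    healthy = [v for n, v in responses.items() if n not in abnormal]
    final_value = mean(healthy) if healthy else None
    return final_value, updated


# Hypothetical round: node n4 returns a value far from the others.
final_value, updated = handle_cycle(
    ["n1", "n2", "n3", "n4"],
    {"n1": 10.0, "n2": 12.0, "n3": 11.0, "n4": 95.0},
    lambda r: {n for n, v in r.items() if v > 50},
)
```

The detector is passed in as a callable so that any verification rule can be substituted without changing the averaging and distribution-update steps.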
22. A management node for a computing system which comprises a plurality of processing nodes interconnected by a communication medium, wherein said processing nodes are arranged to either operate in a hardware diversity mode in which said nodes are set in a plurality of uniquely different configurations, or operate in a software diversity mode in which said nodes are set to parallel-run a plurality of uniquely different versions of identical application programs, comprising:
- means for periodically collecting configuration status data from said processing nodes via the communication medium; and
- a fault detector for mutually verifying the configuration status data of the processing nodes with each other for detecting an abnormal node whose operating state is beyond the extent of either said hardware diversity mode or said software diversity mode,
- wherein said fault detector comprises means for detecting configuration status data whose value differs significantly from a data set, based on a statistical test, formed by the configuration status data of all of said processing nodes and identifying one of said processing nodes which provides the detected configuration status data, and
- wherein the configuration status data includes, at least, information regarding the processing node's operating state and memory usage.
Dependent claims: 23, 24
25. A fault tolerant processing node for a computing system in which said processing node is one of a plurality of processing nodes interconnected by a communication medium, wherein said processing nodes are configured either in a substantially equal configurations for parallel-running a plurality of uniquely different versions of identical application programs or in a plurality of uniquely different configurations for parallel-running a plurality of substantially equal versions of identical application programs, comprising:
- means for periodically collecting configuration status data from other processing nodes of the computing system via the communication medium; and
- a fault detector for verifying configuration status data of the processing node with the configuration status data collected from said other processing nodes for detecting an abnormal node whose operating state is either beyond a range of normal deviations of the uniquely different version of the application program of the node or beyond a range of normal deviations of the uniquely different configuration of the node,
- wherein said fault detector comprises means for detecting configuration status data whose value differs significantly from a data set, based on statistical test, formed by the configuration status data of all of said processing nodes and identifying one of said processing nodes which provides the detected configuration status data, and
- wherein the configuration status data includes, at least, information regarding the processing node's operating state and memory usage.
Dependent claims: 26
27. A method of parallel-running a plurality of processing nodes interconnected by a communication medium, comprising:
- a) setting said processing nodes either in substantially equal configurations or in uniquely different configurations;
- b) parallel-running a plurality of uniquely different versions of identical application programs when said nodes are set in said substantially equal configurations or a plurality of substantially equal versions of said identical application programs when said nodes are set in said uniquely different configurations;
- c) periodically collecting configuration status data from said processing nodes; and
- d) mutually verifying the collected configuration status data of the processing nodes with each other for detecting an abnormal node whose operating state is either beyond a range of normal deviations of the uniquely different version of the application program of the node or beyond a range of normal deviations of the uniquely different configuration of the node,
- wherein step (d) comprises detecting configuration status data whose value differs significantly from a data set, based on statistical test, formed by the configuration status data of all of said processing nodes and identifying one of said processing nodes which provides the detected configuration status data, and
- wherein the configuration status data includes, at least, information regarding the processing node's operating state and memory usage.
Dependent claims: 28, 29, 30, 31, 32, 33
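Steps (a) and (b) pair each node setup with a matching choice of program versions. The sketch below checks which of the two pairings a given set of nodes satisfies, using the mode names from the abstract. It is illustrative only: the function `diversity_mode` and the sample labels are hypothetical, and exact equality stands in for the claim's "substantially equal", which the claim does not quantify.

```python
def diversity_mode(nodes):
    """Classify a node set by the diversity pairing it satisfies.

    nodes: list of (configuration, program_version) tuples, one per node.
    Returns "version diversity", "configuration diversity", or None when
    neither pairing from steps (a) and (b) holds.
    """
    if len(nodes) < 2:
        return None  # diversity needs at least two nodes to compare
    configs = [c for c, _ in nodes]
    versions = [v for _, v in nodes]
    same_config = len(set(configs)) == 1
    same_version = len(set(versions)) == 1
    unique_configs = len(set(configs)) == len(nodes)
    unique_versions = len(set(versions)) == len(nodes)
    if same_config and unique_versions:
        # Equal configurations running uniquely different versions.
        return "version diversity"
    if unique_configs and same_version:
        # Uniquely different configurations running equal versions.
        return "configuration diversity"
    return None


# Hypothetical setups.
mode_a = diversity_mode([("cfgA", "1.0"), ("cfgA", "1.1"), ("cfgA", "2.0")])
mode_b = diversity_mode([("cfgA", "1.0"), ("cfgB", "1.0")])
```

A check of this kind could run before steps (c) and (d), since the mutual verification assumes the nodes were deliberately diversified in one of the two ways.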
Specification