Fault tolerant multi-node computing system using periodically fetched configuration status data to detect an abnormal node

US 7,870,439 B2
Filed: 05/26/2004
Issued: 01/11/2011
Est. Priority Date: 05/28/2003
Status: Active Grant

Abstract

A fault tolerant computing system comprises a plurality of processing nodes interconnected by a communication medium for parallel-running identical application programs. A fault detector is connected to the processing nodes via the communication medium for periodically collecting configuration status data from the processing nodes and mutually verifying the collected configuration status data for detecting an abnormal node. In one preferred embodiment of this invention, the system operates in a version diversity mode in which the processing nodes are configured in a substantially equal configuration and the application programs are identical programs of uniquely different software versions. In a second preferred embodiment, the system operates in a configuration diversity mode in which the processing nodes are respectively configured in uniquely different configurations. The configurations of the processing nodes are sufficiently different from each other that a software fault is not simultaneously activated by the processing nodes.
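
As a rough illustration of the configuration diversity mode described above, the following sketch checks that every pair of nodes differs in at least one configuration parameter. The parameter names (`main_memory_mb`, `os`, `cpu_count`), the dictionary layout, and the function names are hypothetical; the patent does not prescribe a data format or a specific check.

```python
from itertools import combinations

def differing_parameters(config_a: dict, config_b: dict) -> set:
    """Return the parameter names whose values differ between two node configurations."""
    keys = set(config_a) | set(config_b)
    return {k for k in keys if config_a.get(k) != config_b.get(k)}

def is_configuration_diverse(configs: dict, min_differences: int = 1) -> bool:
    """True if every pair of nodes differs in at least `min_differences` parameters."""
    return all(
        len(differing_parameters(configs[a], configs[b])) >= min_differences
        for a, b in combinations(sorted(configs), 2)
    )

# Hypothetical node configurations (values are made up for illustration).
nodes = {
    "node1": {"main_memory_mb": 512, "os": "os-a", "cpu_count": 1},
    "node2": {"main_memory_mb": 1024, "os": "os-a", "cpu_count": 1},
    "node3": {"main_memory_mb": 512, "os": "os-b", "cpu_count": 2},
}

diverse = is_configuration_diverse(nodes)
```

Raising `min_differences` is one possible way to express "sufficiently different"; how large the difference must be to keep a software fault from being activated on all nodes at once is left open here.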

33 Claims

1. A fault tolerant computing system comprising:
- a plurality of processing nodes interconnected by a communication medium, wherein said processing nodes are configured either in a substantially equal configurations for parallel-running a plurality of uniquely different versions of identical application programs or in a plurality of uniquely different configurations for parallel-running a plurality of substantially equal versions of identical application programs; and
  
  a fault detector connected to said processing nodes via said communication medium for periodically collecting configuration status data from said processing nodes and mutually verifying the configuration status data of said processing nodes with each other for detecting an abnormal node whose operating state is beyond a range of normal deviations of the uniquely different configuration of the node, wherein said fault detector comprises means for detecting configuration status data whose value differs significantly from a data set, based on a statistical test, and formed by the configuration status data of all of said processing nodes and identifying one of said processing nodes which provides the detected configuration status data, and wherein the configuration status data includes, at least, information regarding the processing node's operating state and memory usage.
  - 2. The fault tolerant computing system of claim 1, wherein the configurations of said processing nodes are sufficiently different from each other so that a software fault is not simultaneously activated by said processing nodes.
  - 3. The fault tolerant computing system of claim 1, further comprising means for transmitting a report message to an external destination if an abnormal node is detected by said fault detector.
  - 4. The fault tolerant computing system of claim 1, further comprising means for transmitting a stop command message to one of said plurality of processing nodes if said one processing node is detected by said fault detector as an abnormal node.
  - 5. The fault tolerant computing system of claim 1, further comprising means for connecting a new processing node to said communication medium if an abnormal node is detected by said fault detector.
  - 6. The fault tolerant computing system of claim 1, further comprising means for causing said fault detector to periodically collect and verify configuration status data from said processing nodes a predetermined number of times for detecting stability of said processing nodes if the number of said processing nodes is more than a predetermined number, selecting at least one processing node, and transmitting a stop command message to the selected at least one processing node.
  - 7. The fault tolerant computing system of claim 6, further comprising means for transmitting a stop command message to one of said processing nodes if said one processing node is detected by said fault detector as an abnormal node while said configuration status data is periodically collected and verified for detecting said stability.
  - 8. The fault tolerant computing system of claim 1, further comprising a management node connected to said communication medium for respectively setting said processing nodes in a substantially unified configuration for parallel-running said plurality of uniquely different versions of identical application programs.
  - 9. The fault tolerant computing system of claim 1, further comprising a management node connected to said communication medium for respectively setting said processing nodes in said uniquely different configurations for parallel-running said plurality of substantially equal versions of identical application programs.
  - 10. The fault tolerant computing system of claim 1, wherein said fault detector is provided in each of said processing nodes, the fault detector of each of said processing nodes cooperating with the fault detector of every other node for identifying an abnormal node.
  - 11. The fault tolerant computing system of claim 1, wherein each of said uniquely different configurations comprises hardware configuration, software configuration, external device configuration and program startup configuration.
  - 12. The fault tolerant computing system of claim 11, wherein said hardware configuration is represented by status data which includes parameters indicating main memory size, virtual memory size, memory access timing, processor operating speed, bus transmission speed, bus width, number of processors, read cache memory size, write cache memory size, cache's valid/invalid status, and types of processors and memories, and wherein each of said processing nodes differs from every other processing node in one of said parameters of hardware configuration status data.
  - 13. The fault tolerant computing system of claim 11, wherein said software configuration is represented by status data which includes parameters indicating operating system, basic software, various device drivers, and various library versions, and wherein each of said processing nodes differs from every other processing node in one of said parameters of software configuration status data.
  - 14. The fault tolerant computing system of claim 11, wherein said external device configuration is represented by status data which includes parameters indicating types of external devices, display unit, input/output device, and communication interface, and wherein each of said processing nodes differs from every other processing node in one of said parameters of external device configuration status data.
  - 15. The fault tolerant computing system of claim 11, wherein said program startup configuration is represented by status data which includes parameters indicating (a) interruption and reboot of an application program by suspending the CPU while holding contents of main memory, (b) interruption and reboot of an application program by saving contents of the main and virtual memories on a hard disk, suspending operating system and restoring the saved memory contents when operating system is restarted, (c) interruption and reboot of an application program by suspending and restarting operating system, (d) interruption and reboot of an application program by forcibly interrupting and restarting the program and operating system, (e) restarting of an application program and operating system after reinstalling the program and operating system, and (f) restarting an application program and operating system after performing a clear install of the program and operating system, and wherein each of said processing nodes differs from every other processing node in one of said parameters of program startup configuration data.
  - 16. The fault tolerant computing system of claim 1, wherein each of said processing nodes is a real machine.
  - 17. The fault tolerant computing system of claim 1, wherein each of said processing nodes is a virtual machine.
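
The fault detector of claim 1 flags configuration status data that differs significantly, under a statistical test, from the data set formed by all nodes. The claim does not name a particular test, so the sketch below assumes one: the modified z-score based on the median and the median absolute deviation (MAD). The function name, the threshold of 3.5, and the sample memory-usage figures are illustrative assumptions only.

```python
from statistics import median

def find_abnormal_nodes(status: dict, threshold: float = 3.5) -> list:
    """Return ids of nodes whose value deviates strongly from the set of all nodes.

    Assumed test: modified z-score = 0.6745 * |x - median| / MAD.
    """
    values = list(status.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # At least half of the nodes report the median value exactly;
        # flag any node that does not.
        return sorted(n for n, v in status.items() if v != center)
    return sorted(
        n for n, v in status.items()
        if 0.6745 * abs(v - center) / mad > threshold
    )

# Hypothetical memory usage (MB) collected from five nodes in one polling cycle.
memory_usage_mb = {"node1": 210, "node2": 198, "node3": 205, "node4": 870, "node5": 202}
abnormal = find_abnormal_nodes(memory_usage_mb)  # ["node4"]
```

A median-based test is used here because a single faulty node cannot shift the median much, whereas it can inflate a mean and standard deviation enough to hide itself in a small cluster.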

18. The fault tolerant computing system comprising:
- a plurality of processing nodes interconnected by a communication medium, wherein said processing nodes are configured either in a substantially equal configurations for parallel-running a plurality of uniquely different versions of identical application programs or in a plurality of uniquely different configurations for parallel-running a plurality of substantially equal versions of identical application programs; and
  
  a fault detector connected to said processing nodes via said communication medium for periodically collecting configuration status data from said processing nodes and mutually verifying the configuration status data of said processing nodes with each other for detecting an abnormal node whose operating state is beyond a range of normal deviations of the uniquely different configuration of the node,
  
  a management node including a memory for storing distribution data, means for distributing request messages respectively to said processing nodes according to the stored distribution data, means for receiving response messages as said configuration status data from said processing nodes,
  
  wherein said fault detector is provided in said management node for verifying the received response messages with each other for detecting an abnormal node, and updating the stored distribution data if an abnormal node is detected,
  
  wherein said management node further comprises decision means for making a determination as to whether said request messages are destined for said plurality of processing nodes or destined for an external processing node, and means for transmitting the request messages to said processing nodes or said external processing node depending on said determination,
  
  wherein said decision means calculates an average value from said plurality of response messages and formulates a final response message with said average value, and
  
  wherein the configuration status data includes, at least, information regarding the processing node's operating state and memory usage.
  - 19. The fault tolerant computing system of claim 18, wherein said decision means selects one of said response messages as a system output.
  - 20. The fault tolerant computing system of claim 18, wherein said decision means makes a majority decision on said response messages which arrive in a specified time interval and selects one of the majority-decided response messages as a system output.
  - 21. The fault tolerant computing system of claim 18, wherein said decision means receives an external request message from an external source via said communication medium and sends the received message to one of said processing nodes.
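
Claims 18 and 20 describe a decision means that either averages the response messages or takes a majority decision over the responses arriving within a specified time interval. A minimal sketch of both follows, assuming responses are plain numeric values paired with an arrival time in seconds; the tuple layout and the function names are hypothetical and not taken from the patent.

```python
from collections import Counter
from statistics import mean

def average_response(responses: list) -> float:
    """Form a final response from the average of the node responses (claim 18)."""
    return mean(responses)

def majority_response(timed_responses: list, deadline: float):
    """Majority decision over responses that arrived by `deadline` (claim 20).

    `timed_responses` is a list of (arrival_seconds, value) pairs.
    Returns the value reported by more than half of the on-time responses,
    or None when there is no strict majority.
    """
    on_time = [value for arrival, value in timed_responses if arrival <= deadline]
    if not on_time:
        return None
    value, count = Counter(on_time).most_common(1)[0]
    return value if count > len(on_time) / 2 else None

# Hypothetical responses: three arrive in time, one is late and ignored.
responses = [(0.1, 42), (0.2, 42), (0.3, 41), (5.0, 41)]
output = majority_response(responses, deadline=1.0)  # 42
```

Requiring a strict majority is a design choice of this sketch; a real system would also need a policy for the no-majority case, which the claims do not spell out.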

22. A management node for a computing system which comprises a plurality of processing nodes interconnected by a communication medium, wherein said processing nodes are arranged to either operate in a hardware diversity mode in which said nodes are set in a plurality of uniquely different configurations, or operate in a software diversity mode in which said nodes are set to parallel-run a plurality of uniquely different versions of identical application programs, comprising:
- means for periodically collecting configuration status data from said processing nodes via the communication medium; and
  
  a fault detector for mutually verifying the configuration status data of the processing nodes with each other for detecting an abnormal node whose operating state is beyond the extent of either said hardware diversity mode or said software diversity mode, wherein said fault detector comprises means for detecting configuration status data whose value differs significantly from a data set, based on a statistical test, formed by the configuration status data of all of said processing nodes and identifying one of said processing nodes which provides the detected configuration status data, and wherein the configuration status data includes, at least, information regarding the processing node's operating state and memory usage.
  - 23. The management node of claim 22, wherein the configurations of said processing nodes are sufficiently different from each other that a software fault is not simultaneously activated by said processing nodes.
  - 24. The management node of claim 22, further comprising means for transmitting a report message to an external destination if an abnormal node is detected by said fault detector.

25. A fault tolerant processing node for a computing system in which said processing node is one of a plurality of processing nodes interconnected by a communication medium, wherein said processing nodes are configured either in a substantially equal configurations for parallel-running a plurality of uniquely different versions of identical application programs or in a plurality of uniquely different configurations for parallel-running a plurality of substantially equal versions of identical application programs, comprising:
- means for periodically collecting configuration status data from other processing nodes of the computing system via the communication medium; and
  
  a fault detector for verifying configuration status data of the processing node with the configuration status data collected from said other processing nodes for detecting an abnormal node whose operating state is either beyond a range of normal deviations of the uniquely different version of the application program of the node or beyond a range of normal deviations of the uniquely different configuration of the node, wherein said fault detector comprises means for detecting configuration status data whose value differs significantly from a data set, based on statistical test, formed by the configuration status data of all of said processing nodes and identifying one of said processing nodes which provides the detected configuration status data, and wherein the configuration status data includes, at least, information regarding the processing node's operating state and memory usage.
  - 26. The fault tolerant processing node of claim 25, wherein the configuration of said processing node is sufficiently different from other processing nodes of the computing system so that a software fault is not simultaneously activated by said processing nodes.

27. A method of parallel-running a plurality of processing nodes interconnected by a communication medium, comprising:
- a) setting said processing nodes either in substantially equal configurations or in uniquely different configurations;
  
  b) parallel-running a plurality of uniquely different versions of identical application programs when said nodes are set in said substantially equal configurations or a plurality of substantially equal versions of said identical application programs when said nodes are set in said uniquely different configurations;
  
  c) periodically collecting configuration status data from said processing nodes; and
  
  d) mutually verifying the collected configuration status data of the processing nodes with each other for detecting an abnormal node whose operating state is either beyond a range of normal deviations of the uniquely different version of the application program of the node or beyond a range of normal deviations of the uniquely different configuration of the node, wherein step (d) comprises detecting configuration status data whose value differs significantly from a data set, based on statistical test, formed by the configuration status data of all of said processing nodes and identifying one of said processing nodes which provides the detected configuration status data, and wherein the configuration status data includes, at least, information regarding the processing node's operating state and memory usage.
  - 28. The method of claim 27, wherein the configurations of said processing nodes are sufficiently different from each other that a software fault is not simultaneously activated by said processing nodes.
  - 29. The method of claim 27, further comprising transmitting a report message to an external destination if an abnormal node is detected by step (d).
  - 30. The method of claim 27, further comprising transmitting a stop command message to one of said plurality of processing nodes if said one processing node is detected by step (d) as an abnormal node.
  - 31. The method of claim 27, further comprising connecting a new processing node to said communication medium if an abnormal node is detected by step (d).
  - 32. The method of claim 27, further comprising periodically collecting and verifying configuration status data from said processing nodes a predetermined number of times for detecting stability of said processing nodes if the number of said processing nodes is more than a predetermined number, selecting at least one processing node, and transmitting a stop command message to the selected at least one processing node.
  - 33. The method of claim 32, further comprising transmitting a stop command message to one of said processing nodes if said one processing node is detected as an abnormal node while said configuration status data is periodically collected and verified for detecting said stability.
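
The stability check of claims 32 and 33 can be sketched as a loop: when more nodes are running than a predetermined number, status data is collected and verified a fixed number of times, any node found abnormal along the way is stopped, and surplus nodes are then selected and stopped. Everything concrete below is an assumption made for illustration: the callback signatures (`collect`, `find_abnormal`, `stop`), the parameter names, and the policy of stopping the last nodes in the list.

```python
import time

def run_stability_check(nodes, collect, find_abnormal, stop,
                        rounds, max_nodes, interval_s=1.0):
    """Return the ids of nodes that were sent a stop command, in order."""
    stopped = []
    active = list(nodes)
    if len(active) <= max_nodes:
        return stopped  # the check only runs when there are surplus nodes
    for _ in range(rounds):
        status = {n: collect(n) for n in active}   # collect status data
        for bad in find_abnormal(status):          # verify the data against each other
            stop(bad)                              # claim 33: stop an abnormal node
            stopped.append(bad)
            active.remove(bad)
        time.sleep(interval_s)                     # wait for the next polling period
    while len(active) > max_nodes:
        surplus = active.pop()                     # assumed policy: drop the last node
        stop(surplus)                              # claim 32: stop a selected node
        stopped.append(surplus)
    return stopped

# Hypothetical run: n3 reports an outlying value and is stopped during the check.
readings = {"n1": 200, "n2": 205, "n3": 900, "n4": 198}
sent = []
stopped = run_stability_check(
    nodes=["n1", "n2", "n3", "n4"],
    collect=readings.get,
    find_abnormal=lambda status: [n for n, v in status.items() if v > 500],
    stop=sent.append,
    rounds=3, max_nodes=2, interval_s=0.0,
)
```

In this run `stopped` is `["n3", "n4"]`: `n3` is stopped as abnormal in the first round, and `n4` is stopped afterwards to bring the node count down to `max_nodes`.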

Current Assignee: NEC Corporation
Original Assignee: NEC Corporation
Inventors: Fujiyama, Kenichiro; Nakamura, Nobutatsu
Primary Examiner(s): Beausoliel, Robert
Assistant Examiner(s): Mehrmanesh, Elmira

Application Number: US10/854,534
Publication Number: US 20040255185A1
Time in Patent Office: 2,421 Days
Field of Search: 714/4, 714/11, 714/12, 714/37, 714/55, 714/47
US Class Current: 714/47.1
CPC Class Codes:
G06F 11/006   Identification G06F11/2289 ...
G06F 11/1482   by means of middleware or O...
G06F 11/1641   where the comparison is not...
G06F 2201/815   Virtual
