Fault tolerant computer system
Abstract
Fault tolerant computer system and method requiring reduced inter-unit communications. A primary system is arranged to execute event processes in response to received commands. Each time the execution of an event process is halted, due to normal termination or an interrupt, an event generator generates an event message indicating the type of event process and the reason or timing for halting the event process. The event message is used to instruct a backup system to execute the same event process. Since the event message also specifies the reason and the timing for halting the event process, the execution of the event process can be replicated at the backup system. Thus, the primary system and the at least one backup system will be synchronized. At least one standby system may be provided for recording in an event log the sequence of event messages, and for storing an archive copy of memory contents of the primary system. The event log with the archive copy may be used to restore the system state of the primary system.
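As a rough illustration of the scheme summarized in the abstract, the following Python sketch has a primary emit one small event message per halted event process and a backup re-execute the same processes from those messages. Everything named here (EVENT_PROCESSES, EventMessage, Primary, Backup, the counter state) is hypothetical, and the sketch assumes deterministic event processes that always halt by normal termination.

```python
from collections import deque
from dataclasses import dataclass


def increment(state):
    state["counter"] += 1


def reset(state):
    state["counter"] = 0


# Hypothetical table of event processes, keyed by type.
EVENT_PROCESSES = {"increment": increment, "reset": reset}


@dataclass(frozen=True)
class EventMessage:
    process_type: str  # type of the halted event process
    halt_reason: str   # reason for halting, e.g. "terminated"


class Primary:
    def __init__(self, link):
        self.state = {"counter": 0}
        self.link = link  # stands in for the inter-unit connection

    def handle_command(self, process_type):
        EVENT_PROCESSES[process_type](self.state)
        # One small message per halt, instead of copying memory contents.
        self.link.append(EventMessage(process_type, "terminated"))


class Backup:
    def __init__(self, link):
        self.state = {"counter": 0}
        self.link = link  # doubles as the backup's message buffer

    def drain(self):
        # Re-execute the same event processes in the order received.
        while self.link:
            message = self.link.popleft()
            EVENT_PROCESSES[message.process_type](self.state)


link = deque()
primary, backup = Primary(link), Backup(link)
for command in ["increment", "increment", "reset", "increment"]:
    primary.handle_command(command)
backup.drain()
```

After the backup drains its buffer, both systems hold the same state, which is the synchronization property the abstract describes.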
27 Claims
1. A fault tolerant computer system, comprising:
a primary system connected to external devices, including;
a primary central processing unit for executing event processes, each of the event processes being a process executed upon the occurrence of a command at the primary system;
primary memory means connected to the primary central processing unit for storing system data and application data; and
an event generator connected to the primary central processing unit for generating an event message each time the primary central processing unit halts the execution of a halted one of the event processes, the event message at least including information about the type of the halted event process and the reason for halting the execution of the halted event process; and
at least one backup system connected to the primary system, including;
a backup central processing unit for executing the event processes;
backup memory means connected to the backup central processing unit for storing the system data and the application data;
a buffer for receiving and intermediately storing a sequence of the event messages from the primary system; and
backup control means connected to the backup central processing unit, for scheduling the execution of the event processes in accordance with the event messages.
first means for generating first event data indicative of the execution of one of the event processes at the primary system;
second means for generating second event data indicative of the execution of the same one of the event processes at the at least one backup system; and
means for detecting a system fault based on a comparison of the first and second event data, and, in case a system fault at the primary system is detected, for selecting one of the at least one backup systems to assume function as a new primary system.
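One way to picture the comparison-based fault detection above is the following Python sketch. It assumes the event data are hashable summaries, for example a tuple of sequence number and number of instructions executed, and it declares the primary faulty only when a strict majority of backups agree with each other and disagree with the primary. The function name, the majority rule, and the tie-break are assumptions made for illustration; the claim itself does not specify them. Note that with a single backup a mismatch alone cannot show which side is at fault, so this sketch simply sides with the backup in that case.

```python
from collections import Counter


def select_new_primary(primary_data, backup_data):
    """Return the name of the backup to promote, or None if no primary fault is seen.

    primary_data: hashable event data from the primary (first event data).
    backup_data: dict mapping backup name to its event data (second event data).
    """
    if not backup_data:
        return None
    majority, votes = Counter(backup_data.values()).most_common(1)[0]
    primary_disagrees = majority != primary_data
    backups_agree = votes > len(backup_data) / 2
    if primary_disagrees and backups_agree:
        # Assumed tie-break: promote the agreeing backup with the smallest name.
        return min(name for name, data in backup_data.items() if data == majority)
    return None
```

For example, select_new_primary((7, 99), {"b1": (7, 100), "b2": (7, 100)}) selects "b1", while identical event data on all systems yields None.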
4. The fault tolerant computer system according to claim 1, wherein event data about the execution of the halted event process at the primary system is included into the corresponding event message.
5. The fault tolerant computer system according to claim 1, further comprising at least one standby system, including:
first standby memory means for receiving and storing an archive copy of the system data and the application data; and
second standby memory means for recording, after the archive copy was generated, a sequence of the event messages in an event log.
6. The fault tolerant computer system according to claim 5, wherein the standby system further includes:
a standby central processing unit connected to the first standby memory means and second standby memory means; and
standby control means connected to the standby central processing unit, for scheduling the execution of a sequence of the event processes corresponding to the sequence of event messages stored in the event log.
7. The fault tolerant computer system according to claim 1, wherein the external devices are regional processors or distributed central processors of a distributed system.
8. The fault tolerant computer system according to claim 1, wherein the event processes are constituted by at least one of the group consisting of:
executing a command from a regional processor;
executing a command from a distributed central processor;
scanning of a job table due to a timer interrupt; and
execution of an internal command of the primary system.
9. The fault tolerant computer system according to claim 1, wherein the event message further includes at least one of the group including:
a sequence number indicating an execution sequence of the halted event process;
number of instructions executed;
register states upon occurrence of an interrupt; and
information regarding data defined or accessed by the halted event process.
10. The fault tolerant computer system according to claim 1, wherein upon detection of a software fault at the primary system, the event message includes information specifying the software fault, and the backup system skips execution of at least part of the corresponding halted event process.
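The skipping behaviour recited here can be sketched as follows. This is a simplified Python illustration, not the claimed implementation: it represents event messages as dictionaries, uses a hypothetical "software_fault" key to carry the fault information, and skips the whole faulted event process, whereas the claim only requires skipping at least part of it.

```python
def replay(messages, handlers, state):
    """Apply buffered event messages to the backup state, skipping faulted ones."""
    skipped = []
    for message in messages:
        if message.get("software_fault") is not None:
            # The primary hit a software fault here; do not repeat it on the backup.
            skipped.append(message["type"])
            continue
        handlers[message["type"]](state)
    return skipped


def add_subscriber(state):
    state["subscribers"] += 1


handlers = {"add_subscriber": add_subscriber}
state = {"subscribers": 0}
messages = [
    {"type": "add_subscriber", "reason": "terminated"},
    {"type": "add_subscriber", "reason": "fault", "software_fault": "range error"},
    {"type": "add_subscriber", "reason": "terminated"},
]
skipped = replay(messages, handlers, state)
```

The point of the design is that the backup avoids deterministically reproducing the same software fault that halted the process at the primary.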
11. The fault tolerant computer system according to claim 1, further including:
a plurality of processing nodes, each including at least one of the group including;
a primary system of a first processing node;
a backup system of a second processing node;
a standby system of a third processing node; and
means for interconnecting all processing nodes.
12. The fault tolerant computer system according to claim 11, wherein the primary processing unit, the backup processing unit and at least one of the plurality of processing nodes are constituted by a single processor.
13. A fault tolerant computer system, comprising:
a primary system connected to external devices, including;
a primary central processing unit for executing event processes, each of the event processes being a process executed upon the occurrence of a command at the primary system;
primary memory means connected to the primary central processing unit for storing system data and application data; and
an event generator connected to the primary central processing unit for generating an event message each time the primary central processing unit halts the execution of one of the event processes, the event message at least including information about the type of halted event process and the reason for halting the execution of the halted event process;
at least one backup system connected to the primary system, including;
a backup central processing unit for executing the event processes;
backup memory means connected to the backup central processing unit for storing the system data and the application data;
a buffer for receiving and intermediately storing a sequence of the event messages from the primary system;
backup control means connected to the backup central processing unit, for scheduling the execution of the event processes in accordance with the event messages; and
at least one standby system, including;
first standby memory means for receiving and storing an archive copy of the system data and the application data; and
second standby memory means for recording, after the archive copy was generated, a sequence of the event messages in an event log.
14. A fault tolerant computer system, comprising:
a primary system connected to external devices, including;
a primary central processing unit for executing event processes, each of the event processes being a process executed upon the occurrence of a command at the primary system;
primary memory means connected to the primary central processing unit for storing system data and application data;
an event generator connected to the primary central processing unit for generating an event message each time the primary central processing unit halts the execution of one of the event processes, the event message at least including information about the type of halted event process and the reason for halting the execution of the halted event process;
at least one backup system connected to the primary system, including;
a backup central processing unit for executing the event processes;
backup memory means connected to the backup central processing unit for storing the system data and the application data;
a buffer for receiving and intermediately storing a sequence of the event messages from the primary system;
backup control means connected to the backup central processing unit, for scheduling the execution of the event processes in accordance with the event messages; and
wherein upon detection of a software fault at the primary system, the event message includes information specifying the software fault, and the backup system skips execution of at least part of the corresponding halted event process.
15. A method for fault tolerant operation of a computer system, including a primary system and at least one backup system, comprising the steps of:
at the primary system;
executing event processes by a primary central processing unit, each of the event processes being a process executed upon the occurrence of a command at the primary system;
generating an event message each time the primary central processing unit halts the execution of one of the event processes, the event message at least including information about the type of the halted event process and the reason for halting execution of the halted event process;
transmitting each event message to the at least one backup system;
at the at least one backup system;
recording and intermediately storing the event messages from the primary system in a buffer;
scheduling the execution of the event processes of the corresponding event messages at the buffer; and
executing the event processes by the backup central processing unit in accordance with the event messages.
generating first event data indicative of the execution of one of the event processes at the primary system;
generating second event data indicative of the execution of the same event process at the at least one backup system; and
detecting a system fault based on a comparison of the first and second event data; and
in case a system fault of the primary system is detected, selecting one of the at least one backup systems to assume function as a new primary system.
18. The method of fault tolerant operation of a computer system according to claim 15, further comprising the steps of:
receiving and storing, at at least one standby system, an archive copy of the system data and the application data from the primary system; and
recording, in an event log at the at least one standby system, a sequence of the event messages, generated at the primary system after the archive copy was generated.
19. The method of fault tolerant operation of a computer system according to claim 18, further comprising the steps of:
scheduling, in case at least one of the standby systems has to assume functions as a backup system, the execution of a sequence of the event processes corresponding to the event messages stored in the event log; and
executing the event processes specified by the event messages at the standby central processing unit and applying corresponding changes to the archive copy.
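The standby recovery steps above amount to replaying the event log on top of the archive copy. A minimal Python sketch, assuming each logged entry is a pair of event process type and an argument describing the data it touched (the restore function, the deposit process, and the balance state are all hypothetical):

```python
import copy


def deposit(state, amount):
    state["balance"] += amount


def restore(archive_copy, event_log, processes):
    """Rebuild the current state from an archive copy plus the logged events."""
    state = copy.deepcopy(archive_copy)  # leave the archive itself untouched
    for process_type, argument in event_log:
        processes[process_type](state, argument)
    return state


processes = {"deposit": deposit}
archive_copy = {"balance": 100}
# Events recorded after the archive copy was generated.
event_log = [("deposit", 5), ("deposit", 20)]
restored = restore(archive_copy, event_log, processes)
```

Only events logged after the archive copy was taken are replayed, which is why the claims tie the start of the event log to the moment the archive copy was generated.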
20. The method of fault tolerant operation of a computer system according to claim 15, wherein the event processes are constituted by at least one of the group including:
executing a command from a regional processor;
executing a command from a distributed central processor;
scanning of a job table due to a timer interrupt; and
execution of an internal function of the primary system.
21. The method of fault tolerant operation of a computer system according to claim 15, wherein the event message further includes at least one of the group including:
a sequence number indicating an execution sequence of the halted event process;
number of instructions executed;
register states upon occurrence of an interrupt; and
information regarding data defined or accessed by the halted event process.
22. The method of fault tolerant operation of a computer system according to claim 15, wherein upon detection of a software fault at the primary system, the event message includes information specifying the software fault, and the backup system skips execution of at least part of the corresponding halted event process.
23. The method of fault tolerant operation of a computer system according to claim 15, wherein the at least one backup system executes the event processes in the order of reception of the corresponding event messages at the buffer or as specified by a sequence number indicating the execution sequence of the event processes at the primary system.
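For the sequence-number variant of this ordering, a backup buffer could hold out-of-order messages until the next expected sequence number arrives. The Python sketch below is one possible arrangement using a min-heap; the class name ReorderBuffer and the assumption that sequence numbers are consecutive integers are illustrative choices, not details taken from the claims.

```python
import heapq


class ReorderBuffer:
    """Release event messages strictly in sequence-number order."""

    def __init__(self, first_seq=0):
        self._heap = []         # (sequence number, process type), smallest first
        self._next = first_seq  # next sequence number allowed to execute

    def receive(self, seq, process_type):
        heapq.heappush(self._heap, (seq, process_type))

    def ready(self):
        # Pop every message whose turn has come; stop at the first gap.
        released = []
        while self._heap and self._heap[0][0] == self._next:
            released.append(heapq.heappop(self._heap)[1])
            self._next += 1
        return released
```

If messages always arrive in order, the buffer degenerates to a plain first-in, first-out queue, which corresponds to the order-of-reception alternative in the claim.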
24. The method of fault tolerant operation of a computer system according to claim 15, wherein selecting the backup system to assume functions as a new primary system includes:
deciding which of the event processes was the last successfully executed one;
transmitting information on the last successfully executed event process to the at least one backup system and the at least one standby system; and
sending takeover messages to the at least one backup system and the at least one standby system.
25. The method of fault tolerant operation of a computer system according to claim 15, wherein the primary system communicates to an external device only after the at least one backup system completes execution of a previous one of the event processes and a system fault was not detected.
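The condition in this claim resembles an output gate: external output is held at the primary until the backup reports that it has completed the corresponding event process without a detected fault. A small Python sketch under that reading, with a hypothetical OutputGate class and integer sequence numbers as assumptions:

```python
from collections import deque


class OutputGate:
    """Hold outputs for external devices until the backup has caught up."""

    def __init__(self):
        self._pending = deque()  # (sequence number, output) in issue order
        self.sent = []           # outputs released to the external device

    def queue(self, seq, output):
        self._pending.append((seq, output))

    def on_backup_report(self, seq, fault_detected):
        if fault_detected:
            return  # keep outputs held; fault handling would take over here
        # Release everything the backup has completed up to and including seq.
        while self._pending and self._pending[0][0] <= seq:
            self.sent.append(self._pending.popleft()[1])
```

Holding output this way means an external device never observes a result that the backup could not reproduce after a takeover.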
26. A method of fault tolerant operation of a computer system, including a primary system, at least one backup system and at least one standby system, comprising the steps of:
at the primary system;
executing event processes by a primary central processing unit, each of the event processes being a process executed upon the occurrence of a command at the primary system;
generating an event message each time the primary central processing unit halts the execution of one of the event processes, the event message at least including information about the type of the halted event process and the reason for halting execution of the halted event process;
transmitting each event message to at least one backup system;
at the at least one backup system;
recording and intermediately storing the event messages from the primary system in a buffer;
scheduling the execution of the event processes of the corresponding event messages at the buffer;
executing the event processes by the backup central processing unit in accordance with the event messages;
at the at least one standby system;
receiving and storing an archive copy of the system data and the application data from the primary system; and
recording, in an event log at the at least one standby system, a sequence of the event messages, generated at the primary system after the archive copy was generated.
27. A method of fault tolerant operation of a computer system, including a primary system and at least one backup system, comprising the steps of:
at the primary system;
executing event processes by a primary central processing unit, each of the event processes being a process executed upon the occurrence of a command at the primary system;
generating an event message each time the primary central processing unit halts the execution of one of the event processes, the event message at least including information about the type of the halted event process and the reason for halting execution of the halted event process;
transmitting each event message to at least one backup system;
at the at least one backup system;
recording and intermediately storing the event messages from the primary system in a buffer;
scheduling the execution of the event processes of the corresponding event messages at the buffer; and
executing the event processes by the backup central processing unit in accordance with the event messages; and
wherein upon detection of a software fault at the primary system, the event message includes information specifying the software fault, and the backup system skips execution of at least part of the corresponding halted event process.
Specification