Apparatus and method for fault-tolerant computing
First Claim
1. Fault-tolerant computing apparatus for use in a computer system, said apparatus comprising:
- a monitor for observing the state of a process executing on a processing unit in the computer system and restarting the process when the process is observed to be unable to continue executing, and a plurality of fault tolerant library routines, a selected one or ones of said fault tolerant library routines being invocable by said process to provide said process with a corresponding one of a plurality of degrees of fault tolerance such that said observing by said monitor is performed in accordance with said selected fault tolerant routine.
Abstract
Techniques for fault-tolerant computing which do not require fault-tolerant hardware or a fault-tolerant operating system. The techniques employ a monitor daemon which is implemented as one or more user processes and a fault-tolerant library which can be bound into application programs. A user process which is executing on ordinary hardware under an ordinary operating system is made fault tolerant by registering it with the monitor daemon. The degree of fault tolerance can be controlled by means of the fault-tolerant library. Included in the fault-tolerant library is a function which defines portions of a user process's memory as critical memory, a function which copies the critical memory to persistent storage, and a function which restores the critical memory from persistent storage. The monitor daemon monitors fault-tolerant processes, and when such a process hangs or crashes, the daemon restarts it. When the techniques are employed in a multi-node system, the monitor daemon on each node monitors one other node in addition to the processes in its own node. In addition, the monitor daemon may maintain copies of the state of fault-tolerant processes running at least on the monitored node. When the monitored node fails, the monitor daemon starts the processes from the monitored node for which the monitor daemon has state on its own node. When a node leaves or rejoins the multi-node system, what other node a given monitor daemon monitors is automatically redetermined for the new configuration of the multi-node system.
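The three library functions named in the abstract (define critical memory, copy it to persistent storage, restore it) can be illustrated with a minimal Python sketch. This is not the patented implementation: the class `CriticalMemory`, its method names `critical`, `checkpoint`, and `recover`, the use of a local file as persistent storage, and the restriction to `dict` regions are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


class CriticalMemory:
    """Hypothetical stand-in for the fault-tolerant library (names are assumed)."""

    def __init__(self, path):
        self.path = path      # persistent storage, assumed here to be a local file
        self._regions = {}    # region name -> dict declared as critical

    def critical(self, name, region):
        """Declare a dict as critical memory to be saved and restored."""
        self._regions[name] = region

    def checkpoint(self):
        """Copy all critical regions to persistent storage."""
        directory = os.path.dirname(os.path.abspath(self.path))
        # Write to a temporary file first, then rename, so that a crash in the
        # middle of a checkpoint cannot corrupt the previous checkpoint.
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self._regions, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def recover(self):
        """Restore critical regions in place; return False if no checkpoint exists."""
        if not os.path.exists(self.path):
            return False
        with open(self.path, "rb") as f:
            saved = pickle.load(f)
        for name, region in self._regions.items():
            if name in saved:
                region.clear()
                region.update(saved[name])
        return True


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.bin")

        state = {"count": 3}
        lib = CriticalMemory(path)
        lib.critical("state", state)
        lib.checkpoint()
        state["count"] = 99   # change made after the checkpoint; lost on a crash

        # Simulate the restarted process: fresh memory, same checkpoint file.
        restarted_state = {}
        lib_after_restart = CriticalMemory(path)
        lib_after_restart.critical("state", restarted_state)
        lib_after_restart.recover()
        return restarted_state
```

In this sketch the restarted process sees the state as of the last checkpoint, so how often `checkpoint` is called is one way an application could trade overhead against the amount of work lost.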
29 Claims
1. Fault-tolerant computing apparatus for use in a computer system, said apparatus comprising:
- a monitor for observing the state of a process executing on a processing unit in the computer system and restarting the process when the process is observed to be unable to continue executing, and
- a plurality of fault tolerant library routines, a selected one or ones of said fault tolerant library routines being invocable by said process to provide said process with a corresponding one of a plurality of degrees of fault tolerance such that said observing by said monitor is performed in accordance with said selected fault tolerant routine.
Dependent claims: 2, 3, 4, 5, 6, 7.
8. A computer system for fault tolerant computing comprising:
- at least one processor for executing user level processes;
- a first user level process executing on said at least one processor;
- a user level daemon process executing on said at least one processor;
- means for providing a registration message specifying said first user level process to said user level daemon process, said user level daemon process being responsive to said registration message by initiating observation of said first user level process to determine whether said first user level process is unable to continue execution and for restarting said first user level process when said first user level process is observed to be unable to continue execution.
Dependent claims: 9, 10, 11, 12, 13, 14, 15, 16.
17. A computer system for fault tolerant computing comprising:
- a plurality of nodes, each of said nodes comprising at least one processor for executing user level processes;
- a first user level process executing in a first one of said nodes;
- a first user level daemon process executing in a second one of said nodes for actively polling said first one of said nodes for determining whether said first one of said nodes is inoperable, and for restarting said first user level process on said second one of said nodes when said first user level daemon process determines that said first one of said nodes is inoperable;
- a second user process executing in said second one of said nodes, wherein said first user level daemon further observes whether said second user process is unable to continue execution and restarts said second user level process on said second one of said nodes when said second user level process is observed to be unable to continue;
- a second one of said user level daemon processes executing on said first one of said nodes, said second one of said user level daemon processes copies state information from said first user level process to said first user level daemon process, and said first user level daemon process employs said state information in restarting execution of said first user level process; and
- computer program code executed by said first user level process, said computer program code comprising a first fault tolerant library routine which, when executed, saves said state information.
18. A distributed computer system for fault tolerant computing comprising:
- a plurality of nodes, each of said nodes comprising;
  - at least one processor for executing user level processes;
  - a user level daemon process executing on said at least one processor;
  - means for providing a registration message specifying a user level process to said user level daemon, said user level daemon being responsive to said registration message by initiating observation of a specified user level process to determine whether said specified user level process is unable to continue execution and for restarting said specified user level process when said specified user level process is observed to be unable to continue executing; and
- wherein a first one of said user level daemon processes executing in a first one of said plurality of nodes further observes whether a second one of said plurality of nodes is operating.
Dependent claims: 19, 20, 21, 22, 23, 24.
25. A method for operation of a fault tolerant computer system, said system comprising at least one processor for executing user level processes, said method comprising the steps of:
- providing a registration message specifying a particular user level process which is executing on a processor to a user level daemon process, said user level daemon process being responsive to said registration message for performing the steps of;
  - monitoring said specified user level process to determine whether said specified user level process is unable to continue executing; and
  - restarting said specified user level process if said specified user level process is unable to continue executing.
Dependent claims: 26, 27, 28.
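The registration, monitoring, and restart steps of claim 25 can be sketched roughly as follows. This is an illustrative sketch only, under several assumptions that are not in the claim text: the class `MonitorDaemon` and its methods are hypothetical, registration is modeled as a method call rather than a message, the daemon launches the process itself, and "unable to continue executing" is narrowed to "the process has exited". Detecting a hung process would need an additional mechanism, such as a heartbeat, which is not shown.

```python
import subprocess
import sys


class MonitorDaemon:
    """Hypothetical user-level watchdog: register, monitor, restart."""

    def __init__(self):
        # name -> {"argv": command line, "proc": Popen handle, "restarts": count}
        self._watched = {}

    def register(self, name, argv):
        """Stand-in for the registration message: start watching `argv`."""
        proc = subprocess.Popen(argv)
        self._watched[name] = {"argv": argv, "proc": proc, "restarts": 0}

    def poll_once(self):
        """Check every registered process once; restart any that have exited."""
        restarted = []
        for name, entry in self._watched.items():
            # Popen.poll() returns None while the child is still running.
            if entry["proc"].poll() is not None:
                entry["proc"] = subprocess.Popen(entry["argv"])
                entry["restarts"] += 1
                restarted.append(name)
        return restarted

    def shutdown(self):
        """Stop all watched processes (only needed to end the demo cleanly)."""
        for entry in self._watched.values():
            if entry["proc"].poll() is None:
                entry["proc"].kill()
            entry["proc"].wait()


def demo():
    daemon = MonitorDaemon()
    # A child that fails immediately, standing in for a crashed process.
    daemon.register("worker", [sys.executable, "-c", "raise SystemExit(1)"])
    daemon._watched["worker"]["proc"].wait()   # let the child exit
    restarted = daemon.poll_once()
    daemon.shutdown()
    return restarted
```

A real daemon would call `poll_once` in a loop with a delay between passes; the single pass here keeps the example deterministic.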
29. A method for operation of a fault tolerant distributed computer system, said system comprising a plurality of nodes, each node comprising at least one processor for executing user level processes, the method comprising the steps of:
- executing a first user level daemon process on a first node, said first user level daemon process performing the steps of;
  - monitoring a second node to determine whether said second node is inoperable, wherein said second node is executing a second user level process; and
  - restarting said second user level process on said first node when it is determined that said second node is inoperable;
- executing a third user level process on said first node;
- said first user level daemon process performing the further steps of;
  - monitoring said third user level process to determine whether said third user level process is unable to continue executing;
  - restarting said third user level process if said third user level process is unable to continue executing; and
- executing a second user level daemon process on said second node;
- said second user level daemon process performs the step of;
  - copying state information from said second user level process to said first user level daemon;
- said step of restarting said second user level process on said first node further comprises the step of;
  - restarting said second user level process using said state information.
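The multi-node behavior in claim 29, together with the abstract's statement that the monitored node is redetermined when the configuration changes, can be illustrated with a small in-memory simulation. Everything here is an assumption for illustration: the source does not say how the monitored node is chosen, so the sorted-ring rule in `monitored_node` is only one possible scheme, and `NodeDaemon`, `copy_state_to`, and `take_over` are hypothetical names that model state copying and restart as plain dictionary operations.

```python
def monitored_node(nodes, me):
    """Return the node that `me` watches: its successor in sorted ring order.

    Assumed rule, not taken from the source. Recomputing this after a node
    leaves or rejoins gives the new assignment for the new configuration.
    """
    ring = sorted(nodes)
    if me not in ring or len(ring) < 2:
        return None
    return ring[(ring.index(me) + 1) % len(ring)]


class NodeDaemon:
    """Hypothetical per-node daemon holding local processes and copied state."""

    def __init__(self, name):
        self.name = name
        self.local = {}      # process name -> state of processes running here
        self.replicas = {}   # origin node -> {process name -> copied state}

    def copy_state_to(self, watcher):
        """Copy this node's process state to the daemon that watches it."""
        watcher.replicas[self.name] = {
            proc: dict(state) for proc, state in self.local.items()
        }

    def take_over(self, failed_node):
        """Restart the failed node's processes here, using the copied state."""
        recovered = self.replicas.pop(failed_node, {})
        self.local.update(recovered)
        return sorted(recovered)


def demo():
    first, second = NodeDaemon("a"), NodeDaemon("b")
    second.local["db"] = {"seq": 7}   # the process running on the second node
    second.copy_state_to(first)       # second daemon copies state to first daemon
    # The second node is now assumed inoperable; the first node takes over.
    restarted = first.take_over("b")
    return restarted, first.local["db"]
```

With nodes `a`, `b`, and `c`, this rule has `a` watch `b`; if `b` leaves, calling `monitored_node` again on the remaining nodes has `a` watch `c` instead.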
Specification