Methods and apparatus for process replication/recovery in a distributed system
First Claim
1. A method of executing a process submitted to a host machine in a distributed computing system, the method comprising the steps of:
- starting at least first and second executions of the process on respective first and second remote machines in the distributed computing system;
utilizing one of the first and second executions of the process to provide an on-demand checkpoint for the other execution of the process in the event the other execution is terminated, such that an additional execution of the process is started from the on-demand checkpoint; and
taking a periodic checkpoint of at least one of the first and second executions of the process, the process thereby being protected by a combination of on-demand checkpointing and periodic checkpointing.
7 Assignments
0 Petitions
Accused Products
Abstract
A distributed computing system includes a number of computers, workstations or other computing machines interconnected by a network. A non-interactive process arriving in a host machine of the system is migrated for execution to at least two remote machines. For example, first and second executions of the process may be started on respective first and second remote machines. One of the first and second executions of the process is then used to provide an on-demand checkpoint for the other execution of the process in the event the other execution is terminated, such that an additional execution of the process can be started from the on-demand checkpoint. This on-demand checkpointing is augmented with periodic checkpointing performed on at least one of the multiple executions of the process. The period of the periodic checkpointing for a given execution of the process may be fixed without regard to the status of the on-demand checkpointing for that execution, or alternatively may be reset each time an on-demand checkpoint is taken for that execution.
49 Citations
24 Claims
-
1. A method of executing a process submitted to a host machine in a distributed computing system, the method comprising the steps of:
-
starting at least first and second executions of the process on respective first and second remote machines in the distributed computing system; utilizing one of the first and second executions of the process to provide an on-demand checkpoint for the other execution of the process in the event the other execution is terminated, such that an additional execution of the process is started from the on-demand checkpoint; and taking a periodic checkpoint of at least one of the first and second executions of the process, the process thereby being protected by a combination of on-demand checkpointing and periodic checkpointing. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. A distributed computing system comprising:
-
a host machine; and at least first and second remote machines; wherein the host machine is operative to start at least first and second executions of a process on the respective first and second remote machines, and one of the first and second executions of the process are used to provide an on-demand checkpoint for the other execution of the process in the event the other execution is terminated, such that an additional execution of the process is started from the on-demand checkpoint, and wherein the host machine is further operative to take a periodic checkpoint of at least one of the first and second executions of the process, the process thereby being protected by a combination of on-demand checkpointing and periodic checkpointing. - View Dependent Claims (10, 11, 12, 13, 14, 15, 16)
-
-
17. An apparatus for use in executing a process submitted to a host machine in a distributed computing system, the apparatus comprising:
-
means for starting at least first and second executions of the process on respective first and second remote machines in the distributed computing system; means for utilizing one of the first and second executions of the process to provide an on-demand checkpoint for the other execution of the process in the event the other execution is terminated, such that an additional execution of the process is started from the on-demand checkpoint; and means for taking a periodic checkpoint of at least one of the first and second executions of the process; the process thereby being protected by a combination of on-demand checkpointing and periodic checkpointing.
-
-
18. An apparatus for use in executing a process submitted to a host machine in a distributed computing system, the apparatus comprising:
-
a memory in the host machine, the memory including a process queue for storing information related to the process; and a processor coupled to the memory, wherein the processor is operative to start at least first and second executions of the process on respective first and second remote machines of the system, wherein one of the first and second executions of the process are used to provide an on-demand checkpoint for the other execution of the process in the event the other execution is terminated, such that an additional execution of the process is started from the on-demand checkpoint, and wherein the processor is further operative to initiate a periodic checkpoint of at least one of the corresponding first and second executions of the process, the process thereby being protected by a combination of on-demand checkpointing and periodic checkpointing. - View Dependent Claims (19, 20)
-
-
21. A method of executing a process submitted to a host machine in a distributed computing system, the method comprising the steps of:
-
starting at least first and second executions of the process on respective first and second remote machines in the distributed computing system; utilizing one of the first and second executions of the process to provide an on-demand checkpoint for the other execution of the process in the event the other execution is terminated, such that an additional execution of the process is started from the on-demand checkpoint; and taking a periodic checkpoint of at least one of the first and second executions of the process; wherein the other execution is terminated due to (i) a failure of the corresponding one of the remote machines, or (ii) arrival of an interactive process at the corresponding one of the remote machines, and the additional execution of the process is restarted on a different machine.
-
-
22. A method of executing a process submitted to a host machine in a distributed computing system, the method comprising the steps of:
-
starting at least first and second executions of the process on respective first and second remote machines in the distributed computing system; utilizing one of the first and second executions of the process to provide an on-demand checkpoint for the other execution of the process in the event the other execution is terminated, such that an additional execution of the process is started from the on-demand checkpoint; taking a periodic checkpoint of at least one of the first and second executions of the process; receiving in the host machine an indication that one of the executions of the process has been terminated; initiating the checkpoint by sending a checkpoint request message from the host machine to the remote machine at which the execution of the process is continuing; if a message indicating that the requested checkpoint has been taken is received by the host machine within a designated time out period, starting the additional execution of the process on one of;
(i) the same remote machine, and (ii) another remote machine; andif the message indicating that the requested checkpoint has been taken is not received by the host machine within the designated time out period, restarting the first and second executions of the process on the first and second remote machines.
-
-
23. A distributed computing system comprising:
-
a host machine; and at least first and second remote machines; wherein the host machine is operative to start at least first and second executions of a process on the respective first and second remote machines, and one of the first and second executions of the process are used to provide an on-demand checkpoint for the other execution of the process in the event the other execution is terminated, such that an additional execution of the process is started from the on-demand checkpoint, and wherein the host machine is further operative to take a periodic checkpoint of at least one of the first and second executions of the process, and further wherein the other execution is terminated due to (i) a failure of the corresponding one of the machines, or (ii) arrival of an interactive process at the corresponding one of the machines, and the additional execution of the process is restarted on a different machine.
-
-
24. A distributed computing system comprising:
-
a host machine; and at least first and second remote machines; wherein the host machine is operative to start at least first and second executions of a process on the respective first and second remote machines, and one of the first and second executions of the process are used to provide an on-demand checkpoint for the other execution of the process in the event the other execution is terminated, such that an additional execution of the process is started from the on-demand checkpoint, and wherein the host machine is further operative to take a periodic checkpoint of at least one of the first and second executions of the process, and further wherein the host machine is further operative;
(i) to receive an indication that one of the executions of the process has been terminated;
(ii) to initiate the checkpoint by sending a checkpoint request message to the remote machine at which the execution of the process is continuing;
(iii) if a message indicating that the requested checkpoint has been taken is received within a designated time out period, to start the additional execution of the process on one of;
(a) the same remote machine, and (b) another remote machine; and
(iv) if the message indicating that the requested checkpoint has been taken is not received within the designated time out period, restarting the first and second executions of the process on the first and second remote machines.
-
Specification