Failure detector with consensus protocol
First Claim
Patent Images
1. A method for detecting failure in a computer system comprising the steps of:
- generating and sending a heartbeat signal representative of the vitality of each processor;
sending a message from a processor each time the processor recovers from a failure;
counting the number of recoveries for each processor;
generating an epoch signal representative of the number of recoveries for each processor including the epoch number signal for each processor in its corresponding hearbeat signals.
4 Assignments
0 Petitions
Accused Products
Abstract
New failure detector mechanisms particularly suitable for use in asynchronous distributed computing systems in which processes may crash and recover, and two crash-recovery consensus mechanisms, one requiring stable storage and the other not requiring it. Both consensus mechanisms tolerate link failures and are particularly efficient in the common runs with no failures or failure detector mistakes. Consensus is achieved in such runs within 3□ time and with 4n messages, where □ is the maximum message delay and n is the number of processes in the system.
-
Citations
22 Claims
-
1. A method for detecting failure in a computer system comprising the steps of:
-
generating and sending a heartbeat signal representative of the vitality of each processor;
sending a message from a processor each time the processor recovers from a failure;
counting the number of recoveries for each processor;
generating an epoch signal representative of the number of recoveries for each processor including the epoch number signal for each processor in its corresponding hearbeat signals. - View Dependent Claims (2)
-
-
3. A fault detector for a computer system comprising:
-
means for generating and sending a hearbeat signal representative of the vitality of each processor;
means at each processor in the computer system for sending a message from each processor each time the processor recovers from a failure;
means for counting the number of recoveries for each processor;
means for generating an epoch signal representative of the number of recoveries for each processor;
means for including the epoch number signal for each processor in its corresponding heartbeat signals.
-
-
4. A method for detecting failure in a multi-process computer system comprising the steps of:
-
generating and sending a heartbeat signal representative of the vitality of each process;
generating an epoch number signal corresponding to the number of times each process recovers from a failure; and
including the epoch number signal for each process in its corresponding heartbeat signals. - View Dependent Claims (5)
-
-
6. A multi-process computer system with a heartbeat failure detector comprising:
-
means for generating and sending a heartbeat signal representative of the vitality of each process;
means for generating an epoch number signal corresponding to the number of times each process recovers from a failure; and
means for including the epoch number signal for each process in its corresponding heartbeat signals. - View Dependent Claims (7)
-
-
8. A method for reaching consensus in a multi-process computer system where processes may crash and later recover comprising:
-
selecting one of the processes as a coordinator for a consensus round;
keeping a list of recovered processes at the coordinator;
at the coordinator receiving recovery signals from a recovered process;
at the coordinator adding any newly recovered processes to the recovered process list upon receiving recovery signals;
sending a newround signal from the coordinator to each recovered process;
sending a current estimate with a time stamp from each participant in a round to the coordinator in response of the newround signal;
at the coordinator receiving current estimates and time stamps from processes at the coordinator;
at the coordinator terminating reception of further current estimates and time stamps after the coordinator receives a predetermined number of estimates;
if during the recevieng current estimate step or more new proceses are added into the coordinator'"'"'s recovered process list, then repeating the above collecting estimates step;
choosing the received estimate from the responding processes with the largest time stamp;
broadcasting from the coordinator a new estimate to all processes;
changing the estimate of each participant to the new estimate received from the coordinator;
from each participant sending an acknowledgment signal to the coordinator upon the receipt of the new estimate;
at the coordinator receiving the acknowledgements sent by the processes;
terminating reception at the coordinator of further acknowledgements from processes after the receipts exceed a certain threshold;
if the coordinator adds one or more new processes to its recovered process list during reception of the acknowledgements, then repeating the above collecting acknowledgment step;
broadcasting the current coordinator'"'"'s estimate as the decision value. - View Dependent Claims (9, 10, 11, 12, 13, 14)
generating and sending a heartbeat signal representative of the vitality of each process; generating an epoch number signal corresponding to the number of times each process recovers from a failure; and
including the epoch number signal for each process in its corresponding heartbeat signals.
-
-
10. The method of claim 9 further comprising the step of abandoning the current coordinator process if the epoch number of the current coordinator process increases.
-
11. The method of claim 9 further comprising the step of a process interrupting and terminating the consensus round if the process detects an increase in the epoch number of the coordinator, a recovery signal issued by the coordinator, or receives a message from a subsequent consensus round.
-
12. The method of claim 8 further comprising the step of periodically retransmitting its last message to the other processes.
-
13. The method of claim 8 further comprising the step of locking the decision value in all processes that are still alive.
-
14. The method of claim 13 further wherein the number of acknowledgements for terminating a round and locking a decision value is (max(nb+1, n−
- nb−
|Rp|)).
- nb−
-
15. A multi-process consensus computer system where one or more processes may crash and later recover comprising:
-
means for selecting one of the processes as a coordinator for a consensus round;
means for keeping a list of recovered processes at the coordinator;
means at the coordinator for receiving recovery signals from a recovered process;
means at the coordinator for adding recovered processes to the recovered process list upon receiving recovery signals;
means for sending a newround signal from the coordinator to each recovered process;
means for sending a current estimate with a time stamp from each participant in a round to the coordinator in response of the newround signal;
means at the coordinator for receiving current estimates and time stamps from processes at the coordinator;
means at the coordinator for terminating reception of current estimates and time stamps after the coordinator receives a predetermined number of estimates;
means for recollecting estimates if the coordinator adds one more new processes while collecting estimates;
means for choosing the estimate from the responding processes with the largest time stamp;
means for broadcasting from the coordinator a new estimate to all processes;
means for changing the estimate of each participant to the new estimate received from the coordinator;
at each participant means for sending an acknowledgment signal to the coordinator upon the receipt of the new estimate;
means at the coordinator for terminating of acknowledgements from processes after the receipts exceed a certain threshold;
means at the coordinator for repeating the request for acknowledgements when the coordinator adds one or more processes to its recovery list during reception of the acknowledgements from the processes;
broadcasting the current coordinator'"'"'s estimate as the decision value. - View Dependent Claims (16, 17, 18, 19, 20)
means for generating and sending a heartbeat signal representative of the vitality of each process; means for generating an epoch number signal corresponding to the number of times each process recovers from a failure; and
means for including the epoch number signal for each process in its corresponding heartbeat signals.
-
-
17. The multi-process computer system of claim 15 further comprising
means for abandoning the current coordinator process if the epoch number of the current coordinator process increases. -
18. The multi-process computer system of claim 15 further comprising means for periodically retransmitting its last message to the other processes.
-
19. The multi-process computer system of claim 15 further comprising
means for locking the decision value in each recovered when the number of acknowledgements received by the coordinator is (max(nb+1, n− - nb−
|Rp|)).
- nb−
-
20. The multi-process computer system of claim 15 further comprising
means for interrupting and terminating the consensus round if the process detects an increase in the epoch number of the coordinator, a recovery signal issued by the coordinator, or receives a message from a subsequent consensus round.
-
21. A failure detector for a computer network comprising:
-
a means for listing all the processors that are currently operational;
a means for generating an epoch number for each processor on the list;
a means for keeping all processors that are down off the list;
a means for increasing an epoch number if a processor is unstable or crashes permanently;
a means for augmenting the output of the failure detectors of the computer network with the epoch numbers.
-
-
22. An apparatus for solving consensus in a distributed system where processes may crash and later recover comprising:
-
means for conducting one or more consensus rounds where each round has a designated coordinator for imposing a common decision value on processes participating in the round;
means at each participant for monitoring a heartbeat signal of the coordinator and for terminating the consensus round if the coordinator'"'"'s heartbeat signal indicates the coordinator is dead or unstable;
a lock mechanism for locking a value at each participating process when the coordinator imposes its decision value to some process.
-
Specification