Dynamic rate heartbeating for inter-node status updating
Abstract
A scheme for monitoring the operational status of nodes transmits messages among the nodes periodically according to a heartbeat rate. The messages, which may be gossip messages exchanged between pairs of nodes and containing the status of nodes other than those in the pair, are received at the nodes. Indications of the communications delays of the received messages are stored and used to compute statistics of those delays. Parameters of the node status monitoring, which are used for determining the operational status of the nodes, are adjusted according to the statistics. The adjustment may include adjusting the heartbeat rate, the maximum wait time before a message is considered missed, and/or the maximum number of missed messages (e.g., the sequence number deviation) before the node is considered non-operational (down).
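The three parameters named in the abstract (heartbeat rate, maximum wait time, and maximum number of missed messages) can be illustrated with a minimal Python sketch. It is not taken from the patent. The function names, the use of local receive timestamps, and the rule that a message counts as missed once one heartbeat period plus the maximum wait time has elapsed are all assumptions made for illustration.

```python
def missed_count(last_rx_time: float, now: float,
                 heartbeat_period: float, max_wait: float) -> int:
    """Estimate how many heartbeats have been missed since the last reception.

    Assumption (not from the patent): the next message is due one
    heartbeat_period after the last reception, and it counts as missed
    once a further max_wait has elapsed without it arriving.
    """
    overdue = now - last_rx_time - heartbeat_period - max_wait
    if overdue < 0:
        return 0
    # One message is missed at overdue == 0, one more per further period.
    return int(overdue // heartbeat_period) + 1


def is_down(last_rx_time: float, now: float, heartbeat_period: float,
            max_wait: float, max_missed: int) -> bool:
    """Treat a node as non-operational once misses exceed max_missed."""
    missed = missed_count(last_rx_time, now, heartbeat_period, max_wait)
    return missed > max_missed
```

In this sketch, all three arguments `heartbeat_period`, `max_wait`, and `max_missed` are plain inputs, so a caller could re-tune any of them from delay statistics without changing the detection logic.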
86 Citations
12 Claims
1. A method for determining node operating status among a cluster of nodes of a computer system, the method comprising:
- transmitting gossip messages directly between pairs of nodes of the cluster, the gossip messages containing an indication of operational status of nodes of the cluster other than the nodes corresponding to the pair of nodes between which the gossip message is communicated, wherein the transmitting is performed periodically according to a heartbeat rate;
- receiving the gossip messages at a receiving one of the corresponding pair of nodes and storing indications of communications delays for the gossip messages;
- computing statistics of the communications delay for the gossip messages, wherein the computing computes a mean and mean deviation of the communications delays, and wherein the mean of the communications delay is a mean round-trip communications time (MRT) computed according to the formula MRT(t) = 0.875*MRT(t−Δt) + 0.125*TMEAS, where Δt is a period corresponding to the heartbeat rate and TMEAS is the most-recently measured round-trip communications time, and wherein the mean deviation of the communications delay D is computed according to D(t) = 0.125*D(t−Δt) + 0.875*ERR, where ERR = |MRT|−TMEAS;
- adjusting parameters for node status monitoring according to the computed statistics, wherein the adjusting parameters of the node status monitoring comprises adjusting a threshold maximum number of missed receptions of the receiving used to determine whether a node is operational according to the mean round-trip communications time and the mean deviation of the communications delay; and
- monitoring the operational status of the nodes according to the indications of communications delay, the parameters, and the operational status of the other nodes in the cluster as communicated by the gossip messages.
- Dependent claims: 2, 3, 4
5. A computer system comprising a processing cluster including a plurality of physical or virtual processing nodes, the computer system comprising at least one processor for executing program instructions and at least one memory coupled to the processor for executing the program instructions, wherein the program instructions are program instructions for determining node operating status among a cluster of the physical or virtual processing nodes, the program instructions comprising program instructions for:
- transmitting gossip messages directly between pairs of nodes of the cluster, the gossip messages containing an indication of operational status of nodes of the cluster other than the nodes corresponding to the pair of nodes between which the gossip message is communicated, wherein the transmitting is performed periodically according to a heartbeat rate;
- receiving the gossip messages at a receiving one of the corresponding pair of nodes and storing indications of communications delays for the messages;
- computing statistics of the communications delay for the gossip messages, wherein the program instructions for computing compute a mean and mean deviation of the communications delays, and wherein the program instruction for computing compute the mean of the communications delay as a mean round-trip communications time (MRT) according to the formula MRT(t) = 0.875*MRT(t−Δt) + 0.125*TMEAS, where Δt is a period corresponding to the heartbeat rate and TMEAS is the most-recently measured round-trip communications time, and compute the mean deviation of the communications delay D according to D(t) = 0.125*D(t−Δt) + 0.875*ERR, where ERR = |MRT|−TMEAS;
- adjusting parameters for node status monitoring according to the computed statistics, wherein the program instructions for adjusting the parameters of the node status monitoring adjust a threshold maximum number of missed receptions of the receiving used to determine whether a node is operational according to the mean round-trip communications time and the mean deviation of the communications delay; and
- monitoring the operational status of the nodes according to the indications of communications delay, the parameters, and the operational status of the other nodes in the cluster as communicated by the gossip messages.
- Dependent claims: 6, 7, 8
9. A computer program product comprising a computer-readable storage device storing program instructions for execution within a computer system, the computer system comprising a processing cluster including a plurality of physical or virtual processing nodes, wherein the program instructions are program instructions for determining node operating status among a cluster of the physical or virtual processing nodes, the program instructions comprising program instructions for:
- transmitting gossip messages directly between pairs of nodes of the cluster, the gossip messages containing an indication of operational status of nodes of the cluster other than the nodes corresponding to the pair of nodes between which the gossip message is communicated, wherein the transmitting is performed periodically according to a heartbeat rate;
- receiving the gossip messages at a receiving one of the corresponding pair of nodes and storing indications of communications delays for the messages;
- computing statistics of the communications delay for the gossip messages, wherein the program instructions for computing compute a mean and mean deviation variance of the communications delays, and wherein the program instruction for computing compute the mean of the communications delay as a mean round-trip communications time (MRT) according to the formula MRT(t) = 0.875*MRT(t−Δt) + 0.125*TMEAS, where Δt is a period corresponding to the heartbeat rate and TMEAS is the most-recently measured round-trip communications time, and compute the mean deviation of the communications delay D according to D(t) = 0.125*D(t−Δt) + 0.875*ERR, where ERR = |MRT|−TMEAS;
- adjusting parameters for node status monitoring according to the computed statistics, wherein the program instructions for adjusting the parameters of the node status monitoring adjust a threshold maximum number of missed receptions of the receiving used to determine whether a node is operational according to the mean round-trip communications time and the mean deviation of the communications delay; and
- monitoring the operational status of the nodes according to the indications of communications delay, the parameters, and the operational status of the other nodes in the cluster as communicated by the gossip messages.
- Dependent claims: 10, 11, 12
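The recurrences recited in claims 1, 5, and 9 can be followed step by step in a short Python sketch. It is an illustration only, not the patented implementation. The function name `update_delay_stats` is hypothetical, and the claims do not say whether ERR uses MRT(t) or MRT(t−Δt), so the sketch assumes the freshly updated MRT(t). ERR is computed exactly as worded in the claims, |MRT|−TMEAS, which yields a signed value.

```python
def update_delay_stats(mrt_prev: float, d_prev: float, t_meas: float):
    """Apply one heartbeat-period update of the delay statistics.

    mrt_prev -- MRT(t - dt), the previous mean round-trip time
    d_prev   -- D(t - dt), the previous mean deviation
    t_meas   -- TMEAS, the most recently measured round-trip time
    Returns (MRT(t), D(t)).
    """
    # MRT(t) = 0.875*MRT(t - dt) + 0.125*TMEAS
    mrt = 0.875 * mrt_prev + 0.125 * t_meas
    # ERR = |MRT| - TMEAS, taken literally from the claim wording.
    # Assumption: MRT here is the updated value MRT(t).
    err = abs(mrt) - t_meas
    # D(t) = 0.125*D(t - dt) + 0.875*ERR
    d = 0.125 * d_prev + 0.875 * err
    return mrt, d


# Example: previous mean 100 ms, previous deviation 0, new sample 108 ms.
mrt, d = update_delay_stats(100.0, 0.0, 108.0)
print(mrt, d)  # 101.0 -6.125
```

With these coefficients the mean moves slowly toward each new sample (weight 0.125), while the deviation term is dominated by the latest error (weight 0.875).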
Specification