Monitoring distributed software health and membership in a compute cluster
First Claim
1. A method for monitoring health and membership of distributed software in a compute cluster having a plurality of nodes, comprising:
generating an ordered list of all nodes in the plurality of nodes configured to operate in the compute cluster;
making the ordered list available to each of the plurality of nodes, each of the plurality of nodes having a watchdog component configured to perform health checks and membership checks on other nodes in the compute cluster;
performing a health check by each node in the plurality of nodes using the watchdog component, the health check comprising:
checking a health status of a first neighbor node in a first direction of the node in the ordered list of nodes; and
performing a first action on the neighbor node responsive to determining that the health status of the first neighbor node is unhealthy; and
performing a membership check by each node, using the watchdog component, on a second neighbor node in a second direction opposite the first direction, the membership check comprising:
verifying membership in the compute cluster of a second neighbor node; and
performing a second action on the second neighbor node responsive to determining that the second neighbor node is not a member of the compute cluster;
wherein the ordered list provides a circular sequence of nodes traversable in either the first direction or the second direction.
Abstract
Techniques for monitoring distributed software health and membership of nodes and software components operating in a compute cluster are disclosed. In one embodiment, each node in the compute cluster operates a watchdog monitoring component in addition to software operating components. The watchdogs are provided with a list of all nodes in a compute cluster that identifies every node's neighboring nodes. Each watchdog checks the health of one of its neighboring nodes, ensuring that this neighboring node is healthy and is operating successfully. Additionally, each watchdog verifies the cluster membership of its other neighboring nodes to ensure that the cluster is operating an adequate number of operating nodes, and that an adequate number of watchdogs are present in the cluster. If an unhealthy or non-member node is identified, the watchdog may initiate corrective action and attempt to restore the node to a correct operational state.
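The neighbor relationship described in the abstract can be illustrated with a minimal sketch. It assumes the ordered list is a plain Python list treated as circular, that the "first direction" is the next index and the "second direction" is the previous index, and that nodes are identified by strings. The function name `ring_neighbors` and the node names are hypothetical and do not come from the patent.

```python
from typing import List, Tuple


def ring_neighbors(ordered_nodes: List[str], node: str) -> Tuple[str, str]:
    """Return (first_neighbor, second_neighbor) of `node` in a circular list.

    Assumption: "first direction" means the next entry and "second
    direction" means the previous entry; the list wraps around at both ends.
    """
    i = ordered_nodes.index(node)
    n = len(ordered_nodes)
    first_neighbor = ordered_nodes[(i + 1) % n]   # target of the health check
    second_neighbor = ordered_nodes[(i - 1) % n]  # target of the membership check
    return first_neighbor, second_neighbor


# Hypothetical cluster: sorting is one simple way to give every node the same order.
nodes = sorted(["node-c", "node-a", "node-b", "node-d"])

for name in nodes:
    health_target, membership_target = ring_neighbors(nodes, name)
    print(f"{name}: health-checks {health_target}, membership-checks {membership_target}")
```

Because every node derives its neighbors from the same ordered list, each node is health-checked by exactly one neighbor and membership-checked by the other, with no central coordinator in this sketch.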
20 Claims
1. A method for monitoring health and membership of distributed software in a compute cluster having a plurality of nodes, comprising:
generating an ordered list of all nodes in the plurality of nodes configured to operate in the compute cluster;
making the ordered list available to each of the plurality of nodes, each of the plurality of nodes having a watchdog component configured to perform health checks and membership checks on other nodes in the compute cluster;
performing a health check by each node in the plurality of nodes using the watchdog component, the health check comprising:
checking a health status of a first neighbor node in a first direction of the node in the ordered list of nodes; and
performing a first action on the neighbor node responsive to determining that the health status of the first neighbor node is unhealthy; and
performing a membership check by each node, using the watchdog component, on a second neighbor node in a second direction opposite the first direction, the membership check comprising:
verifying membership in the compute cluster of a second neighbor node; and
performing a second action on the second neighbor node responsive to determining that the second neighbor node is not a member of the compute cluster;
wherein the ordered list provides a circular sequence of nodes traversable in either the first direction or the second direction.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A method for monitoring health of nodes within a compute cluster having a plurality of nodes, comprising:
obtaining a list of nodes expected to operate within the compute cluster, each node of the list of nodes having an operational watchdog component;
performing a health check, by each node using the operational watchdog component, on a first neighbor node in a first direction of each node to discover unhealthy nodes in the compute cluster;
performing a membership check by each node using the watchdog component on a second neighbor node in a second direction opposite the first direction;
restoring the unhealthy nodes in the compute cluster to a healthy state; and
repeating the health check and membership check by each node using the operational watchdog component, the health checks comprising performing the health checks by nodes restored to the healthy state,
wherein the list of nodes provides a circular sequence of nodes traversable in one of the first direction and the second direction.
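The check, restore, and repeat cycle recited in claim 10 might be simulated as follows. This is only a sketch under stated assumptions: cluster state is modeled as an in-memory dictionary and set, "restoring" a node simply flips a flag, a node that is unhealthy or not a member at the start of a round is assumed unable to run its watchdog in that round, and a non-member neighbor is re-added to the cluster (a corrective step borrowed from claim 1, not recited in claim 10). All names (`run_round`, `healthy`, `members`) are hypothetical.

```python
from typing import Dict, List, Set


def run_round(ordered_nodes: List[str],
              healthy: Dict[str, bool],
              members: Set[str]) -> List[str]:
    """Run one round of watchdog checks and return the corrective actions taken."""
    actions: List[str] = []
    n = len(ordered_nodes)
    # Assumption: only nodes that are healthy members at the start of the
    # round have a working watchdog during this round.
    active = [node for node in ordered_nodes if healthy[node] and node in members]
    for node in active:
        i = ordered_nodes.index(node)
        first = ordered_nodes[(i + 1) % n]    # health-check target
        second = ordered_nodes[(i - 1) % n]   # membership-check target
        if not healthy[first]:
            healthy[first] = True             # stand-in for a real restart or repair
            actions.append(f"{node} restored {first}")
        if second not in members:
            members.add(second)               # stand-in for a real rejoin procedure
            actions.append(f"{node} rejoined {second}")
    return actions


nodes = ["a", "b", "c", "d"]
healthy = {name: True for name in nodes}
healthy["b"] = False                 # "b" has failed
members = set(nodes) - {"d"}         # "d" has dropped out of the cluster

round1 = run_round(nodes, healthy, members)
round2 = run_round(nodes, healthy, members)  # restored nodes now run checks too
print(round1)
print(round2)
```

In the second round the restored node "b" and the rejoined node "d" take part in the checks themselves, which corresponds to the claim's requirement that health checks be repeated by nodes restored to the healthy state.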
11. A distributed cluster computing system, comprising:
a compute cluster comprising a plurality of nodes;
at least one processor within the distributed cluster computing system;
at least one memory store within the distributed cluster computing system having instructions operable with the at least one processor for monitoring health and membership of distributed software operating across the plurality of nodes, the instructions being executed on hardware components within the distributed cluster computing system for:
generating an ordered list of all nodes in the plurality of nodes which are configured to operate in the compute cluster;
making the ordered list available to each of the plurality of nodes, each of the plurality of nodes having a watchdog component configured to perform health checks and membership checks on other nodes in the compute cluster;
performing a health check by each node using the watchdog component, the health check comprising:
checking a health status of a first neighbor node, to a first direction, by each node; and
performing a first action on the first neighbor node responsive to determining that the health status of the first neighbor node is unhealthy; and
performing a membership check by each node using the watchdog component, the membership check including:
verifying membership in the compute cluster of a second neighbor node in a second direction, opposite of the first direction of a second neighbor node; and
performing a second action on the second neighbor node responsive to determining that the second neighbor node is not a member of the compute cluster;
wherein the ordered list provides a circular sequence of nodes traversable in either the first direction or the second direction.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19
20. A distributed cluster computing system, comprising:
a compute cluster comprising a plurality of nodes;
at least one processor within the distributed cluster computing system;
at least one memory store within the distributed cluster computing system having instructions operable with the at least one processor for monitoring health of the plurality of nodes, the instructions being executed on hardware components within the distributed cluster computing system for:
obtaining a list of nodes expected to operate within the compute cluster, each node having an operational watchdog component;
performing a health check, by each node using the operational watchdog component, on a first neighbor node in a first direction of each node to discover unhealthy nodes in the compute cluster;
performing a membership check, by each node using the watchdog component, on a second neighbor node in a second direction opposite the first direction;
restoring the unhealthy nodes in the compute cluster to a healthy state; and
repeating the health check and membership check by each node using the operational watchdog component, the health checks comprising performing the health checks by nodes restored to the healthy state,
wherein the list of nodes provides a circular sequence of nodes traversable in one of the first direction and the second direction.
Specification