System and method for dynamic cluster adjustment to node failures in a distributed data system
First Claim
1. A method, comprising:
- a node detecting a node failure in a plurality of cluster nodes connected together to form a distributed data cluster having a topology order;
the node updating local topology data after said detecting to reflect the node failure;
if the failed node is the node'"'"'s previous node, the node sending a node dead message to its next node; and
if the failed node is the node'"'"'s next node, the node sending a node dead message to its previous node and transitioning to reconnecting state to being reconnecting to a new next node.
2 Assignments
0 Petitions
Accused Products
Abstract
A distributed system provides for separate management of dynamic cluster membership and distributed data. Nodes of the distributed system may include a state manager and a topology manager. A state manager handles data access from the cluster. A topology manager handles changes to the dynamic cluster topology. The topology manager enables operation of the state manager by handling topology changes, such as new nodes to join the cluster and node members to exit the cluster. A topology manager may follow a static topology description when handling cluster topology changes. Data replication and recovery functions may be implemented, for example to provide high availability.
-
Citations
23 Claims
-
1. A method, comprising:
-
a node detecting a node failure in a plurality of cluster nodes connected together to form a distributed data cluster having a topology order;
the node updating local topology data after said detecting to reflect the node failure;
if the failed node is the node'"'"'s previous node, the node sending a node dead message to its next node; and
if the failed node is the node'"'"'s next node, the node sending a node dead message to its previous node and transitioning to reconnecting state to being reconnecting to a new next node. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A method, comprising:
-
a node in a cluster of a plurality of nodes receiving a node dead message from one of the plurality of cluster nodes, wherein the plurality of nodes are connected together to form a distributed data cluster having a topology order;
the node updating local topology data after said receiving to reflect topology data included in the node dead message;
if its previous node sent the node dead message, the node sending a node dead message to its next node; and
if its next node sent the node dead message, the node sending a node dead message to its previous node. - View Dependent Claims (9, 10)
-
-
11. A method, comprising:
-
a node in a cluster of a plurality of nodes receiving a node dead message from one of the plurality of cluster nodes, wherein the plurality of nodes are connected together to form a distributed data cluster having a topology order, wherein the node dead message include topology data indicating a failed node;
if the failed node is the node'"'"'s previous node, the node verifying that its previous node has failed and if its previous node is active sending a connect reject message to one of the plurality of cluster nodes; and
if the failed node is the node'"'"'s next node, the node verifying that its next node has failed and if its next node is active sending a connect reject message to one of the plurality of cluster nodes. otherwise the node updating local topology data to reflect the node failure.
-
-
12. A computer system comprising a processor and memory including instructions executable by the processor for:
-
a node detecting a node failure in a plurality of cluster nodes connected together to form a distributed data cluster having a topology order;
the node updating local topology data after said detecting to reflect the node failure;
if the failed node is the node'"'"'s previous node, the node sending a node dead message to its next node; and
if the failed node is the node'"'"'s next node, the node sending a node dead message to its previous node and transitioning to reconnecting state to being reconnecting to a new next node. - View Dependent Claims (13, 14, 15, 16, 17, 18)
-
-
19. A computer system comprising a processor and memory including instructions executable by the processor for:
-
a node in a cluster of a plurality of nodes receiving a node dead message from one of the plurality of cluster nodes, wherein the plurality of nodes are connected together to form a distributed data cluster having a topology order;
the node updating local topology data after said receiving to reflect topology data included in the node dead message;
if its previous node sent the node dead message, the node sending a node dead message to its next node; and
if its next node sent the node dead message, the node sending a node dead message to its previous node. - View Dependent Claims (20, 21)
-
-
22. A computer system comprising a processor and memory including instructions executable by the processor for:
-
a node in a cluster of a plurality of nodes receiving a node dead message from one of the plurality of cluster nodes, wherein the plurality of nodes are connected together to form a distributed data cluster having a topology order, wherein the node dead message include topology data indicating a failed node;
if the failed node is the node'"'"'s previous node, the node verifying that its previous node has failed and sending a connect reject message to one of the plurality of cluster nodes if its previous node is active; and
if the failed node is the node'"'"'s next node, the node verifying that its next node has failed and sending a connect reject message to one of the plurality of cluster nodes if its next node is active.
-
-
23. A method, comprising:
-
a first and a second node detecting a node failure in a plurality of cluster nodes connected together to form a distributed data cluster having a topology order, wherein the first node is the failed node'"'"'s previous node and the second node is the failed node'"'"'s next node;
the first node and the second updating local topology data after said detecting to reflect the node failure;
the first node sending a node dead message to its previous node and transitioning to reconnecting state to being reconnecting to the second node;
the second node sending a node dead message to its previous node;
the first node in reconnecting state connecting to the second node;
after said connecting the first node transitioning to a joining state and sending a connect request message to the second node;
the first node waiting in the joining state to receive a connect complete message;
the second node receiving the connect request message from the first node;
after receiving the connect request message, the second node transitioning to a transient state and sending a node joined message to its next node including data indicating that the first node as the second node'"'"'s previous node;
the second node waiting in the transient state to receive a connect complete message from the first node;
the first node'"'"'s previous node receiving the node joined message and sending the first node a connect complete message;
upon receiving the connect complete message from its previous node, the first node sending a connect complete message to the second node and transitioning to a joined state as a member of the distributed data cluster; and
upon receiving the connect complete message from the first node, the second node transitioning to the joined state wherein the second node is connected to the first node as its previous node in the cluster topology order;
wherein in the joined state the first node is a member of the distributed data cluster in the topology order between its previous node and its next node.
-
Specification