System and method for dynamic cluster adjustment to node failures in a distributed data system

US 7,139,925 B2
Filed: 04/29/2002
Issued: 11/21/2006
Est. Priority Date: 04/29/2002
Status: Active Grant

Abstract

A distributed system provides for separate management of dynamic cluster membership and distributed data. Nodes of the distributed system may include a state manager and a topology manager. A state manager handles data access from the cluster. A topology manager handles changes to the dynamic cluster topology. The topology manager enables operation of the state manager by handling topology changes, such as new nodes to join the cluster and node members to exit the cluster. A topology manager may follow a static topology description when handling cluster topology changes. Data replication and recovery functions may be implemented, for example to provide high availability.
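The split described in the abstract can be pictured with a minimal Python sketch. The class names, method names, and the `put` return value are assumptions made for illustration; the patent text does not define an API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TopologyManager:
    """Owns cluster membership, kept in topology order (hypothetical API)."""
    members: list = field(default_factory=list)

    def node_joined(self, node_id: str, after: Optional[str] = None) -> None:
        # Place the new member right after `after`, or at the end of the order.
        if node_id in self.members:
            return
        at = self.members.index(after) + 1 if after in self.members else len(self.members)
        self.members.insert(at, node_id)

    def node_left(self, node_id: str) -> None:
        # Covers both a clean exit and a detected failure.
        if node_id in self.members:
            self.members.remove(node_id)


@dataclass
class StateManager:
    """Handles data access; never edits membership itself."""
    topology: TopologyManager
    store: dict = field(default_factory=dict)

    def put(self, key: str, value: str) -> list:
        self.store[key] = value
        # Replication targets are whatever the topology manager currently reports.
        return list(self.topology.members)
```

In this sketch the state manager only reads membership, so topology changes never require touching the data path.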

187 Citations

26 Claims

1. A method, comprising:
- a particular node detecting a node failure in a plurality of cluster nodes connected together to form a distributed data cluster having a topology order;
  
  the particular node updating local topology data after said detecting to reflect the node failure;
  
  the particular node determining between the particular node's previous node and the particular node's next node as to which one is the failed node;
  
  if the failed node corresponding to the node failure is the particular node's previous node, the particular node initiating a sequential propagation of a node dead message to other nodes of the distributed data cluster according to a first sequential ordering of nodes, wherein the first sequential ordering comprises the particular node followed by the next node of the particular node; and
  
  if the failed node is the particular node's next node, the particular node initiating a sequential propagation of a node dead message to other nodes of the distributed data cluster according to a second sequential ordering of nodes, wherein the second sequential ordering comprises the particular node followed by the previous node of the particular node; and
  
  the particular node transitioning to a reconnecting state to begin reconnecting to a new next node.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method as recited in claim 1, further comprising:
    - the particular node in the reconnecting state attempting to connect to one of the plurality of cluster nodes as its next node;
      
      upon said connecting, the particular node transitioning to a joining state and sending a connect request message to its next node; and
      
      the particular node waiting in the joining state to receive a connect complete message.
  - 3. The method as recited in claim 2, further comprising upon receiving the connect complete message from one of the plurality of cluster nodes connected to the particular node as its next node, the particular node transitioning to a joined state configured to operate as a member of the distributed data cluster, wherein in the joined state the particular node is a member of the distributed data cluster in the topology order between its previous node and its next node.
  - 4. The method as recited in claim 2, further comprising:
    - after sending the connect request message the particular node receiving a connect reject message from one of the plurality of cluster nodes, wherein the connect reject message includes data indicating a designated node;
      
      after receiving the connect reject message the particular node transitioning to the reconnecting state, connecting to the designated node, and sending a connect request message to the designated node; and
      
      the particular node waiting in the joining state to receive a connect complete message.
  - 5. The method as recited in claim 1, further comprising the particular node sending a node ping to its previous node and failing to receive the ping message from its next node before said updating local topology data.
  - 6. The method as recited in claim 1, further comprising:
    - the particular node receiving a connect request message from a cluster node in the distributed data cluster, wherein the particular node's previous node is the failed node and the cluster node is the previous node of the failed node;
      
      after receiving the connect request message, the particular node transitioning to a transient state and sending a node joined message to its next node including topology data indicating the cluster node as the previous node of the particular node;
      
      the particular node waiting in the transient state to receive a connect complete message from the cluster node; and
      
      upon receiving the connect complete message from the cluster node, the particular node transitioning to a joined state wherein the particular node is connected to the cluster node as its previous node in the cluster topology order.
  - 7. The method as recited in claim 6, wherein the connect complete message includes data indicating each node in the plurality of cluster nodes.
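
The detection step of claim 1 and the state transitions of claims 2 to 4 can be sketched as follows. This is an illustrative Python sketch, not the patented implementation: the `Node` class, the `outbox` list standing in for a network, and the message dictionaries are all assumptions. Following the wording of claim 1, the node enters the reconnecting state after either branch.

```python
from enum import Enum, auto


class State(Enum):
    JOINED = auto()
    RECONNECTING = auto()
    JOINING = auto()


class Node:
    def __init__(self, node_id, ring):
        self.node_id = node_id
        self.topology = list(ring)   # local topology data, in topology order
        self.state = State.JOINED
        self.outbox = []             # (destination, message) pairs "sent" so far

    def neighbor(self, offset):
        i = self.topology.index(self.node_id)
        return self.topology[(i + offset) % len(self.topology)]

    def on_failure_detected(self, failed):
        prev, nxt = self.neighbor(-1), self.neighbor(+1)
        if failed not in (prev, nxt):
            raise ValueError("this sketch only handles a failed neighbor")
        self.topology.remove(failed)              # update local topology data
        # Previous node failed: propagate toward next. Next failed: toward previous.
        target = nxt if failed == prev else prev
        if target != failed:                      # two-node ring: nobody left to tell
            self.outbox.append((target, {"type": "NODE_DEAD", "failed": failed}))
        self.state = State.RECONNECTING           # final step of claim 1

    def on_connected(self, new_next):
        # Claim 2: once connected, move to joining and send a connect request.
        if self.state is not State.RECONNECTING:
            raise RuntimeError("connect is only expected while reconnecting")
        self.state = State.JOINING
        self.outbox.append((new_next, {"type": "CONNECT_REQUEST"}))

    def on_connect_reject(self, designated):
        # Claim 4: back to reconnecting, connect to the designated node, request again.
        self.state = State.RECONNECTING
        self.on_connected(designated)

    def on_connect_complete(self):
        # Claim 3: a connect complete message ends the joining state.
        if self.state is not State.JOINING:
            raise RuntimeError("connect complete is only expected while joining")
        self.state = State.JOINED
```

For a ring A, B, C, D in which C fails, node B (whose next node was C) queues its node dead message for A, while node D (whose previous node was C) also queues its message for A, its next node.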

8. A method, comprising:
- a particular node in a cluster of a plurality of nodes receiving a node dead message from one of the plurality of cluster nodes, wherein the plurality of nodes are connected together to form a distributed data cluster having a topology order;
  
  the particular node updating local topology data after said receiving to reflect topology data included in the node dead message;
  
  the particular node determining between its previous node and its next node as to which one sent the node dead message;
  
  if its previous node sent the node dead message, the particular node initiating a sequential propagation of a node dead message to other nodes of the distributed data cluster according to a first sequential ordering of nodes, wherein the first sequential ordering comprises the particular node followed by the next node of the particular node; and
  
  if its next node sent the node dead message, the particular node initiating a sequential propagation of a node dead message to other nodes of the distributed data cluster according to a second sequential ordering of nodes, wherein the second sequential ordering comprises the particular node followed by the previous node of the particular node.
- Dependent claims (9, 10):
  - 9. The method as recited in claim 8, wherein the node dead message includes data indicating the one of the plurality of cluster nodes that sent the node dead message to the particular node.
  - 10. The method as recited in claim 8, wherein the particular node appends data identifying the particular node to the node dead message before said initiating.
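
The forwarding rule of claims 8 to 10 can be sketched in Python as below. The function name and the message layout (a `failed` field plus a `path` list whose last entry is the sender) are assumptions. The stop condition, which ends propagation once the next target already appears in the path, is also an assumption of this sketch; the claims do not state how propagation terminates.

```python
def on_node_dead(node_id, topology, message):
    """Process a node dead message.

    Returns (updated_topology, forward_to, outgoing_message); forward_to is
    None when this sketch decides that propagation stops here.
    """
    failed = message["failed"]
    sender = message["path"][-1]                 # claim 9: message names its sender
    # Update local topology data to reflect the reported failure.
    topology = [n for n in topology if n != failed]
    i = topology.index(node_id)
    prev, nxt = topology[i - 1], topology[(i + 1) % len(topology)]
    if sender == prev:
        target = nxt                             # came from previous: pass to next
    elif sender == nxt:
        target = prev                            # came from next: pass to previous
    else:
        raise ValueError("node dead message must come from a neighbor")
    # Claim 10: append this node's identity before passing the message on.
    outgoing = {"failed": failed, "path": message["path"] + [node_id]}
    if target in outgoing["path"]:
        target = None                            # assumed stop rule, see lead-in
    return topology, target, outgoing
```

With a ring A, B, C, D, E in which C fails and B sends the first message to A, the message travels A, then E, then D, and stops at D because B is already recorded in the path.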

11. A method, comprising:
- a particular node in a cluster of a plurality of nodes receiving a node dead message from one of the plurality of cluster nodes, wherein the plurality of nodes are connected together to form a distributed data cluster having a topology order, wherein the node dead message includes topology data identifying a given node as a failed node;
  
  if the given node is the particular node's previous node, the particular node checking whether the previous node has failed, and if the previous node has not failed, the particular node sending a connect reject message to one of the plurality of cluster nodes, wherein the connect reject message indicates that a failure was incorrectly declared;
  
  if the given node is the particular node's next node, the particular node checking whether the next node has failed and if the next node has not failed, the particular node sending a connect reject message to one of the plurality of cluster nodes, wherein the connect reject message indicates that a failure was incorrectly declared; and
  
  otherwise, the particular node updating local topology data to reflect a node failure.
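
The check in claim 11 can be sketched as a small Python function. The `is_alive` callback is a hypothetical liveness probe (for example a ping), and the shape of the reject message is an assumption. The claim does not say which node receives the connect reject message, so the sketch returns the message without a destination.

```python
def check_node_dead(node_id, topology, failed, is_alive):
    """Return (topology, reject_message); reject_message is None if the failure is accepted."""
    i = topology.index(node_id)
    prev, nxt = topology[i - 1], topology[(i + 1) % len(topology)]
    # Only a direct neighbor can be verified by this node.
    if failed in (prev, nxt) and is_alive(failed):
        reject = {
            "type": "CONNECT_REJECT",
            "reason": "failure incorrectly declared",
            "node": failed,
        }
        return list(topology), reject            # topology is left unchanged
    # Otherwise accept the report and update local topology data.
    return [n for n in topology if n != failed], None
```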

12. A computer system, comprising a processor and memory including instructions executable by the processor for:
- a particular node detecting a node failure in a plurality of cluster nodes connected together to form a distributed data cluster having a topology order;
  
  the particular node updating local topology data after said detecting to reflect the node failure;
  
  the particular node determining between the particular node's previous node and the particular node's next node as to which one is the failed node;
  
  if the failed node corresponding to the node failure is the particular node's previous node, the particular node initiating a sequential propagation of a node dead message to other nodes of the distributed data cluster according to a first sequential ordering of nodes, wherein the first sequential ordering comprises the particular node followed by the next node of the particular node; and
  
  if the failed node is the particular node's next node, the particular node initiating a sequential propagation of a node dead message to other nodes of the distributed data cluster according to a second sequential ordering of nodes, wherein the second sequential ordering comprises the particular node followed by the previous node of the particular node; and
  
  the particular node transitioning to a reconnecting state to begin reconnecting to a new next node.
- Dependent claims (13, 14, 15, 16, 17, 18):
  - 13. The computer system as recited in claim 12, wherein the instructions are further executable by the processor for:
    - the particular node in the reconnecting state attempting to connect to one of the plurality of cluster nodes as its next node;
      
      upon said connecting, the particular node transitioning to a joining state and sending a connect request message to its next node; and
      
      the particular node waiting in the joining state to receive a connect complete message.
  - 14. The computer system as recited in claim 13, wherein the instructions are further executable by the processor for:
    - upon receiving the connect complete message from one of the plurality of cluster nodes connected to the particular node as its next node, the particular node transitioning to a joined state configured to operate as a member of the distributed data cluster, wherein in the joined state the particular node is a member of the distributed data cluster in the topology order between its previous node and its next node.
  - 15. The computer system as recited in claim 13, wherein the instructions are further executable by the processor for:
    - after sending the connect request message the particular node receiving a connect reject message from one of the plurality of cluster nodes, wherein the connect reject message includes data indicating a designated node;
      
      after receiving the connect reject message the particular node transitioning to reconnecting state, connecting to the designated node, and sending a connect request message to the designated node; and
      
      the particular node waiting in the joining state to receive a connect complete message.
  - 16. The computer system as recited in claim 12, wherein the instructions are further executable by the processor for:
    - the particular node sending a node ping to its previous node and failing to receive the ping message from its next node before said updating local topology data.
  - 17. The computer system as recited in claim 12, wherein the instructions are further executable by the processor for:
    - the particular node receiving a connect request message from a cluster node in the distributed data cluster, wherein the particular node's previous node is the failed node and the cluster node is the previous node of the failed node;
      
      after receiving the connect request message, the particular node transitioning to a transient state and sending a node joined message to its next node including topology data indicating the cluster node as the previous node of the particular node;
      
      the particular node waiting in the transient state to receive a connect complete message from the cluster node; and
      
      upon receiving the connect complete message from the cluster node, the particular node transitioning to the joined state wherein the node is connected to the cluster node as its previous node in the cluster topology order.
  - 18. The computer system as recited in claim 17, wherein the connect complete message includes data indicating each node in the plurality of cluster nodes.

19. A computer system, comprising a processor and memory including instructions executable by the processor for:
- a particular node in a cluster of a plurality of nodes receiving a node dead message from one of the plurality of cluster nodes, wherein the plurality of nodes are connected together to form a distributed data cluster having a topology order;
  
  the particular node updating local topology data after said receiving to reflect topology data included in the node dead message;
  
  the particular node determining between its previous node and its next node as to which one sent the node dead message;
  
  if its previous node sent the node dead message, the particular node continuing a sequential propagation of the node dead message to other nodes of the distributed data cluster according to a first sequential ordering of nodes, wherein the first sequential ordering comprises the particular node followed by the next node of the particular node; and
  
  if its next node sent the node dead message, the particular node continuing a sequential propagation of the node dead message to other nodes of the distributed data cluster according to a second sequential ordering of nodes, wherein the second sequential ordering comprises the particular node followed by the previous node of the particular node.
- Dependent claims (20, 21):
  - 20. The system as recited in claim 19, wherein the node dead message includes data indicating the one of the plurality of cluster nodes that sent the node dead message to the particular node.
  - 21. The system as recited in claim 19, wherein the particular node appends data identifying the particular node to the node dead message before said continuing.

22. A computer system comprising a processor and memory including instructions executable by the processor for:
- a particular node in a cluster of a plurality of nodes receiving a node dead message from one of the plurality of cluster nodes, wherein the plurality of nodes are connected together to form a distributed data cluster having a topology order, wherein the node dead message includes topology data identifying a given node as a failed node;
  
  if the given node is the particular node's previous node, the particular node checking whether the previous node has failed, and if the previous node has not failed, the particular node sending a connect reject message to one of the plurality of cluster nodes, wherein the connect reject message indicates that a failure was incorrectly declared; and
  
  if the given node is the particular node's next node, the particular node checking whether the next node has failed, and if the next node has not failed, the particular node sending a connect reject message to one of the plurality of cluster nodes, wherein the connect reject message indicates that a failure was incorrectly declared.

23. A method, comprising:
- a first node and a second node detecting a node failure in a plurality of cluster nodes connected together to form a distributed data cluster having a topology order, wherein the first node is a failed node's previous node and the second node is the failed node's next node;
  
  the first node and the second updating local topology data after said detecting to reflect the node failure;
  
  the first node sending a node dead message to its previous node and transitioning to a reconnecting state to begin reconnecting to the second node;
  
  the second node sending a node dead message to its next node;
  
  the first node in the reconnecting state connecting to the second node;
  
  after said connecting the first node transitioning to a joining state and sending a connect request message to the second node;
  
  the first node waiting in the joining state to receive a connect complete message;
  
  the second node receiving the connect request message from the first node;
  
  after receiving the connect request message, the second node transitioning to a transient state and sending a node joined message to its next node including data indicating that the first node as the second node's previous node;
  
  the second node waiting in the transient state to receive a connect complete message from the first node;
  
  the first node's previous node receiving the node joined message and sending the first node a connect complete message;
  
  upon receiving the connect complete message from its previous node, the first node sending a connect complete message to the second node and transitioning to a joined state as a member of the distributed data cluster; and
  
  upon receiving the connect complete message from the first node, the second node transitioning to the joined state wherein the second node is connected to the first node as its previous node in the cluster topology order;
  
  wherein in the joined state the first node is a member of the distributed data cluster in the topology order between its previous node and its next node.
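
The message sequence of claim 23 can be replayed with a short Python sketch. The function name, the tuple-based message log, and the state strings are assumptions. The sketch also assumes that the node joined message is forwarded hop by hop from the second node's next node until it reaches the first node's previous node; the claim only states its first and last hop.

```python
def repair_ring(ring, failed):
    """Replay the claim 23 sequence; return (message_log, final_states)."""
    i = ring.index(failed)
    first, second = ring[i - 1], ring[(i + 1) % len(ring)]
    ring = [n for n in ring if n != failed]      # both detectors update topology
    if len(ring) < 3:
        raise ValueError("sketch assumes at least three surviving nodes")
    state = {n: "joined" for n in ring}
    log = []                                     # (sender, receiver, message kind)

    def nxt(n):
        return ring[(ring.index(n) + 1) % len(ring)]

    def prv(n):
        return ring[ring.index(n) - 1]

    # Each detector announces the failure away from the gap.
    log.append((first, prv(first), "node_dead"))
    state[first] = "reconnecting"
    log.append((second, nxt(second), "node_dead"))

    # The first node connects to the second node and asks to join.
    state[first] = "joining"
    log.append((first, second, "connect_request"))

    # The second node waits in the transient state while node joined circulates.
    state[second] = "transient"
    hop = second
    while nxt(hop) != first:
        log.append((hop, nxt(hop), "node_joined"))
        hop = nxt(hop)

    # hop is now the first node's previous node, which completes the handshake.
    log.append((hop, first, "connect_complete"))
    log.append((first, second, "connect_complete"))
    state[first] = "joined"
    state[second] = "joined"
    return log, state
```

For a ring A, B, C, D, E in which C fails, B is the first node and D is the second node, and the log ends with A sending connect complete to B, followed by B sending connect complete to D.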

24. A computer-readable storage medium, comprising program instructions, wherein the instructions are computer-executable to:
- at a particular node, detect a node failure in a plurality of cluster nodes connected together to form a distributed data cluster having a topology order;
  
  update local topology data at the particular node to reflect the node failure;
  
  determine between the particular node's previous node and the particular node's next node as to which one is the failed node;
  
  if the failed node corresponding to the node failure is the particular node's previous node, initiate a sequential propagation of a node dead message from the particular node to other nodes of the distributed data cluster according to a first sequential ordering of nodes, wherein the first sequential ordering comprises the particular node followed by the next node of the particular node; and
  
  if the failed node is the particular node's next node, initiate a sequential propagation of a node dead message from the particular node to other nodes of the distributed data cluster according to a second sequential ordering of nodes, wherein the second sequential ordering comprises the particular node followed by the next node of the particular node.

25. A method, comprising:
- a particular node in a cluster of a plurality of nodes receiving a node dead message from one of the plurality of cluster nodes, wherein the plurality of nodes are connected together to form a distributed data cluster having a topology order, and wherein the node dead message includes data indicating the one of the plurality of cluster nodes that sent the node dead message to the particular node;
  
  the particular node updating local topology data after said receiving to reflect topology data included in the node dead message;
  
  the particular node determining between its previous node and its next node as to which one sent the node dead message;
  
  if its previous node sent the node dead message, the particular node sending a node dead message to the next node of the particular node; and
  
  if its next node sent the node dead message, the particular node sending a node dead message to the previous node of the particular node.

26. A computer system, comprising a processor and memory including instructions executable by the processor for:
- a particular node in a cluster of a plurality of nodes receiving a node dead message from one of the plurality of cluster nodes, wherein the plurality of nodes are connected together to form a distributed data cluster having a topology order, and wherein the node dead message includes data indicating the one of the plurality of cluster nodes that sent the node dead message to the particular node;
  
  the particular node updating local topology data after said receiving to reflect topology data included in the node dead message;
  
  the particular node determining between its previous node and its next node as to which one sent the node dead message;
  
  if its previous node sent the node dead message, the particular node sending a node dead message to the next node of the particular node; and
  
  if its next node sent the node dead message, the particular node sending a node dead message to the previous node of the particular node.

Current Assignee: Oracle America, Inc. (Oracle Corporation)
Original Assignee: Sun Microsystems Incorporated (Oracle Corporation)
Inventors: Kannan, Mahesh; Dinker, Darpan; Gopinath, Pramod
Primary Examiner(s): Bonzo, Bryce P.
Assistant Examiner(s): Puente, Emerson C
Application Number: US10/134,782
Publication Number: US 20030204786A1
Time in Patent Office: 1,667 Days
Field of Search: 714/4, 714/717, 709/223, 709/224, 370/242
US Class Current: 714/4.3
CPC Class Codes:
H04L 41/12   Discovery or management of ...
H04L 45/02   Topology update or discovery
H04L 45/22   Alternate routing
H04L 45/28   using route fault recovery
H04L 45/46   Cluster building
H04L 67/1001   for accessing one among a p...
H04L 67/1034   Reaction to server failures...