Maintaining process group membership for node clusters in high availability computing systems
Abstract
A high availability computing system includes a plurality of computer nodes (for example, a server system) connected by a first and a second network, wherein the computer nodes communicate with each other to detect server failure and transfer applications to other computer nodes on detecting server failure. The system incorporates methods of maintaining high availability in a server cluster having a plurality of nodes. A group communications service, a membership service and a system resource manager are instantiated on each node and the group communications service, the membership service and the system resource manager on each node communicate with other nodes to detect node failures and to transfer applications to other nodes on detecting node failure.
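The failover behavior summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Node` class, the `fail_over` function, and the least-loaded placement policy are all assumptions introduced here for illustration, and failure detection is reduced to a simple `alive` flag.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node (not from the patent text)."""
    name: str
    alive: bool = True
    apps: list = field(default_factory=list)


def fail_over(nodes):
    """Move applications off failed nodes onto surviving nodes.

    Placement policy (an assumption): each application goes to the
    surviving node that currently hosts the fewest applications.
    Returns a dict mapping application name to its new host name.
    """
    survivors = [n for n in nodes if n.alive]
    if not survivors:
        raise RuntimeError("no surviving node to host applications")
    moves = {}
    for failed in (n for n in nodes if not n.alive):
        while failed.apps:
            app = failed.apps.pop(0)
            target = min(survivors, key=lambda n: len(n.apps))
            target.apps.append(app)
            moves[app] = target.name
    return moves
```

In the system described, the failure signal would come from the group communications and membership services on each node rather than from a flag set by hand.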
17 Claims
1. A computing system comprising:
a plurality of nodes connected by a network;
a cluster membership service operating on the plurality of nodes, the cluster membership service operable to determine membership in a cluster by exchanging messages, wherein a message originating from a node includes a node data area defining the node's view of the cluster relationships and wherein the message includes a checkmark data structure in which each node receiving the message sets the checkmark data structure according to whether the receiving node confirms the relationship defined in the node data area;
a group membership service operable to determine membership in a group of nodes formed by a subset of nodes in the cluster of a process executing on a node in the plurality of nodes the group of nodes for an application distributed across two or more of the nodes in the group, said membership communicated between the two or more nodes in the group utilizing a proposal message sent by a coordinator node for receipt by each node in the group and a commit message sent by the coordinator node to each node in the group after receiving acknowledgement that the proposal message has reached each node of the group, and further wherein the plurality of nodes in the group communicate with each other to detect a failure of an application in the group on a first node of the cluster and to transfer applications from the first node to other nodes of the plurality of nodes in the group on detecting the failure.
Dependent claims: 4, 5, 6, 7
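The checkmark mechanism recited in claim 1 can be pictured with a small Python sketch. This is one illustrative reading, not the claimed implementation: `MembershipMessage`, `receive`, and `confirmed_by_all` are hypothetical names, the node data area is modeled as a set of node names, and "confirms" is assumed to mean that the receiver's own view equals the originator's view.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MembershipMessage:
    """Hypothetical message layout; the field names are assumptions."""
    origin: str
    node_data: frozenset          # originator's view of the cluster members
    checkmarks: dict = field(default_factory=dict)  # receiver name -> bool


def receive(msg, receiver, local_view):
    """The receiving node sets its checkmark according to whether it
    confirms the view carried in the node data area (assumed: equality)."""
    msg.checkmarks[receiver] = frozenset(local_view) == msg.node_data


def confirmed_by_all(msg):
    """True once every node named in the view, other than the originator,
    has set a positive checkmark."""
    others = msg.node_data - {msg.origin}
    return all(msg.checkmarks.get(name, False) for name in others)
```

A message that returns to its originator with a missing or negative checkmark indicates that at least one node does not share the originator's view of the cluster.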
2. A method of maintaining high availability in a server cluster having a plurality of nodes, the method comprising:
determining membership by a cluster membership service in a cluster by exchanging messages, wherein a message originating from a node includes a node data area defining the node's view of the cluster relationships and wherein the message includes a checkmark data structure in which each node receiving the message sets the checkmark data structure according to whether the receiving node confirms the relationship defined in the node data area;
instantiating a group communications service, a group membership service and a system resource manager on each node of the plurality of nodes, the plurality of nodes forming a group;
communicating process membership in the group utilizing a proposal message sent by a coordinator node for receipt by each node in the plurality of nodes and a commit message sent by the coordinator node to each node in the plurality of nodes after receiving acknowledgement that the proposal message has reached each node of the plurality of nodes;
communicating between the group communications service, the group membership service and the system resource manager on each node of the group to detect process failures and node failures within the group;
upon detecting a failure in a process on a first node of the group, transferring applications to other nodes of the group; and
updating, by the group membership service, process membership in a distributed application upon detecting a process failure on a node of the group.
Dependent claims: 8, 9, 10, 11, 12
3. A computer-readable medium having instructions stored thereon, wherein the instructions, when executed in a computer, perform operations comprising:
determining membership by a cluster membership service in a cluster by exchanging messages, wherein a message originating from a node includes a node data area defining the node's view of the cluster relationships and wherein the message includes a checkmark data structure in which each node receiving the message sets the checkmark data structure according to whether the receiving node confirms the relationship defined in the node data area;
instantiating a group communications service, a group membership service and a system resource manager on each node of a plurality of nodes, the plurality of nodes forming a group;
communicating process membership in the group utilizing a proposal message including data defining one or more relationships between the plurality of nodes sent by a coordinator node for receipt by each node in the plurality of nodes and a commit message sent by the coordinator node to each node in the plurality of nodes after receiving acknowledgement that the proposal message has reached each node of the plurality of nodes;
communicating between the group communications service, the group membership service and the system resource manager on each node of the group to detect process failures and node failures within the group;
upon detecting a failure in a process on a first node of the group, transferring applications to other nodes of the group; and
updating, by the group membership service, process membership in a distributed application upon detecting a process failure on a node of the group.
Dependent claims: 13, 14, 15, 16, 17
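The proposal and commit exchange recited in all three independent claims resembles a two-phase update, sketched below in Python. The `Member` class and `coordinate` function are hypothetical, delivery is simulated with direct method calls, and the handling of a missing acknowledgement (abandoning the round without committing) is an assumption, since the claims only state that the commit follows acknowledgement.

```python
class Member:
    """Hypothetical group member; `reachable` simulates message delivery."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.pending = None            # proposed membership, not yet applied
        self.committed = frozenset()   # membership currently in force

    def on_proposal(self, membership):
        """Store the proposal and acknowledge receipt (False if unreachable)."""
        if not self.reachable:
            return False
        self.pending = frozenset(membership)
        return True

    def on_commit(self):
        """Apply the previously received proposal."""
        self.committed = self.pending
        self.pending = None


def coordinate(members, proposed):
    """Coordinator side: send the proposal to every member, then send the
    commit only after every member has acknowledged the proposal."""
    acks = [m.on_proposal(proposed) for m in members]
    if not all(acks):
        return False  # assumption: no commit is sent for this round
    for m in members:
        m.on_commit()
    return True
```

Because no member applies a membership change until the commit arrives, every member that commits does so with the same proposed membership.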
Specification