Quorum-based power-down of unresponsive servers in a computer cluster
First Claim
1. A computer-implemented method for handling an unresponsive server in a cluster, the method comprising the steps of:
each server in the cluster sending a periodic message to other servers in the cluster to indicate proper function of the server sending the periodic message;
each server in the cluster receiving periodic messages from other servers in the cluster that indicate the other servers in the cluster are functioning properly;
generating a membership change message to all servers in the cluster when any of the servers in the cluster become unresponsive;
determining whether the cluster has quorum, wherein a cluster has quorum when a majority of servers in the cluster are responsive, wherein in determining the majority of servers, if there is an odd number of servers in the cluster, each server in the cluster counts as one server, and if there is an even number of servers in the cluster, each server in the cluster that is not a manager of the cluster counts as one server and the manager of the cluster counts as two servers;
receiving an indication of a server failure;
if the majority of servers in the cluster are responsive, performing the steps of:
determining whether the indication of the server failure indicates the manager of the cluster failed;
if the manager of the cluster failed, issuing at least one command to power down all unresponsive servers in the cluster, wherein a server is powered down when the server will not become responsive in the future, wherein an unresponsive server is a server that fails to send a periodic message that indicates the server is functioning properly;
if the manager of the cluster did not fail, issuing at least one command to power down a server corresponding to the received indication of server failure;
determining whether the power down of the at least one of the other servers was successful;
if the power down of the at least one of the other servers was successful, enabling the failing over any resources on the at least one of the other servers that was powered down to at least one server that is responsive; and
if the power down of the at least one of the other servers was not successful, disabling the cluster.
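The quorum rule in the claim (every server counts once in an odd-sized cluster; the manager counts twice in an even-sized cluster) can be illustrated with a minimal sketch. The function name `has_quorum`, the use of string identifiers for servers, and the set-based representation are assumptions made for this example and do not come from the patent text.

```python
def has_quorum(servers, responsive, manager):
    """Return True when the responsive servers hold a weighted majority.

    servers:    all server identifiers in the cluster (hypothetical names)
    responsive: the subset currently sending periodic messages
    manager:    identifier of the cluster manager
    """
    even_cluster = len(servers) % 2 == 0

    def weight(server):
        # In an even-sized cluster the manager counts as two servers,
        # which breaks ties; otherwise every server counts as one.
        return 2 if even_cluster and server == manager else 1

    total = sum(weight(s) for s in servers)
    live = sum(weight(s) for s in servers if s in responsive)
    # A majority means strictly more than half of the total weight.
    return 2 * live > total
```

In a four-server cluster split two against two, only the half containing the manager has quorum under this rule, so at most one half can ever issue power-down commands.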
Abstract
A quorum-based server power-down mechanism allows a manager in a computer cluster to power down unresponsive servers in a manner that assures that an unresponsive server does not become responsive again. In order for a manager in a cluster to power down servers in the cluster, the cluster must have quorum, meaning that a majority of the servers in the cluster must be responsive. If the cluster has quorum, and if the manager server did not fail, the manager causes the failed server(s) to be powered down. If the manager server did fail, the new manager causes all unresponsive servers in the cluster to be powered down. If the power-down is successful, the resources on the failed server(s) may be failed over to other servers in the cluster that were not powered down. If the power-down is not successful, the cluster is disabled.
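The decision flow in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the names `Outcome`, `handle_failure`, and the `power_down` callback are hypothetical, and returning `NO_ACTION` when quorum is absent is an assumption, since the text only describes what happens when the cluster has quorum.

```python
from enum import Enum


class Outcome(Enum):
    NO_ACTION = "no_action"                # assumption: nothing is done without quorum
    FAILOVER_ENABLED = "failover_enabled"
    CLUSTER_DISABLED = "cluster_disabled"


def handle_failure(failed_server, manager, unresponsive, quorum, power_down):
    """Decide which servers to power down and what happens afterwards.

    power_down is a hypothetical callback that takes a server identifier
    and returns True if that server was powered down successfully.
    """
    if not quorum:
        return Outcome.NO_ACTION, []

    if failed_server == manager:
        # The manager failed: power down every unresponsive server.
        targets = sorted(unresponsive)
    else:
        # The manager is alive: power down only the reported server.
        targets = [failed_server]

    # Attempt every power-down before judging overall success.
    results = [power_down(server) for server in targets]

    if all(results):
        # Resources may now be failed over to responsive servers.
        return Outcome.FAILOVER_ENABLED, targets
    return Outcome.CLUSTER_DISABLED, targets
```

Disabling the cluster on a failed power-down is the conservative branch: a server that could not be confirmed off might become responsive again while its resources are running elsewhere.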
3 Claims
1. A computer-implemented method for handling an unresponsive server in a cluster, the method comprising the steps of:
each server in the cluster sending a periodic message to other servers in the cluster to indicate proper function of the server sending the periodic message;
each server in the cluster receiving periodic messages from other servers in the cluster that indicate the other servers in the cluster are functioning properly;
generating a membership change message to all servers in the cluster when any of the servers in the cluster become unresponsive;
determining whether the cluster has quorum, wherein a cluster has quorum when a majority of servers in the cluster are responsive, wherein in determining the majority of servers, if there is an odd number of servers in the cluster, each server in the cluster counts as one server, and if there is an even number of servers in the cluster, each server in the cluster that is not a manager of the cluster counts as one server and the manager of the cluster counts as two servers;
receiving an indication of a server failure;
if the majority of servers in the cluster are responsive, performing the steps of:
determining whether the indication of the server failure indicates the manager of the cluster failed;
if the manager of the cluster failed, issuing at least one command to power down all unresponsive servers in the cluster, wherein a server is powered down when the server will not become responsive in the future, wherein an unresponsive server is a server that fails to send a periodic message that indicates the server is functioning properly;
if the manager of the cluster did not fail, issuing at least one command to power down a server corresponding to the received indication of server failure;
determining whether the power down of the at least one of the other servers was successful;
if the power down of the at least one of the other servers was successful, enabling the failing over any resources on the at least one of the other servers that was powered down to at least one server that is responsive; and
if the power down of the at least one of the other servers was not successful, disabling the cluster.
2. An apparatus comprising:
(A) at least one processor;
(B) a memory coupled to the at least one processor;
(C) a server process residing in the memory and executed by the at least one processor, wherein the server process resides in a logical partition defined on the apparatus;
(D) a cluster engine residing in the memory and executed by the at least one processor, the cluster engine handling communications between the server process and other servers in a cluster, the cluster engine comprising:
(D1) a heartbeat mechanism that sends a periodic message to the other servers in the cluster to indicate the server process is functioning properly and that receives periodic messages from the other servers in the cluster that indicate the other servers in the cluster are functioning properly;
(D2) a membership change mechanism that generates a membership change message to all servers in the cluster when any of the servers in the cluster become unresponsive;
(E) a quorum-based server power-down mechanism residing in the memory and executed by the at least one processor, the quorum-based server power-down mechanism determining whether the server process is part of a group of servers that has quorum in the cluster, wherein a cluster has quorum when a majority of servers in the cluster are responsive, wherein in determining the majority of servers, if there is an odd number of servers in the cluster, each server in the cluster counts as one server, and if there is an even number of servers in the cluster, each server in the cluster that is not a manager of the cluster counts as one server and the manager of the cluster counts as two servers, and if so, the quorum-based server power-down mechanism determining whether the manager of the cluster failed when an indication of a server failure is received, and if the manager of the cluster failed, the quorum-based server power-down mechanism issues at least one command to power down all unresponsive servers in the cluster, wherein a server is powered down when the server will not become responsive in the future, wherein an unresponsive server is a server that fails to send a periodic message that indicates the server is functioning properly, and if a manager of the cluster did not fail, the quorum-based server power-down mechanism issues at least one command to power down a server corresponding to the received indication of server failure, wherein the quorum-based server power-down mechanism determines whether the power down of the at least one of the other servers was successful, and if the power down of the at least one of the other servers was successful, the quorum-based server power-down mechanism enables failing over any resources on the at least one of the other servers that was powered down to at least one server that is responsive, and if the power down of the at least one of the other servers was not successful, the quorum-based server power-down mechanism disables the cluster; and
(F) a service processor that receives the command, and in response, powers down at least one of the other servers.
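The heartbeat mechanism (D1) and membership change mechanism (D2) recited above might be approximated as in the sketch below. The class name `HeartbeatMonitor`, the `timeout` parameter, the message dictionary layout, and passing timestamps in explicitly are all assumptions chosen to keep the example small and deterministic; the claim does not specify how a missed periodic message is detected.

```python
class HeartbeatMonitor:
    """Tracks periodic messages from peers and reports newly silent ones."""

    def __init__(self, peers, timeout):
        self.timeout = timeout                      # assumed silence threshold, in seconds
        self.last_seen = {peer: 0.0 for peer in peers}
        self.unresponsive = set()

    def record(self, peer, now):
        # Called whenever a periodic message arrives from a peer.
        self.last_seen[peer] = now

    def check(self, now):
        """Return one membership change message per newly unresponsive peer."""
        newly_silent = {
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout and peer not in self.unresponsive
        }
        # Once marked, a peer stays marked; each peer is reported only once.
        self.unresponsive |= newly_silent
        return [
            {"type": "membership_change", "unresponsive": peer}
            for peer in sorted(newly_silent)
        ]
```

A caller would invoke `check` on a timer and hand any returned messages to the quorum-based power-down logic as the indication of a server failure.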
3. A computer readable recordable media bearing a computer program, the computer program comprising:
(A) a cluster engine that handles communications between a plurality of servers in a cluster, wherein at least one server in the cluster resides in a logical partition, the cluster engine comprising:
(A1) a heartbeat mechanism that sends a periodic message to other servers in the cluster to indicate the server process is functioning properly and that receives periodic messages from the other servers in the cluster that indicate the other servers in the cluster are functioning properly;
(A2) a membership change mechanism that generates a membership change message to all servers in the cluster when any of the servers in the cluster become unresponsive; and
(A3) a quorum-based server power-down mechanism residing in the memory and executed by the at least one processor, the quorum-based server power-down mechanism determining whether the server process is part of a group of servers that has quorum in the cluster, wherein a cluster has quorum when a majority of servers in the cluster are responsive, wherein in determining the majority of servers, if there is an odd number of servers in the cluster, each server in the cluster counts as one server, and if there is an even number of servers in the cluster, each server in the cluster that is not a manager of the cluster counts as one server and the manager of the cluster counts as two servers, and if so, the quorum-based server power-down mechanism determining whether the manager of the cluster failed when an indication of a server failure is received, and if the manager of the cluster failed, the quorum-based server power-down mechanism issues at least one command to power down all unresponsive servers in the cluster, wherein a server is powered down when the server will not become responsive in the future, wherein an unresponsive server is a server that fails to send a periodic message that indicates the server is functioning properly, and if a manager of the cluster did not fail, the quorum-based server power-down mechanism issues at least one command to power down a server corresponding to the received indication of server failure, wherein the quorum-based server power-down mechanism determines whether the power down of the at least one of the other servers was successful, and if the power down of the at least one of the other servers was successful, the quorum-based server power-down mechanism enables failing over any resources on the at least one of the other servers that was powered down to at least one server that is responsive, and if the power down of the at least one of the other servers was not successful, the quorum-based server power-down mechanism disables the cluster.
Specification