Automated node restart in clustered computer system
Abstract
An apparatus, program product and method initiate a restart of a node in a clustered computer system using a member of a clustering group that resides on a different node from that to be restarted. Typically, a restart operation is initiated by the member in response to a membership change message sent by another group member that is resident on the node to be restarted, with an indicator associated with the membership change message that indicates that a restart should be initiated. Typically, the restart is implemented in much the same manner as a start operation that is performed when a node is initially added to a cluster, with additional functionality utilized to preclude repeated restart attempts upon a failure of a prior restart operation.
101 Citations
35 Claims
-
1. A method of restarting a node in a clustered computer system, wherein the clustered computer system hosts a group including first and second members that reside respectively on first and second nodes, the method comprising:
-
(a) in response to a clustering failure on the first node, notifying the second member of the group using the first member by issuing a request to the group; and
(b) in response to the notification, initiating a restart of the first node using the second member;
wherein the group further includes a third member that resides on a third node in the clustered computer system, and wherein issuing the request to the group includes forwarding the request to each of the second and third members of the group, the method further comprising selecting the second member to initiate the restart of the first node.
wherein notifying the second member of the group using the first member is performed in response to detecting the clustering failure in the first node and determining that the clustering failure did not occur during a restart of the first node.
-
-
8. The method of claim 7, further comprising signaling an error in response to detecting the clustering failure in the first node if the clustering failure occurred during a restart of the first node.
-
9. The method of claim 7, wherein determining whether the clustering failure occurred during a restart of the first node includes determining whether the start node request indicates that the start node request is for the purpose of restarting the first node.
-
10. The method of claim 9, further comprising:
-
(a) counting protocols processed by the first node after a restart; and
(b) signaling an error in response to detecting the clustering failure in the first node if the clustering failure occurred during a restart of the first node and the number of protocols processed by the first node after the restart is less than a predetermined threshold.
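The check described in claims 8 through 10 can be illustrated with a minimal Python sketch. It is one possible reading of the claim language, not the patented implementation: the names `StartContext`, `record_protocol`, and `on_clustering_failure`, the returned strings, and the default threshold of 3 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StartContext:
    """Hypothetical record of how clustering was last started on a node."""
    started_for_restart: bool      # assumed to be copied from the start node request
    protocols_processed: int = 0   # protocols handled since that start

    def record_protocol(self) -> None:
        """Count one protocol processed after the (re)start."""
        self.protocols_processed += 1


def on_clustering_failure(ctx: StartContext, threshold: int = 3) -> str:
    """Decide how the failing node reacts (illustrative outcomes only).

    A failure is treated as occurring during a restart when the node was
    started by a restart request and has processed fewer protocols than
    the threshold; in that case an error is signaled instead of asking
    the group for another restart.
    """
    failed_during_restart = (
        ctx.started_for_restart and ctx.protocols_processed < threshold
    )
    return "signal_error" if failed_during_restart else "request_restart"
```

Under these assumptions, a node that fails immediately after being restarted signals an error, while a node that has processed at least `threshold` protocols since the restart is treated as stable enough to request a new restart.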
-
-
11. The method of claim 5, further comprising, in response to the clustering failure on the first node, terminating clustering on the first node after notifying the second member of the group using the first member.
-
12. A method of restarting a node in a clustered computer system, wherein the clustered computer system hosts a group including first and second members that reside respectively on first and second nodes, the method comprising:
-
(a) in response to a clustering failure on the first node, notifying the second member of the group using the first member; and
(b) in response to the notification, initiating a restart of the first node using the second member;
wherein notifying the second member comprises issuing a membership change request to the group using the first member, wherein issuing the membership change request includes indicating in association with the membership change request that the membership change request is for the purpose of restarting the first node; and
wherein indicating that the membership change request is for the purpose of restarting the first node includes setting a reason field in the membership change request to a restart value.
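The reason field recited in claim 12 might be modeled as follows. This is a sketch under stated assumptions: the claim requires only that a reason field be set to a restart value, so the enum values, the class `MembershipChangeRequest`, and both helper functions are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Reason(Enum):
    """Hypothetical reason codes; only RESTART is named in the claim."""
    LEAVE = "leave"
    RESTART = "restart"


@dataclass(frozen=True)
class MembershipChangeRequest:
    node: str       # node whose membership is changing
    reason: Reason  # why the change was requested


def build_failure_request(failed_node: str) -> MembershipChangeRequest:
    """Built by the member on the failing node before clustering ends there."""
    return MembershipChangeRequest(node=failed_node, reason=Reason.RESTART)


def requests_restart(request: MembershipChangeRequest) -> bool:
    """Checked by members on other nodes when the request arrives."""
    return request.reason is Reason.RESTART
```

Carrying the intent inside the membership change request lets the receiving members distinguish an ordinary departure from a departure that should be followed by a restart, without a separate message.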
-
-
13. A method of restarting a node in a clustered computer system, wherein the clustered computer system hosts a group including first and second members that reside respectively on first and second nodes, the method comprising:
-
(a) in response to a clustering failure on the first node, notifying the second member of the group using the first member;
(b) in response to the notification, initiating a restart of the first node using the second member; and
(c) in response to the notification, selecting the second member from a plurality of members in the group to initiate the restart of the first node;
wherein selecting the second member to initiate the restart of the first node includes determining that the second member is a lowest named member among the plurality of members.
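The selection rule of claim 13 can be sketched in Python as below. The sketch assumes that "lowest named" means the smallest name under ordinary string comparison, which the claim does not specify; the function names are hypothetical.

```python
from typing import Iterable, Optional


def select_restarter(members: Iterable[str], failed: str) -> Optional[str]:
    """Return the lowest-named surviving member, or None if none remain."""
    candidates = [name for name in members if name != failed]
    return min(candidates) if candidates else None


def should_initiate_restart(my_name: str, members: Iterable[str], failed: str) -> bool:
    """Each surviving member evaluates this locally on its own node."""
    return select_restarter(members, failed) == my_name
```

Because every surviving member applies the same deterministic rule to the same membership list, exactly one member concludes that it should initiate the restart, and no extra coordination messages are needed.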
-
-
14. A method of restarting a node among a plurality of nodes in a clustered computer system, wherein the clustered computer system hosts a cluster control group including a plurality of cluster control members, each residing respectively on a different node from the plurality of nodes, the method comprising:
-
(a) detecting a clustering failure on a first node among the plurality of nodes;
(b) in response to detecting the clustering failure on the first node, issuing a membership change request from the first node to the cluster control member on at least second and third nodes among the plurality of nodes, the membership change request indicating that the membership change request is for the purpose of restarting the first node;
(c) terminating clustering on the first node after issuing the membership change request;
(d) in response to the membership change request, selecting the second node to restart the first node from among the at least second and third nodes receiving the membership change request, wherein the at least second and third nodes receiving the membership change request are different from the first node;
(d) after selecting the second node, issuing a start node request using the selected second node, the start node request indicating that the purpose of the start node request is for restarting the first node; and
(e) in response to the start node request, restarting the first node by initiating clustering on the first node.
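The sequence recited in claim 14 can be traced with a small in-memory simulation. It is a sketch only: the classes `Node` and `Cluster`, the dictionary-based requests, and the log strings are hypothetical, and the lowest-name selection rule is an assumption, since claim 14 does not say how the second node is selected.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    clustering_active: bool = True
    log: List[str] = field(default_factory=list)


class Cluster:
    def __init__(self, names: List[str]) -> None:
        self.nodes: Dict[str, Node] = {n: Node(n) for n in names}

    def fail(self, failed_name: str) -> Optional[str]:
        """Simulate a detected clustering failure on one node."""
        failed = self.nodes[failed_name]
        # Issue a membership change request flagged as a restart to the
        # cluster control members on the other active nodes.
        request = {"node": failed_name, "reason": "restart"}
        recipients = [n for n in self.nodes.values()
                      if n.name != failed_name and n.clustering_active]
        # Terminate clustering on the failed node after issuing the request.
        failed.clustering_active = False
        failed.log.append("clustering terminated")
        if not recipients:
            return None
        # Select one recipient (assumed rule: lowest name).
        selected = min(n.name for n in recipients)
        for node in recipients:
            if node.name == selected:
                # The selected node issues a start node request marked as a restart.
                start = {"target": request["node"], "purpose": "restart",
                         "issued_by": node.name}
                node.log.append(f"issued start node request for {failed_name}")
                self._start_node(start)
            else:
                node.log.append(f"deferred to {selected}")
        return selected

    def _start_node(self, start: Dict[str, str]) -> None:
        """Restart the target by initiating clustering on it."""
        target = self.nodes[start["target"]]
        target.clustering_active = True
        target.log.append(f"clustering started by {start['issued_by']} "
                          f"(purpose={start['purpose']})")
```

For example, with nodes `A`, `B`, and `C`, calling `Cluster(["A", "B", "C"]).fail("A")` selects `B` under the assumed rule; `C` records that it deferred, and `A` ends with clustering active again.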
-
-
15. A method of restarting a node among a plurality of nodes in a clustered computer system, wherein the clustered computer system hosts a cluster control group including a plurality of cluster control members, each residing respectively on a different node from the plurality of nodes, the method comprising:
-
(a) detecting a clustering failure on a first node among the plurality of nodes;
(b) in response to detecting the clustering failure on the first node, issuing a membership change request from the first node to the cluster control member on each other node in the plurality of nodes, the membership change request indicating that the membership change request is for the purpose of restarting the first node;
(c) terminating clustering on the first node after issuing the membership change request;
(d) in response to the membership change request, selecting a second node from the plurality of nodes that is different from the first node;
(d) issuing a start node request using the selected second node, the start node request indicating that the purpose of the start node request is for restarting the first node;
(e) in response to the start node request, initiating clustering on the first node; and
(f) in response to a second clustering failure during initiation of clustering on the first node:
(i) determining from the start node request that initiated clustering on the first node that the purpose of the start node request is for restarting the first node; and
(ii) in response to determining that the start node request is for restarting the first node, signaling an error instead of initiating a second restart of the first node.
-
-
16. An apparatus, comprising:
-
(a) a memory accessible by a node in a clustered computer system; and
(b) a program resident in the memory, the program configured to initiate a restart of another node in the clustered computer system in response to a notification from the other node of a clustering failure on the other node, wherein the program comprises a member of a group hosted by the clustered computer system, the group including an additional member residing on the other node, wherein the notification comprises a request issued to the group by the additional member, and wherein the program is further configured to determine, in response to the request, that the node upon which the program is resident should initiate the restart of the other node, such that only one node in the clustered computer system initiates the restart of the other node in response to the request.
-
-
22. A clustered computer system, comprising:
-
(a) first, second, and third nodes coupled to one another over a network; and
(b) a group including first, second, and third members, the first member resident on the first node, the second member resident on the second node, and the third member resident on the third node, wherein the first member is configured to notify the second and third members in response to a clustering failure on the first node by issuing a request to the group that is forwarded to each of the second and third members of the group, wherein the group is configured to select the second member to initiate a restart of the first node in response to the notification, and wherein the second member is configured to initiate the restart of the first node in response to the notification.
-
-
26. A clustered computer system, comprising:
-
(a) first and second nodes coupled to one another over a network; and
(b) a group including first and second members, the first member resident on the first node and the second member resident on the second node, wherein the first member is configured to notify the second member in response to a clustering failure on the first node, and wherein the second member is configured to initiate a restart of the first node in response to the notification;
wherein the first member is configured to detect the clustering failure in the first node, determine whether the clustering failure occurred during a restart of the first node, and notify the second member in response to detecting the clustering failure in the first node and determining that the clustering failure did not occur during a restart of the first node.
-
-
32. A program product, comprising:
-
(a) a program configured to reside on a node in a clustered computer system, the program configured to initiate a restart of another node in the clustered computer system in response to a notification from the other node of a clustering failure on the other node, wherein the program comprises a member of a group hosted by the clustered computer system, the group including an additional member residing on the other node, wherein the notification comprises a request issued to the group by the additional member, and wherein the program is further configured to determine, in response to the request, that the node upon which the program is resident should initiate the restart of the other node, such that only one node in the clustered computer system initiates the restart of the other node in response to the request; and
(b) a signal storage medium storing the program.
-
-
33. A program product, comprising:
-
(a) first, second and third programs respectively configured to reside on first, second and third nodes in a clustered computer system, the first, second and third programs respectively operating as first, second and third members of a group, the first program configured to notify the second program and the third program in response to a clustering failure on the first node by issuing a request to the group that is forwarded to each of the second and third members, the second program configured to, in response to the request, determine that the second program should initiate a restart of the first node, the second program further configured to initiate the restart of the first node after determining that the second program should initiate the restart, and the third program configured to defer to the second program to initiate the restart of the first node; and
(b) at least one signal storage medium storing the first and second programs.
-
Specification