Automatic clusterwide fail-back
Abstract
Systems and procedures may be used to coordinate the fail-back of multiple hosts in environments where the hosts share one or more data-storage resources. In one implementation, a procedure for coordinating fail-backs includes monitoring a failed data path to detect a restoration of the data path, polling remaining nodes in response to the restoration, and allowing the first node to resume communications if access has been restored to the remaining nodes.
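The coordination step described in the abstract can be illustrated with a minimal sketch. It is not taken from the specification: the function names `approve_fail_back` and `on_restoration`, and the `path_restored` and `resume` callbacks, are hypothetical and stand in for whatever cluster messaging an actual implementation would use.

```python
from typing import Callable, Iterable, List


def approve_fail_back(
    nodes: Iterable[str],
    path_restored: Callable[[str], bool],
) -> bool:
    """Return True only if every polled node reports the path as restored."""
    return all(path_restored(node) for node in nodes)


def on_restoration(
    first_node: str,
    active_nodes: List[str],
    path_restored: Callable[[str], bool],
    resume: Callable[[str], None],
) -> bool:
    """Handle a restoration detected by first_node (hypothetical hook).

    The remaining active nodes are polled; first_node is allowed to
    resume on the restored path only if all of them confirm access.
    """
    remaining = [n for n in active_nodes if n != first_node]
    if approve_fail_back(remaining, path_restored):
        resume(first_node)
        return True
    return False  # at least one node lacks access, so no fail-back yet
```

In this sketch a single node that still lacks access is enough to hold back the fail-back, which mirrors the "if access has been restored to the remaining nodes" condition above.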
20 Claims
1. A method comprising:
determining that access by a first node in a computation cluster through a first controller to a storage resource has been restored;
querying at least one of a plurality of active nodes in the computation cluster, wherein the querying is performed in response to the determining that access by the first node through the first controller to the storage resource has been restored, and the active nodes in the computation cluster are nodes in the computation cluster that depend on access to the storage resource at the time of the querying, and for which the first controller is a fail-back target;
receiving confirmation, in response to the querying, that access to the storage resource through the first controller has been restored for each of the active nodes in the computation cluster; and
instructing, based at least in part on the confirmation, the first node to communicate with the storage resource through the first controller.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A method comprising:
determining that a first data path is available for use by a first node in a computation cluster, wherein the computation cluster comprises a plurality of active nodes, the active nodes are nodes that depend on access to a destination, and for which the first data path is a fail-back target; and the first data path was unavailable to the first node prior to the determining;
informing a master node in the computation cluster that the first data path is available for use by the first node;
abstaining from communicating between the first node and the destination over the first data path until a fail-back approval is received from the master node, wherein the abstaining occurs after the determining that the first data path is available for use by the first node; the fail-back approval is received after the informing, and the fail-back approval is based at least in part on a determination that the first data path is available for use by each of the plurality of active nodes; and
communicating between the first node and the destination after the fail-back approval is received from the master node.
Dependent claims: 12, 13, 14
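The node-side behaviour recited in claim 11 (determine, inform, abstain, then communicate) can be pictured as a small state machine. The following is only an illustrative sketch under assumed names: `PathState`, `Node`, and the `notify_master` callback are hypothetical and do not come from the patent text.

```python
from enum import Enum, auto
from typing import Callable, List


class PathState(Enum):
    FAILED = auto()             # first data path unavailable
    AWAITING_APPROVAL = auto()  # path is back, master has been informed
    ACTIVE = auto()             # fail-back approval received


class Node:
    def __init__(self, name: str, notify_master: Callable[[str], None]):
        self.name = name
        self.notify_master = notify_master
        self.state = PathState.FAILED
        self.sent: List[str] = []  # payloads sent over the first data path

    def on_path_available(self) -> None:
        """Path detected as available: inform the master, but do not use it yet."""
        if self.state is PathState.FAILED:
            self.state = PathState.AWAITING_APPROVAL
            self.notify_master(self.name)

    def on_fail_back_approval(self) -> None:
        """Master approved the fail-back; the path may now carry traffic."""
        if self.state is PathState.AWAITING_APPROVAL:
            self.state = PathState.ACTIVE

    def send(self, payload: str) -> bool:
        """Send over the first data path only after approval; otherwise abstain."""
        if self.state is not PathState.ACTIVE:
            return False
        self.sent.append(payload)
        return True
```

The key point of the sketch is the intermediate `AWAITING_APPROVAL` state: the node knows its own path works again, yet `send` keeps refusing until the master's approval arrives.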
15. A system comprising:
a first host comprising a restoration detection module configured to monitor a failed communications path to a storage resource, and a query module configured to transmit a request for approval for a fail-back, wherein a plurality of hosts are coupled to the failed communications path, and the query module is configured to transmit the request in response to the restoration detection module detecting a restoration of the failed communications path; and
a master host coupled to the first host, the master host comprising a coordination module configured to receive the request from the first host, determine, in response to the request, whether each active host in the plurality of hosts is ready to perform a fail-back, wherein the active hosts are hosts that depend on access to the storage resource, and transmit a fail-back approval to the first host, wherein the fail-back approval is based at least in part on the request, and the first host is configured to abstain from communicating over the failed communications path until receiving the fail-back approval.
Dependent claims: 16, 17, 18
19. A non-transient computer readable medium having encoded therein instructions executable on one or more processors, wherein the instructions are configured to implement each of:
determining that access by a first node in a computation cluster through a first controller to a storage resource has been restored;
querying at least one of a plurality of active nodes in the computation cluster, wherein the querying is performed in response to the determining that access by the first node through the first controller to the storage resource has been restored, and the active nodes in the computation cluster are nodes in the computation cluster that depend on access to the storage resource at the time of the querying, and for which the first controller is a fail-back target; and
receiving confirmation, in response to the querying, that access to the storage resource through the first controller has been restored for each of the active nodes in the computation cluster.
Dependent claims: 20
Specification