Fault tolerant federation of computing clusters
First Claim
1. At a computer system that includes at least one processor, a computer-implemented method for facilitating communication and maximizing an efficiency of directed work flow between computing nodes in a cluster federation, the method comprising:
an act of identifying a plurality of computing nodes that are to be a part of the cluster federation, the cluster federation including a master cluster and a worker cluster, wherein the master cluster includes a first master node and a second master node, and wherein the worker cluster includes a worker node;
an act of assigning a director role to the first master node, which first master node is included in the master cluster, wherein the first master node, after being assigned the director role, governs decisions that affect consistency within the cluster federation;
an act of assigning a leader role to the second master node, which second master node is also included in the master cluster, wherein the second master node, after being assigned the leader role, monitors and controls the worker node in the worker cluster, whereby the master cluster includes the first master node having the director role and the second master node having the leader role, the first master node being different than the second master node, the director role being different than the leader role;
an act of maintaining a partitioned database that is usable to facilitate communication between the first master node and the second master node, wherein any particular entry in the partitioned database is changeable by only one node included within the master cluster such that the communication between the first master node and the second master node occurs without acquiring a database lock;
an act of assigning a worker agent role to the worker node, wherein the worker node, after being assigned the worker agent role, receives and processes workload assignments from the master cluster; and
after waiting a predetermined time interval during which a status update from the worker agent role is not received, an act of the leader role communicating to the director role a failure of the worker agent role by recording the failure in the partitioned database, whereby the leader role communicates workload failures to the director role via the partitioned database.
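The last three acts of the claim (the single-writer partitioned database, the status-update timeout, and the failure report from leader to director) can be illustrated with a minimal sketch. It is not the patented implementation: the class names `PartitionedStore`, `Leader`, and `Director`, the key format `failure/<worker>`, and the in-memory dictionary standing in for the database are all assumptions made for illustration. The sketch only shows the ownership rule (each entry has exactly one permitted writer, so no lock is taken) and the timeout check.

```python
from typing import Dict, List


class PartitionedStore:
    """Hypothetical stand-in for the partitioned database.

    Each partition is owned by one master node, and only the owner may
    change entries in it. Any node may read any partition.
    """

    def __init__(self) -> None:
        self._partitions: Dict[str, Dict[str, str]] = {}

    def write(self, writer: str, owner: str, key: str, value: str) -> None:
        # Single-writer rule: reject changes from any node but the owner.
        if writer != owner:
            raise PermissionError(f"{writer} may not change entries owned by {owner}")
        self._partitions.setdefault(owner, {})[key] = value

    def entries(self, owner: str) -> Dict[str, str]:
        # Return a copy so readers cannot mutate the owner's partition.
        return dict(self._partitions.get(owner, {}))


class Leader:
    """Master node holding the leader role (assumed interface)."""

    def __init__(self, name: str, store: PartitionedStore, timeout_s: float) -> None:
        self.name = name
        self.store = store
        self.timeout_s = timeout_s  # the "predetermined time interval"
        self._last_seen: Dict[str, float] = {}

    def record_status(self, worker: str, now: float) -> None:
        self._last_seen[worker] = now

    def check_workers(self, now: float) -> List[str]:
        """Record a failure for every worker silent longer than timeout_s."""
        failed = []
        for worker, seen in self._last_seen.items():
            if now - seen > self.timeout_s:
                self.store.write(self.name, self.name,
                                 f"failure/{worker}", "no status update")
                failed.append(worker)
        return failed


class Director:
    """Master node holding the director role (assumed interface)."""

    def __init__(self, store: PartitionedStore) -> None:
        self.store = store

    def failed_workers(self, leader_name: str) -> List[str]:
        # The director learns of failures by reading the leader's partition.
        prefix = "failure/"
        return [k[len(prefix):] for k in self.store.entries(leader_name)
                if k.startswith(prefix)]
```

In this sketch the leader writes only to its own partition and the director only reads it, so the two roles exchange information without contending for the same entry. Timestamps are passed in explicitly so that the timeout logic can be exercised without a real clock.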
Abstract
Embodiments are directed to organizing computing nodes in a cluster federation and to reassigning roles in a cluster federation. In one scenario, a computer system identifies computing nodes that are to be part of a cluster federation which includes a master cluster and worker clusters. The computer system assigns a director role to a master node in the master cluster which governs decisions that affect consistency within the federation, and further assigns a leader role to at least one master node which monitors and controls other master nodes in the master cluster. The computer system assigns a worker agent role to a worker node which receives workload assignments from the master cluster, and further assigns a worker role to a worker node which processes the assigned workload. The organized cluster federation provides fault tolerance by allowing roles to be dynamically reassigned to computing nodes in different master and worker clusters.
21 Claims
1. At a computer system that includes at least one processor, a computer-implemented method for facilitating communication and maximizing an efficiency of directed work flow between computing nodes in a cluster federation, the method comprising:
an act of identifying a plurality of computing nodes that are to be a part of the cluster federation, the cluster federation including a master cluster and a worker cluster, wherein the master cluster includes a first master node and a second master node, and wherein the worker cluster includes a worker node;
an act of assigning a director role to the first master node, which first master node is included in the master cluster, wherein the first master node, after being assigned the director role, governs decisions that affect consistency within the cluster federation;
an act of assigning a leader role to the second master node, which second master node is also included in the master cluster, wherein the second master node, after being assigned the leader role, monitors and controls the worker node in the worker cluster, whereby the master cluster includes the first master node having the director role and the second master node having the leader role, the first master node being different than the second master node, the director role being different than the leader role;
an act of maintaining a partitioned database that is usable to facilitate communication between the first master node and the second master node, wherein any particular entry in the partitioned database is changeable by only one node included within the master cluster such that the communication between the first master node and the second master node occurs without acquiring a database lock;
an act of assigning a worker agent role to the worker node, wherein the worker node, after being assigned the worker agent role, receives and processes workload assignments from the master cluster; and
after waiting a predetermined time interval during which a status update from the worker agent role is not received, an act of the leader role communicating to the director role a failure of the worker agent role by recording the failure in the partitioned database, whereby the leader role communicates workload failures to the director role via the partitioned database.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. At a computer system that includes at least one processor, a computer-implemented method for facilitating communication and maximizing an efficiency when reassigning roles in a cluster federation, the method comprising:
an act of identifying a cluster federation that includes a master cluster and a worker cluster, wherein the master cluster includes a plurality of master nodes that each have corresponding master node functionalities, and wherein the worker cluster includes a worker node, wherein a first master node and a second master node are included in the plurality of master nodes, the first master node being assigned a director role, the second master node being assigned a leader role, whereby the master cluster includes both the first master node having the director role and the second master node having the leader role, the first master node being different than the second master node, the director role being different than the leader role;
an act of maintaining a partitioned database that is usable to facilitate communication between at least two master nodes included within the plurality, wherein any particular entry in the partitioned database is changeable by only one master node included within the plurality such that the communication between the at least two master nodes occurs without acquiring a database lock;
an act of setting a policy requirement for the cluster federation, wherein the policy requirement at least requires that the cluster federation maintain a specified number of master nodes in the master cluster;
an act of determining that a current number of master nodes included within the master cluster is below the specified number of master nodes such that the cluster federation is not meeting the policy requirement;
an act of determining that the worker node is available for reassignment;
an act of reassigning the worker node to become a new master node, such that the worker node, after being reassigned to become the new master node, adopts the master node functionalities;
an act of the new master node transmitting a workload assignment to a worker agent role in the worker cluster; and
after waiting a predetermined time interval during which a status update from the worker agent role is not received, an act of the new master node communicating a failure of the worker agent role by recording the failure in the partitioned database, whereby the new master node communicates workload failures via the partitioned database.
Dependent claims: 13, 14, 15, 16
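The policy-driven reassignment recited in claim 12 (set a minimum number of master nodes, detect a shortfall, find an available worker node, promote it) can be sketched as follows. This is an illustrative assumption, not the claimed implementation: the `Node` dataclass, its `available` flag, and the function `enforce_min_masters` are hypothetical names, and the claim does not say how availability is determined or which worker is chosen first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    role: str               # "master" or "worker" (assumed encoding)
    available: bool = True  # assumed flag: node may be reassigned


def enforce_min_masters(nodes: List[Node], min_masters: int) -> List[str]:
    """Promote available workers until the master count meets the policy.

    Returns the names of the nodes that were promoted, in order.
    """
    promoted: List[str] = []
    current = sum(1 for n in nodes if n.role == "master")
    for node in nodes:
        if current >= min_masters:
            break  # policy requirement is met
        if node.role == "worker" and node.available:
            node.role = "master"  # node now adopts master functionality
            promoted.append(node.name)
            current += 1
    return promoted
```

For example, with one master, one unavailable worker, and two available workers, a policy of two masters promotes only the first available worker. The mirror-image policy in claim 17 (a minimum number of worker nodes, met by demoting an available master) would follow the same shape with the roles swapped.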
17. A computer system comprising the following:
one or more processors; and
one or more computer-readable storage media having stored thereon computer-executable instructions that are executable by the one or more processors and that cause the computer system to perform a method for facilitating communication and maximizing an efficiency when reassigning roles in a cluster federation, the method comprising the following:
an act of identifying a cluster federation that includes a master cluster and a worker cluster, wherein the master cluster includes a plurality of master nodes that each include corresponding master node functionalities, and wherein the worker cluster includes a worker node with corresponding worker node functionalities, wherein a first master node and a second master node are included in the plurality of master nodes, the first master node being assigned a director role, the second master node being assigned a leader role, whereby the master cluster includes both the first master node having the director role and the second master node having the leader role, the first master node being different than the second master node, the director role being different than the leader role;
an act of maintaining a partitioned database that is usable to facilitate communication between at least two master nodes included within the plurality, wherein any particular entry in the partitioned database is changeable by only one master node included within the plurality such that the communication between the at least two master nodes occurs without acquiring a database lock;
an act of setting a policy requirement for the cluster federation, wherein the policy requirement at least requires that the cluster federation maintain a specified number of worker nodes in the worker cluster;
an act of determining that a current number of worker nodes included within the worker cluster is below the specified number of worker nodes such that the cluster federation is not meeting the policy requirement;
an act of determining that at least one master node in the plurality of master nodes is available for reassignment;
an act of reassigning the at least one master node to become a new worker node, such that the at least one master node, after being reassigned to become the new worker node, adopts the worker node functionalities;
an act of a different master node transmitting a workload assignment to the new worker node; and
after waiting a predetermined time interval during which a status update from the new worker node is not received, an act of the different master node communicating a failure of the new worker node by recording the failure in the partitioned database, whereby the different master node communicates workload failures via the partitioned database.
Dependent claims: 18, 19, 20, 21
Specification