Distributed file system using consensus nodes
First Claim
1. A computer-implemented method of implementing a distributed file system comprising a cluster comprising a plurality of data nodes configured to store data blocks of client files, the method comprising:
- coupling a first active metadata server computer to the plurality of data nodes of the cluster via a network and configuring the first active metadata server computer to store a first replica of a namespace of the cluster in a first memory;
receiving via the network and executing, by the first active metadata server computer, a first request from a first client computer of the cluster to make a first change to a state of the first replica of the namespace of the cluster;
coupling a second active metadata server computer to the plurality of data nodes of the cluster and configuring the second active metadata server computer to store a second replica of a namespace of the cluster in a second memory;
receiving via the network and executing, by the second active metadata server computer, a second request from a second client computer of the cluster to make a second change to a state of the second replica of the namespace of the cluster while the first active metadata server computer is executing the first request from the first client computer of the cluster to make the first change to the state of the first replica of the namespace;
receiving, by a coordination engine computer, the first and second requests, serializing the received requests and generating an ordered global sequence of agreements that specifies an order in which the first and second active metadata server computers are to change the stored first and second replicas of the state of the namespace of the cluster, respectively;
maintaining the first and second replicas of the namespace of the cluster consistent with one another by configuring the first and second active metadata server computers to make changes to the stored state of the first and second replicas of the namespace of the cluster only after having received the ordered global sequence of agreements and in the order specified by the received ordered global sequence of agreements; and
setting a communication timeout configured such that, if one of the first and second active metadata server computers having received a client request has not responded thereto at an expiry of the communication timeout, the client request is transmitted to the other one of the first and second active metadata server computers.
3 Assignments
0 Petitions
Accused Products
Abstract
A cluster of nodes in a distributed file system may include; at least two namenodes, each coupled to a plurality of data nodes and each configured to store a state of a namespace of the cluster and each being configured to respond to a request from a client while other(s) of the namenodes are responding to other requests from other clients; and a coordination engine coupled to each of the namenodes. The coordination engine may be configured to receive proposals from the namenodes to change the state of the namespace by replicating, deleting and/or adding data blocks stored in the data nodes and to generate, in response, an ordered set of agreements that specifies an order in which the namenodes are to change the state of the namespace. The namenodes are configured to delay making changes thereto until after the ordered set of agreements is received from the coordination engine.
-
Citations
11 Claims
-
1. A computer-implemented method of implementing a distributed file system comprising a cluster comprising a plurality of data nodes configured to store data blocks of client files, the method comprising:
-
coupling a first active metadata server computer to the plurality of data nodes of the cluster via a network and configuring the first active metadata server computer to store a first replica of a namespace of the cluster in a first memory; receiving via the network and executing, by the first active metadata server computer, a first request from a first client computer of the cluster to make a first change to a state of the first replica of the namespace of the cluster; coupling a second active metadata server computer to the plurality of data nodes of the cluster and configuring the second active metadata server computer to store a second replica of a namespace of the cluster in a second memory; receiving via the network and executing, by the second active metadata server computer, a second request from a second client computer of the cluster to make a second change to a state of the second replica of the namespace of the cluster while the first active metadata server computer is executing the first request from the first client computer of the cluster to make the first change to the state of the first replica of the namespace; receiving, by a coordination engine computer, the first and second requests, serializing the received requests and generating an ordered global sequence of agreements that specifies an order in which the first and second active metadata server computers are to change the stored first and second replicas of the state of the namespace of the cluster, respectively; maintaining the first and second replicas of the namespace of the cluster consistent with one another by configuring the first and second active metadata server computers to make changes to the stored state of the first and second replicas of the namespace of the cluster only after having received the ordered global sequence of agreements and in the order specified by the received ordered global sequence of agreements; and setting a communication timeout configured such that, if one of the first and second active metadata server computers having received a client request has not responded thereto at an expiry of the communication timeout, the client request is transmitted to the other one of the first and second active metadata server computers. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. A cluster of nodes comprising computing devices configured to implement a distributed file system, comprising:
-
a plurality of data nodes, each data node configured to store data blocks of client files; a first active hardware metadata server computer that comprises a first memory and that is coupled to the plurality of data nodes, the first active hardware metadata server computer being configured to store a first replica of a namespace of the cluster in the first memory and to receive and execute a first request from a first client computer of the cluster to make a first change to a state of the first replica of the namespace; and a second active hardware metadata server computer that comprises a second memory and that is coupled to the plurality of data nodes, the second active hardware metadata server computer being configured to store a second replica of the namespace of the cluster in the second memory and to receive via the network and execute a second request from a second client computer of the cluster to make a second change to the state of the second replica of the namespace while the first active hardware metadata server computer is executing the first request from the first client computer of the cluster, and a coordination engine computer configured to receive and to serialize the first and second requests to generate an ordered global sequence of agreements that specifies an order in which the first and second active hardware metadata server computers are to chance the first and second replicas of the state of the namespace of the cluster, respectively; wherein the first and second active hardware metadata servers computers are further configured to maintain the first and second replicas consistent with one another by configuring the first and second active hardware metadata server computers to make changes to the state of the first and second replicas of the namespace of the cluster only after having received the ordered global sequence of agreements and in the order specified by the ordered global sequence of agreements; and setting a communication timeout configured such that, if one of the first and second active hardware metadata server computers having received a client request has not responded thereto at an expiry of the communication timeout, the client request is transmitted to the other one of the first and second active hardware metadata server computers.
-
-
10. A non-transitory machine-readable medium having data stored thereon representing sequences of instructions which, when executed by computing devices, cause the computing devices to implement a distributed file system by:
-
coupling a first active metadata server computer to the plurality of data nodes of the cluster via a network and configuring the first active metadata server computer to store a first replica of a namespace of the cluster in a first memory; receiving via the network and executing, by the first active metadata server computer, a first request from a first client computer of the cluster to make a first change to a state of the first replica of the namespace of the cluster; coupling a second active metadata server computer to the plurality of data nodes of the cluster and configuring the second active metadata server computer to store a second replica of a namespace of the cluster in a second memory; receiving via the network and executing, by the second active metadata server computer, a second request from a second client computer of the cluster to make a second change to a state of the second replica of the namespace of the cluster while the first active metadata server computer is executing the first request from the first client computer of the cluster to make the first change to the state of the first replica of the namespace; receiving, by a coordination engine computer, the first and second requests, serializing the received requests and generating an ordered global sequence of agreements that specifies an order in which the first and second active metadata server computers are to change the first and second replicas of the state of the namespace of the cluster, respectively; maintaining the first and second replicas of the namespace of the cluster consistent with one another by configuring the first and second active metadata server computers to make changes to the state of the first and second replicas of the namespace of the cluster only after having received the ordered global sequence of agreements and in the order specified by the received ordered global sequence of agreements, and setting a communication timeout configured such that, if one of the first and second active metadata server computers having received a client request has not responded thereto at an expiry of the communication timeout, the client request is transmitted to the other one of the first and second active metadata server computers.
-
-
11. A computer-implemented method of implementing a distributed file system comprising a cluster comprising a plurality of data nodes configured to store data blocks of client files, the method comprising:
-
receiving via a network at least one first proposal from a first active metadata server computer in the cluster to change a state of a first replica of a namespace of the cluster, the at least one first proposal being generated by the first active metadata server computer after having received at least one first request to change the state of the first replica of the namespace from at least one client computer; receiving via the network at least one second proposal from a second active metadata server computer in the cluster to change a state of a second replica of a namespace of the cluster, the at least one second proposal being generated by the second active metadata server computer after having received at least one second request to change the state of the second replica of the namespace from at least one client computer; generating, in response to receiving the at least one first proposal from the first active metadata server computer and in response to receiving the at least one second proposal from the second active metadata server computer, an ordered global sequence of agreements that specifies an order in which the first and second active metadata server computers are to change the state of the first and second replicas of the cluster, respectively, to implement at least some of the at least one first and at least one second requests to change the state of the first and second replicas of the namespace by at least one of replicating, deleting and adding data blocks in at least one of the plurality of data nodes; setting a communication timeout configured such that, if one of the first and second active metadata server computers having received a client request has not responded thereto at an expiry of the communication timeout, the client request is transmitted to the other one of the first and second active metadata server computers; communicating the ordered global sequence of agreements to the first active metadata server; communicating the ordered global sequence of agreements to the second active metadata server, the communicated ordered global sequence of agreements constraining the first and second active metadata server computers to only make changes to the state of the first and second replicas of the namespace of the cluster after having received the ordered global sequence of agreements and in the order specified by the received ordered global sequence of agreements.
-
Specification