Map-reduce ready distributed file system
7 Assignments
0 Petitions
Abstract
A map-reduce compatible distributed file system is built from successive component layers, each providing the basis on which the next layer is built. The system provides transactional read-write-update semantics with file chunk replication and very high file-create rates. Containers provide the fundamental basis for data replication, relocation, and transactional updates. A container location database allows containers to be found among all file servers and defines precedence among replicas of containers to organize transactional updates of container contents. Volumes facilitate control of data placement, creation of snapshots and mirrors, and retention of a variety of control and policy information. Also addressed are the use of distributed transactions in a map-reduce system; the use of local and distributed snapshots; replication, including techniques for reconciling the divergence of replicated data after a crash; and mirroring.
38 Citations
12 Claims
1. A map-reduce compatible distributed file system, comprising:
a container location database (CLDB) configured to maintain information about where each of a plurality of containers is located;
a plurality of cluster nodes, each cluster node containing one or more storage pools, each storage pool containing zero or more containers; and
a plurality of inodes for structuring data within said containers;
wherein said containers are replicated to other cluster nodes with one container designated as master for each replication chain;
wherein data in the CLDB is itself stored as inodes in well-known containers;
wherein said CLDB inodes are configured to maintain a database that contains at least the following information about all of said containers:
nodes that have replicas of a container;
an ordering of a replication chain for each container;
wherein updates to a container are sent to a master for said updated container;
wherein changes to content of a container are propagated to the replicas of the container by said master; and
wherein, when non-empty sets of nodes containing replicas of a container fail and a set of failing nodes includes a node containing the master, the CLDB detects the loss of contact with the failing nodes and notices that the failing nodes contain a previous master;
wherein a surviving node containing a replica, if any such exists, is designated as a new master and other nodes are assigned positions in the replication structure;
wherein an epoch number for the container is incremented at the same time that the new master is designated;
wherein the new master increments the transaction identifier to guarantee that a gap occurs in a sequence of transaction identifiers;
wherein the new master then records an end of a set of transactions that were handled by the previous master in a previous epoch and records a starting point for transaction identifiers in said new epoch; and
wherein all other replicas are considered to be out-of-date.
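The failover rule in claim 1 can be illustrated with a short sketch. All class and variable names here (`Container`, `handle_node_failures`, `TXN_GAP`) are invented for illustration and are not from the patent; the gap size is an arbitrary assumption. The point shown is the claimed sequence: the CLDB promotes a surviving replica, bumps the epoch at the moment of designation, and the new master skips ahead in the transaction-identifier sequence so a gap marks the epoch boundary.

```python
# Hypothetical sketch of the failover rule: names and the gap size are assumed.
from dataclasses import dataclass, field

TXN_GAP = 1000  # assumed size of the deliberate gap in transaction identifiers

@dataclass
class Container:
    epoch: int = 1
    replicas: list = field(default_factory=list)  # ordered chain; replicas[0] is master
    last_txn: int = 0
    epoch_log: list = field(default_factory=list)  # (epoch, last txn of epoch, first txn of next)

def handle_node_failures(container: Container, failed_nodes: set) -> None:
    """CLDB-side reaction to losing contact with a set of nodes."""
    old_master = container.replicas[0]
    survivors = [n for n in container.replicas if n not in failed_nodes]
    if old_master not in failed_nodes or not survivors:
        return  # master survived, or no replica left to promote
    # A surviving replica becomes the new master; others keep chain positions.
    container.replicas = survivors
    container.epoch += 1                  # epoch incremented with the designation
    end_of_old = container.last_txn
    container.last_txn += TXN_GAP         # forced gap in the txn-id sequence
    # Record the end of the previous master's transactions and the new start.
    container.epoch_log.append((container.epoch - 1, end_of_old, container.last_txn + 1))

c = Container(replicas=["nodeA", "nodeB", "nodeC"], last_txn=42)
handle_node_failures(c, failed_nodes={"nodeA"})
print(c.replicas[0], c.epoch)  # nodeB 2
```

The recorded gap is what later lets replicas that missed the failover be recognized as out-of-date: any replica still producing identifiers inside the gap must belong to the previous epoch.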
2. A map-reduce compatible distributed file system, comprising:
a container location database (CLDB) configured to maintain information about where each of a plurality of containers is located;
a plurality of cluster nodes, each cluster node containing one or more storage pools, each storage pool containing zero or more containers; and
a plurality of inodes for structuring data within said containers;
wherein said containers are replicated to other cluster nodes with one container designated as master for each replication chain;
wherein data in the CLDB is itself stored as inodes in well-known containers;
wherein said CLDB inodes are configured to maintain a database that contains at least the following information about all of said containers:
nodes that have replicas of a container;
an ordering of a replication chain for each container;
wherein updates to a container are sent to a master for said updated container;
wherein changes to content of a container are propagated to the replicas of the container by said master; and
wherein, when a number of nodes containing replicas of a container come into service at about the same time, said nodes contact the CLDB as they come up and report a last epoch that they have seen, as well as a last update identifier that they have recorded;
wherein, until a configurable minimum number of nodes comes up, no updates are allowed and a master is not designated, although read operations may be allowed;
wherein, once a minimum number of nodes have reported to the CLDB and a configurable delay has passed to allow any slow nodes to reappear, the CLDB selects one of the nodes as master and arranges the others in a replication structure;
wherein the node selected as master is the node that reports that it has seen the most recent epoch and transaction identifier; and
wherein all other replicas are considered to be out-of-date.
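The restart protocol of claim 2 can be sketched as follows. The class and parameter names are invented, and the configurable settle delay for slow nodes is omitted for brevity; what the sketch shows is the claimed selection rule: no master is designated below the minimum count, and once enough replicas have reported, the node with the most recent (epoch, transaction identifier) pair wins.

```python
# Illustrative sketch of claim 2's restart election; all names are assumed.
MIN_REPLICAS = 2  # stands in for the configurable minimum number of nodes

class RestartTracker:
    def __init__(self, min_replicas=MIN_REPLICAS):
        self.min_replicas = min_replicas
        self.reports = {}  # node -> (last_epoch, last_txn) reported at startup

    def report(self, node, last_epoch, last_txn):
        self.reports[node] = (last_epoch, last_txn)

    def elect_master(self):
        """Return (master, chain) once enough nodes reported, else None.
        (The configurable delay for slow nodes is not modeled here.)"""
        if len(self.reports) < self.min_replicas:
            return None  # no updates allowed yet; reads may still be served
        # Most recent (epoch, txn) pair wins; remaining nodes form the chain.
        chain = sorted(self.reports, key=lambda n: self.reports[n], reverse=True)
        return chain[0], chain

t = RestartTracker()
t.report("nodeA", 3, 500)
assert t.elect_master() is None       # below quorum: no master designated
t.report("nodeB", 4, 120)             # nodeB saw a later epoch than nodeA
master, chain = t.elect_master()
print(master)  # nodeB
```

Note that the epoch dominates the transaction identifier in the comparison: nodeB wins despite a smaller identifier, because identifiers only order transactions within an epoch.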
3. A map-reduce compatible distributed file system, comprising:
a container location database (CLDB) configured to maintain information about where each of a plurality of containers is located;
a plurality of cluster nodes, each cluster node containing one or more storage pools, each storage pool containing zero or more containers; and
a plurality of inodes for structuring data within said containers;
wherein said containers are replicated to other cluster nodes with one container designated as master for each replication chain;
wherein data in the CLDB is itself stored as inodes in well-known containers;
wherein said CLDB inodes are configured to maintain a database that contains at least the following information about all of said containers:
nodes that have replicas of a container;
an ordering of a replication chain for each container;
wherein updates to a container are sent to a master for said updated container;
wherein changes to content of a container are propagated to the replicas of the container by said master;
wherein said CLDB stores a location of all replicas of a container, a structure of a replication for the container, and an epoch number for each container, wherein said epoch number is incremented each time the replication structure for a container is changed;
wherein an epoch's changes are noted in a transaction history for each version of the container and gaps are inserted whenever the master is noted in a replication chain; and
wherein said CLDB traces back through transactions that have been applied to each copy when examining the master and target copies of a same container to determine a point in a history of the two containers when they were identical.
Dependent claims: 4, 5, 6, 7, 8, 9, 10.
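The trace-back step of claim 3 amounts to finding the latest point at which two copies' transaction histories agree. A minimal sketch, assuming a hypothetical representation of each history as an ordered list of (epoch, transaction-id) pairs:

```python
# Sketch of claim 3's reconciliation: trace back through the transactions
# applied to each copy to find the last point where they were identical.
# The history representation here is an assumption, not the patent's format.

def last_common_txn(master_history, replica_history):
    """Histories are lists of (epoch, txn_id) in increasing order; return the
    latest entry that both copies share, or None if they never agreed."""
    common = None
    shared = set(replica_history)
    for entry in master_history:
        if entry in shared:
            common = entry  # keep the latest shared (epoch, txn) pair
    return common

master_h  = [(1, 10), (1, 11), (2, 1012), (2, 1013)]  # id gap marks the new epoch
replica_h = [(1, 10), (1, 11), (1, 12)]               # diverged late in epoch 1
print(last_common_txn(master_h, replica_h))  # (1, 11)
```

Everything after the common point on the diverged copy, here the orphaned transaction (1, 12), would be rolled back before the replica resynchronizes from the master.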
11. A map-reduce compatible distributed file system, comprising:
a container location database (CLDB) configured to maintain information about where each of a plurality of containers is located;
a plurality of cluster nodes, each cluster node containing one or more storage pools, each storage pool containing zero or more containers; and
a plurality of inodes for structuring data within said containers;
wherein said containers are replicated to other cluster nodes with one container designated as master for each replication chain;
wherein data in the CLDB is itself stored as inodes in well-known containers;
wherein said CLDB inodes are configured to maintain a database that contains at least the following information about all of said containers:
nodes that have replicas of a container;
an ordering of a replication chain for each container;
wherein updates to a container are sent to a master for said updated container;
wherein changes to content of a container are propagated to the replicas of the container by said master;
wherein a transaction is performed in a manner that is unaffected by failures;
wherein a master node writes data on a subsidiary node B to a transaction log;
wherein a transaction is written into an orphanage that reverses the effect of a write operation if a reference on a subsidiary node A is not found at a later time;
wherein said subsidiary node B returns a reference to the written data to the master node;
wherein said reference is sent to said subsidiary node A;
wherein, when said data was created on node B, a background thread was started or a cleanup event was scheduled, causing said node B to inspect said orphanage at a time substantially after the original write occurs;
wherein said orphanage entry causes node B to inspect node A or one of the replicas of node A to see if a reference to the data written on node B exists;
wherein if a reference does exist, then no action is taken;
wherein if a reference does not exist, said orphanage entry created in said transaction on node B is executed, which reverses the effect of the original writing of the data; and
wherein if a reference on node A is never created, then new data on node B is never accessible, such that said reference and said data appear atomically or not at all.
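The orphanage mechanism of claim 11 can be sketched in a few lines. All classes and method names below are invented for illustration; transaction logs, replication, and scheduling are reduced to plain dictionaries and an explicit scan call. The essential shape is the claimed one: node B commits the data together with a compensating entry, node A later stores the reference, and a deferred scan on B keeps the data only if A's reference exists, so reference and data appear atomically or not at all.

```python
# Minimal sketch of claim 11's orphanage pattern; all names are assumed.

class NodeB:
    def __init__(self):
        self.data = {}       # ref -> payload
        self.orphanage = []  # refs whose node-A reference must be re-checked

    def write(self, ref, payload):
        # The data and its compensating orphanage entry commit together.
        self.data[ref] = payload
        self.orphanage.append(ref)
        return ref  # reference handed back to the master node

    def run_orphanage_scan(self, node_a):
        """Deferred cleanup: drop any data that node A never referenced."""
        for ref in list(self.orphanage):
            if not node_a.has_reference(ref):
                del self.data[ref]  # execute the compensating entry
            self.orphanage.remove(ref)

class NodeA:
    def __init__(self):
        self.refs = set()
    def add_reference(self, ref):
        self.refs.add(ref)
    def has_reference(self, ref):
        return ref in self.refs

b, a = NodeB(), NodeA()
r1 = b.write("blob-1", b"committed")
a.add_reference(r1)               # the reference reaches node A
b.write("blob-2", b"abandoned")   # reference never created (simulated failure)
b.run_orphanage_scan(a)
print(sorted(b.data))  # ['blob-1']
```

The design avoids any cross-node lock: each step is a local transaction, and the scan makes the outcome eventually consistent with whether the reference was installed.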
12. A map-reduce compatible distributed file system, comprising:
a container location database (CLDB) configured to maintain information about where each of a plurality of containers is located;
a plurality of cluster nodes, each cluster node containing one or more storage pools, each storage pool containing zero or more containers; and
a plurality of inodes for structuring data within said containers;
wherein said containers are replicated to other cluster nodes with one container designated as master for each replication chain;
wherein data in the CLDB is itself stored as inodes in well-known containers;
wherein said CLDB inodes are configured to maintain a database that contains at least the following information about all of said containers:
nodes that have replicas of a container;
an ordering of a replication chain for each container;
wherein updates to a container are sent to a master for said updated container; and
wherein changes to content of a container are propagated to the replicas of the container by said master; and
further comprising:
a distributed transaction in the form of a snapshot of a file system volume, consisting of directories and files spread over a number of containers;
wherein all data and meta-data for a volume is organized into a single name container and zero or more data containers;
wherein all cross-container references to data are segregated into the name container while keeping all of the data in data containers;
wherein all references from one data container to another data container are mediated by data structures in the name container;
wherein a volume snapshot proceeds by creating a snapshot of the name container and then creating snapshots of the data containers;
wherein data structures that are inserted into the data containers are limited to references to such data structures that are inserted into the data containers that are from data structures that name the container;
wherein, once snapshots of all containers in a volume are created, a table is created that maps identifiers of the original containers to identifiers of corresponding snapshots;
wherein said table is injected into at least the name container snapshot and, optionally, into all of the snapshot containers;
wherein said table is used to translate references to any of the original containers into references to the corresponding snapshot container; and
wherein, as modifications are made to data in the original volume, changed disk blocks are written to fresh storage and modified data structures are copied to new locations because the copy-on-write bit is set, leaving data in the snapshot intact.
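The snapshot procedure of claim 12 can be reduced to a small sketch. The function names, container identifiers, and reference format below are all assumptions for illustration; copy-on-write block handling is not modeled. What the sketch shows is the claimed ordering and the identifier-mapping table: the name container is snapshotted first, then the data containers, and the resulting table translates references to original containers into references to their snapshots.

```python
# Rough sketch of claim 12's volume snapshot; names and formats are assumed.

def snapshot_volume(name_container_id, data_container_ids, next_id):
    """Snapshot the name container first, then each data container, and build
    the table mapping original container ids to snapshot container ids."""
    id_map, order = {}, []
    for cid in [name_container_id] + list(data_container_ids):
        id_map[cid] = next_id  # create the snapshot container for cid
        order.append(cid)
        next_id += 1
    # In the claim, id_map is then injected into at least the name container
    # snapshot so snapshot-internal references can be resolved.
    return id_map, order

def translate(ref, id_map):
    """Rewrite a (container_id, inode) reference to point into the snapshot."""
    cid, inode = ref
    return (id_map[cid], inode)

id_map, order = snapshot_volume(100, [101, 102], next_id=900)
assert order[0] == 100              # name container is snapshotted first
print(translate((101, 7), id_map))  # (901, 7)
```

Because every cross-container reference passes through the name container, translating references at that single point is sufficient to make the whole snapshot self-consistent.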
Specification