Map-reduce ready distributed file system
7 Assignments
0 Petitions
Abstract
A map-reduce compatible distributed file system is built from successive component layers, each providing the basis on which the next layer is built. The system provides transactional read-write-update semantics with file chunk replication and very high file-create rates. Containers provide the fundamental basis for data replication, relocation, and transactional updates. A container location database allows containers to be found among all file servers and defines precedence among replicas of containers to organize transactional updates of container contents. Volumes facilitate control of data placement, creation of snapshots and mirrors, and retention of a variety of control and policy information. Also addressed are the use of distributed transactions in a map-reduce system; the use of local and distributed snapshots; replication, including techniques for reconciling the divergence of replicated data after a crash; and mirroring.
38 Citations
12 Claims
1. A map-reduce compatible distributed file system, comprising:
a container location database (CLDB) configured to maintain information about where each of a plurality of containers is located;
a plurality of cluster nodes, each cluster node containing one or more storage pools, each storage pool containing zero or more containers; and
a plurality of inodes for structuring data within said containers;
wherein said containers are replicated to other cluster nodes with one container designated as master for each replication chain;
wherein data in the CLDB is itself stored as inodes in well-known containers;
wherein said CLDB inodes are configured to maintain a database that contains at least the following information about all of said containers:
nodes that have replicas of a container;
an ordering of a replication chain for each container;
wherein updates to a container are sent to a master for said updated container;
wherein changes to content of a container are propagated to the replicas of the container by said master; and
wherein, when non-empty sets of nodes containing replicas of a container fail and a set of failing nodes includes a node containing the master, the CLDB detects the loss of contact with the failing nodes and notices that the failing nodes contain a previous master;
wherein a surviving node containing a replica, if any such exists, is designated as a new master and other nodes are assigned positions in the replication structure;
wherein an epoch number for the container is incremented at the same time that the new master is designated;
wherein the new master increments the transaction identifier to guarantee that a gap occurs in a sequence of transaction identifiers;
wherein the new master then records an end of a set of transactions that were handled by the previous master in a previous epoch and records a starting point for transaction identifiers in said new epoch; and
wherein all other replicas are considered to be out-of-date.
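The failover rule in claim 1 can be illustrated with a short sketch. All class and variable names here (`Container`, `handle_node_failures`, `TXN_GAP`) are invented for illustration and are not from the patent; the gap size is an arbitrary assumption. The point shown is the claimed sequence: the CLDB promotes a surviving replica, bumps the epoch at the moment of designation, and the new master skips ahead in the transaction-identifier sequence so a gap marks the epoch boundary.

```python
# Hypothetical sketch of the failover rule: names and the gap size are assumed.
from dataclasses import dataclass, field

TXN_GAP = 1000  # assumed size of the deliberate gap in transaction identifiers

@dataclass
class Container:
    epoch: int = 1
    replicas: list = field(default_factory=list)  # ordered chain; replicas[0] is master
    last_txn: int = 0
    epoch_log: list = field(default_factory=list)  # (epoch, last txn of epoch, first txn of next)

def handle_node_failures(container: Container, failed_nodes: set) -> None:
    """CLDB-side reaction to losing contact with a set of nodes."""
    old_master = container.replicas[0]
    survivors = [n for n in container.replicas if n not in failed_nodes]
    if old_master not in failed_nodes or not survivors:
        return  # master survived, or no replica left to promote
    # A surviving replica becomes the new master; others keep chain positions.
    container.replicas = survivors
    container.epoch += 1                  # epoch incremented with the designation
    end_of_old = container.last_txn
    container.last_txn += TXN_GAP         # forced gap in the txn-id sequence
    # Record the end of the previous master's transactions and the new start.
    container.epoch_log.append((container.epoch - 1, end_of_old, container.last_txn + 1))

c = Container(replicas=["nodeA", "nodeB", "nodeC"], last_txn=42)
handle_node_failures(c, failed_nodes={"nodeA"})
print(c.replicas[0], c.epoch)  # nodeB 2
```

The recorded gap is what later lets replicas that missed the failover be recognized as out-of-date: any replica still producing identifiers inside the gap must belong to the previous epoch.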
2. A map-reduce compatible distributed file system, comprising:
a container location database (CLDB) configured to maintain information about where each of a plurality of containers is located;
a plurality of cluster nodes, each cluster node containing one or more storage pools, each storage pool containing zero or more containers; and
a plurality of inodes for structuring data within said containers;
wherein said containers are replicated to other cluster nodes with one container designated as master for each replication chain;
wherein data in the CLDB is itself stored as inodes in well-known containers;
wherein said CLDB inodes are configured to maintain a database that contains at least the following information about all of said containers:
nodes that have replicas of a container;
an ordering of a replication chain for each container;
wherein updates to a container are sent to a master for said updated container;
wherein changes to content of a container are propagated to the replicas of the container by said master; and
wherein, when a number of nodes containing replicas of a container come into service at about the same time, said nodes contact the CLDB as they come up and report a last epoch that they have seen, as well as a last update identifier that they have recorded;
wherein, until a configurable minimum number of nodes comes up, no updates are allowed and a master is not designated, although read operations may be allowed;
wherein, once a minimum number of nodes have reported to the CLDB and a configurable delay has passed to allow any slow nodes to reappear, the CLDB selects one of the nodes as master and arranges the others in a replication structure;
wherein the node selected as master is the node that reports that it has seen the most recent epoch and transaction identifier; and
wherein all other replicas are considered to be out-of-date.
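The restart protocol of claim 2 can be sketched as follows. The class and parameter names are invented, and the configurable settle delay for slow nodes is omitted for brevity; what the sketch shows is the claimed selection rule: no master is designated below the minimum count, and once enough replicas have reported, the node with the most recent (epoch, transaction identifier) pair wins.

```python
# Illustrative sketch of claim 2's restart election; all names are assumed.
MIN_REPLICAS = 2  # stands in for the configurable minimum number of nodes

class RestartTracker:
    def __init__(self, min_replicas=MIN_REPLICAS):
        self.min_replicas = min_replicas
        self.reports = {}  # node -> (last_epoch, last_txn) reported at startup

    def report(self, node, last_epoch, last_txn):
        self.reports[node] = (last_epoch, last_txn)

    def elect_master(self):
        """Return (master, chain) once enough nodes reported, else None.
        (The configurable delay for slow nodes is not modeled here.)"""
        if len(self.reports) < self.min_replicas:
            return None  # no updates allowed yet; reads may still be served
        # Most recent (epoch, txn) pair wins; remaining nodes form the chain.
        chain = sorted(self.reports, key=lambda n: self.reports[n], reverse=True)
        return chain[0], chain

t = RestartTracker()
t.report("nodeA", 3, 500)
assert t.elect_master() is None       # below quorum: no master designated
t.report("nodeB", 4, 120)             # nodeB saw a later epoch than nodeA
master, chain = t.elect_master()
print(master)  # nodeB
```

Note that the epoch dominates the transaction identifier in the comparison: nodeB wins despite a smaller identifier, because identifiers only order transactions within an epoch.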
3. A map-reduce compatible distributed file system, comprising:
a container location database (CLDB) configured to maintain information about where each of a plurality of containers is located;
a plurality of cluster nodes, each cluster node containing one or more storage pools, each storage pool containing zero or more containers; and
a plurality of inodes for structuring data within said containers;
wherein said containers are replicated to other cluster nodes with one container designated as master for each replication chain;
wherein data in the CLDB is itself stored as inodes in well-known containers;
wherein said CLDB inodes are configured to maintain a database that contains at least the following information about all of said containers:
nodes that have replicas of a container;
an ordering of a replication chain for each container;
wherein updates to a container are sent to a master for said updated container;
wherein changes to content of a container are propagated to the replicas of the container by said master;
wherein said CLDB stores a location of all replicas of a container, a structure of a replication for the container, and an epoch number for each container, wherein said epoch number is incremented each time the replication structure for a container is changed;
wherein an epoch's changes are noted in a transaction history for each version of the container and gaps are inserted whenever the master is noted in a replication chain; and
wherein said CLDB traces back through transactions that have been applied to each copy when examining the master and target copies of a same container to determine a point in a history of the two containers when they were identical.
Dependent claims: 4, 5, 6, 7, 8, 9, 10.
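The trace-back step of claim 3 amounts to finding the latest point at which two copies' transaction histories agree. A minimal sketch, assuming a hypothetical representation of each history as an ordered list of (epoch, transaction-id) pairs:

```python
# Sketch of claim 3's reconciliation: trace back through the transactions
# applied to each copy to find the last point where they were identical.
# The history representation here is an assumption, not the patent's format.

def last_common_txn(master_history, replica_history):
    """Histories are lists of (epoch, txn_id) in increasing order; return the
    latest entry that both copies share, or None if they never agreed."""
    common = None
    shared = set(replica_history)
    for entry in master_history:
        if entry in shared:
            common = entry  # keep the latest shared (epoch, txn) pair
    return common

master_h  = [(1, 10), (1, 11), (2, 1012), (2, 1013)]  # id gap marks the new epoch
replica_h = [(1, 10), (1, 11), (1, 12)]               # diverged late in epoch 1
print(last_common_txn(master_h, replica_h))  # (1, 11)
```

Everything after the common point on the diverged copy, here the orphaned transaction (1, 12), would be rolled back before the replica resynchronizes from the master.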
11. A map-reduce compatible distributed file system, comprising:
a container location database (CLDB) configured to maintain information about where each of a plurality of containers is located;
a plurality of cluster nodes, each cluster node containing one or more storage pools, each storage pool containing zero or more containers; and
a plurality of inodes for structuring data within said containers;
wherein said containers are replicated to other cluster nodes with one container designated as master for each replication chain;
wherein data in the CLDB is itself stored as inodes in well-known containers;
wherein said CLDB inodes are configured to maintain a database that contains at least the following information about all of said containers:
nodes that have replicas of a container;
an ordering of a replication chain for each container;
wherein updates to a container are sent to a master for said updated container;
wherein changes to content of a container are propagated to the replicas of the container by said master;
wherein a transaction is performed in a manner that is unaffected by failures;
wherein a master node writes data on a subsidiary node B to a transaction log;
wherein a transaction is written into an orphanage that reverses the effect of a write operation if a reference on a subsidiary node A is not found at a later time;
wherein said subsidiary node B returns a reference to the written data to the master node;
wherein said reference is sent to said subsidiary node A;
wherein, when said data was created on node B, a background thread was started or a cleanup event was scheduled, causing said node B to inspect said orphanage at a time substantially after the original write occurs;
wherein said orphanage entry causes node B to inspect node A or one of the replicas of node A to see if a reference to the data written on node B exists;
wherein if a reference does exist, then no action is taken;
wherein if a reference does not exist, said orphanage entry created in said transaction on node B is executed, which reverses the effect of the original writing of the data; and
wherein if a reference on node A is never created, then new data on node B is never accessible, such that said reference and said data appear atomically or not at all.
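The orphanage mechanism of claim 11 can be sketched in a few lines. All classes and method names below are invented for illustration; transaction logs, replication, and scheduling are reduced to plain dictionaries and an explicit scan call. The essential shape is the claimed one: node B commits the data together with a compensating entry, node A later stores the reference, and a deferred scan on B keeps the data only if A's reference exists, so reference and data appear atomically or not at all.

```python
# Minimal sketch of claim 11's orphanage pattern; all names are assumed.

class NodeB:
    def __init__(self):
        self.data = {}       # ref -> payload
        self.orphanage = []  # refs whose node-A reference must be re-checked

    def write(self, ref, payload):
        # The data and its compensating orphanage entry commit together.
        self.data[ref] = payload
        self.orphanage.append(ref)
        return ref  # reference handed back to the master node

    def run_orphanage_scan(self, node_a):
        """Deferred cleanup: drop any data that node A never referenced."""
        for ref in list(self.orphanage):
            if not node_a.has_reference(ref):
                del self.data[ref]  # execute the compensating entry
            self.orphanage.remove(ref)

class NodeA:
    def __init__(self):
        self.refs = set()
    def add_reference(self, ref):
        self.refs.add(ref)
    def has_reference(self, ref):
        return ref in self.refs

b, a = NodeB(), NodeA()
r1 = b.write("blob-1", b"committed")
a.add_reference(r1)               # the reference reaches node A
b.write("blob-2", b"abandoned")   # reference never created (simulated failure)
b.run_orphanage_scan(a)
print(sorted(b.data))  # ['blob-1']
```

The design avoids any cross-node lock: each step is a local transaction, and the scan makes the outcome eventually consistent with whether the reference was installed.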
12. A map-reduce compatible distributed file system, comprising:
a container location database (CLDB) configured to maintain information about where each of a plurality of containers is located;
a plurality of cluster nodes, each cluster node containing one or more storage pools, each storage pool containing zero or more containers; and
a plurality of inodes for structuring data within said containers;
wherein said containers are replicated to other cluster nodes with one container designated as master for each replication chain;
wherein data in the CLDB is itself stored as inodes in well-known containers;
wherein said CLDB inodes are configured to maintain a database that contains at least the following information about all of said containers:
nodes that have replicas of a container;
an ordering of a replication chain for each container;
wherein updates to a container are sent to a master for said updated container; and
wherein changes to content of a container are propagated to the replicas of the container by said master; and
further comprising:
a distributed transaction in the form of a snapshot of a file system volume, consisting of directories and files spread over a number of containers;
wherein all data and meta-data for a volume is organized into a single name container and zero or more data containers;
wherein all cross-container references to data are segregated into the name container while keeping all of the data in data containers;
wherein all references from one data container to another data container are mediated by data structures in the name container;
wherein a volume snapshot proceeds by creating a snapshot of the name container and then creating snapshots of the data containers;
wherein data structures that are inserted into the data containers are limited to references to such data structures that are inserted into the data containers that are from data structures that name the container;
wherein, once snapshots of all containers in a volume are created, a table is created that maps identifiers of the original containers to identifiers of corresponding snapshots;
wherein said table is injected into at least the name container snapshot and, optionally, into all of the snapshot containers;
wherein said table is used to translate references to any of the original containers into references to the corresponding snapshot container; and
wherein, as modifications are made to data in the original volume, changed disk blocks are written to fresh storage and modified data structures are copied to new locations because the copy-on-write bit is set, leaving data in the snapshot intact.
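The snapshot procedure of claim 12 can be reduced to a small sketch. The function names, container identifiers, and reference format below are all assumptions for illustration; copy-on-write block handling is not modeled. What the sketch shows is the claimed ordering and the identifier-mapping table: the name container is snapshotted first, then the data containers, and the resulting table translates references to original containers into references to their snapshots.

```python
# Rough sketch of claim 12's volume snapshot; names and formats are assumed.

def snapshot_volume(name_container_id, data_container_ids, next_id):
    """Snapshot the name container first, then each data container, and build
    the table mapping original container ids to snapshot container ids."""
    id_map, order = {}, []
    for cid in [name_container_id] + list(data_container_ids):
        id_map[cid] = next_id  # create the snapshot container for cid
        order.append(cid)
        next_id += 1
    # In the claim, id_map is then injected into at least the name container
    # snapshot so snapshot-internal references can be resolved.
    return id_map, order

def translate(ref, id_map):
    """Rewrite a (container_id, inode) reference to point into the snapshot."""
    cid, inode = ref
    return (id_map[cid], inode)

id_map, order = snapshot_volume(100, [101, 102], next_id=900)
assert order[0] == 100              # name container is snapshotted first
print(translate((101, 7), id_map))  # (901, 7)
```

Because every cross-container reference passes through the name container, translating references at that single point is sufficient to make the whole snapshot self-consistent.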
Specification