Map-reduce ready distributed file system
First Claim
1. A map-reduce compatible distributed file system, comprising:
- a plurality of containers in which each container stores file and directory meta-data as well as file content data;
- wherein references to file content data are stored on a subset of nodes on which container meta-data and data are stored; and
- wherein container data and meta-data are arranged to allow a topological sort to imply update order;
- a container location database (CLDB) configured to maintain information about where each of said plurality of containers is located;
- a plurality of cluster nodes, each cluster node containing one or more storage pools, each storage pool containing zero or more containers; and
- a plurality of inodes for structuring data within said containers;
- wherein said CLDB is configured to assign nodes as replicas of data in a container to meet policy constraints in accordance with any of the following:
- said CLDB assigns each container a master node that controls all transactions for that container;
- said CLDB designates a chain of nodes to hold replicas;
- when one of the replicas goes down or is separated from the master CLDB node, it is removed from the replication chain;
- when the master goes down or is separated, a new master is designated;
- any node that comes back after having been removed from the replication chain is reinserted at the end of the replication chain when the chain still needs another replica when the node returns;
- when the node returns within a first predetermined interval, no new node to replicate the container in question has been designated, and the chain still needs another replica; and
- when the node has been gone for a second, longer predetermined interval, the CLDB may designate some other node to take a place in the chain.
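The failover rules above can be sketched as a small chain-management class. This is a minimal illustration only, not the patented implementation: the class name, method names, and the two timeout values are assumptions introduced for the sketch; the claim specifies only that a first and a second, longer predetermined interval exist.

```python
class ReplicationChain:
    """Sketch of the CLDB replication-chain rules in claim 1.
    Names and timeout values are illustrative assumptions."""

    def __init__(self, nodes, short_timeout=5.0, long_timeout=60.0):
        self.chain = list(nodes)   # chain[0] is the container master
        self.short_timeout = short_timeout
        self.long_timeout = long_timeout
        self.removed_at = {}       # node -> time it was removed

    def master(self):
        # The head of the chain controls all transactions for the container.
        return self.chain[0]

    def node_failed(self, node, now):
        # A failed or partitioned replica is removed from the chain; if it
        # was the master, the next node in the chain becomes the new master.
        if node in self.chain:
            self.chain.remove(node)
            self.removed_at[node] = now

    def node_returned(self, node, now, desired_replicas):
        # A returning node is reinserted at the end of the chain only if it
        # came back within the first interval and the chain still needs
        # another replica (no replacement has been designated meanwhile).
        gone_for = now - self.removed_at.pop(node, now)
        if gone_for <= self.short_timeout and len(self.chain) < desired_replicas:
            self.chain.append(node)
            return True
        return False

    def maybe_designate_replacement(self, node, spare, now, desired_replicas):
        # After the second, longer interval the CLDB may give the failed
        # node's place in the chain to some other node.
        gone_for = now - self.removed_at.get(node, now)
        if gone_for > self.long_timeout and len(self.chain) < desired_replicas:
            self.chain.append(spare)
            return True
        return False
```

For example, when the master fails, the next replica in the chain takes over immediately, while the failed node is either reinserted at the tail on a quick return or replaced after the longer interval elapses.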
7 Assignments
0 Petitions
Abstract
A map-reduce compatible distributed file system, built from successive component layers in which each layer provides the basis for the next, provides transactional read-write-update semantics with file-chunk replication and very high file-create rates. A primitive storage layer (storage pools) knits together raw block stores and provides a storage mechanism for containers and transaction logs. Storage pools are manipulated by individual file servers. Containers provide the fundamental basis for data replication, relocation, and transactional updates. A container location database allows containers to be found among all file servers and defines precedence among replicas of containers to organize transactional updates of container contents. Volumes facilitate control of data placement, creation of snapshots and mirrors, and retention of a variety of control and policy information. Key-value stores relate keys to data for purposes such as directories, container location maps, and offset maps in compressed files.
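The container location database's role described above can be illustrated with a minimal sketch. All class and method names here are hypothetical and introduced only for illustration; the sketch assumes the head of each replica list is the replica with update precedence (the master), as the abstract's notion of precedence among replicas suggests.

```python
class ContainerLocationDB:
    """Illustrative sketch of a container location map: container id ->
    ordered list of file servers holding replicas. The list head is
    treated as the replica with precedence for transactional updates."""

    def __init__(self):
        self._locations = {}

    def register(self, container_id, replicas):
        # Record which file servers hold replicas of this container.
        self._locations[container_id] = list(replicas)

    def lookup(self, container_id):
        # All known replica locations, in precedence order.
        return self._locations.get(container_id, [])

    def master(self, container_id):
        # The replica that controls transactional updates, if any.
        replicas = self.lookup(container_id)
        return replicas[0] if replicas else None

cldb = ContainerLocationDB()
cldb.register(2049, ["fs-a", "fs-b", "fs-c"])
```

A client resolving a file chunk would first look up the chunk's container here, then contact the master replica for updates or any replica for reads.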
19 Claims
Specification