Map-reduce ready distributed file system

US 9,646,024 B2
Filed: 04/21/2016
Issued: 05/09/2017
Est. Priority Date: 06/19/2010
Status: Active Grant

Abstract

A map-reduce compatible distributed file system that consists of successive component layers, each providing the basis on which the next layer is built, provides transactional read-write-update semantics with file chunk replication and huge file-create rates. A primitive storage layer (storage pools) knits together raw block stores and provides a storage mechanism for containers and transaction logs. Storage pools are manipulated by individual file servers. Containers provide the fundamental basis for data replication, relocation, and transactional updates. A container location database allows containers to be found among all file servers, as well as defining precedence among replicas of containers to organize transactional updates of container contents. Volumes facilitate control of data placement, creation of snapshots and mirrors, and retention of a variety of control and policy information. Key-value stores relate keys to data for such purposes as directories, container location maps, and offset maps in compressed files.

67 Claims

1. A distributed file system comprising:
- a processor, said processor implementing a plurality of storage pools that bind raw block stores together and that provide a storage mechanism for containers and transaction logs;
  
  said processor implementing a plurality of containers configured for any of data replication, relocation, and transactional updates; and
  
  a container location database configured to locate specific containers within a plurality of file servers, and with which precedence among replicas of containers is defined to organize transactional updates of container contents;
  
  wherein each said storage pool comprises a plurality of bitmap extents, a plurality of log extents, and a map of container id (CID) to container disk offset, each of which is stored in a super block that is replicated to several well-known locations in the storage pool;
  
  wherein said bitmap extents comprise pointers to multiple block allocation bitmaps for the storage pool;
  
  wherein said log extents comprise pointers to portions of the storage pool that are used to store transaction logs for the storage pool; and
  
  wherein said map of container id (CID) to disk offsets comprises a mechanism for looking up container IDs to find disk offsets in the storage pool.
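
The storage pool layout recited in claim 1 can be pictured with a minimal Python sketch. The class names (`Extent`, `SuperBlock`), the helper `replicate_super_block`, the block numbers, and the in-memory representation are all assumptions made for illustration; the claim only states that the super block holds bitmap extents, log extents, and a CID-to-offset map, and that it is replicated to several well-known locations.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Extent:
    """Pointer to a contiguous run of blocks in the pool (hypothetical layout)."""
    start_block: int
    block_count: int


@dataclass
class SuperBlock:
    """Holds the three structures that claim 1 places in the super block."""
    # Pointers to the block allocation bitmaps for the pool.
    bitmap_extents: List[Extent] = field(default_factory=list)
    # Pointers to the regions of the pool that store transaction logs.
    log_extents: List[Extent] = field(default_factory=list)
    # Container id (CID) -> disk offset of that container in the pool.
    cid_to_offset: Dict[int, int] = field(default_factory=dict)

    def lookup_container_offset(self, cid: int) -> Optional[int]:
        """Return the disk offset for `cid`, or None if the pool lacks it."""
        return self.cid_to_offset.get(cid)


def replicate_super_block(sb: SuperBlock,
                          well_known_blocks: List[int]) -> Dict[int, SuperBlock]:
    """Model writing identical super block copies to well-known locations."""
    return {block: sb for block in well_known_blocks}


# Illustrative values only.
sb = SuperBlock(
    bitmap_extents=[Extent(8, 4), Extent(4096, 4)],
    log_extents=[Extent(16, 64)],
    cid_to_offset={2049: 1_048_576},
)
copies = replicate_super_block(sb, well_known_blocks=[0, 1024, 2048])
```

Keeping several copies at fixed locations means a reader can find the pool's bitmaps, logs, and containers even if one copy of the super block is unreadable.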

2. A distributed file system, comprising:
- a processor, said processor implementing a plurality of containers in which each container stores file and directory meta-data as well as file content data;
  
  wherein references to file content data are stored on a subset of nodes on which container meta-data and data are stored;
  
  a container location database (CLDB) configured to maintain information about where each of said plurality of containers is located;
  
  a plurality of cluster nodes, each cluster node containing one or more storage pools, each storage pool containing zero or more containers;
  
  a plurality of inodes for structuring data within said containers;
  
  wherein said CLDB is configured to assign nodes as replicas of data in a container to meet policy constraints;
  
  wherein each said storage pool comprises a plurality of bitmap extents, a plurality of log extents, and a map of container id (CID) to container disk offset, each of which is stored in a super block that is replicated to several well-known locations in the storage pool;
  
  wherein said bitmap extents comprise pointers to multiple block allocation bitmaps for the storage pool;
  
  wherein said log extents comprise pointers to portions of the storage pool that are used to store transaction logs for the storage pool; and
  
  wherein said map of container id (CID) to disk offsets comprises a mechanism for looking up container IDs to find disk offsets in the storage pool.
- Dependent claims (3-21):
  - 3. The distributed file system of claim 2, wherein container data and metadata are arranged to allow a topological sort to imply update order.
  - 4. The distributed file system of claim 2, wherein said CLDB assigns each container a master node that controls all transactions for that container.
  - 5. The distributed file system of claim 2, wherein said CLDB designates a chain of nodes to hold replicas.
  - 6. The distributed file system of claim 2, wherein when one of the replicas goes down or is separated from the master CLDB node, it is removed from the replication chain.
  - 7. The distributed file system of claim 2, wherein when the master goes down or is separated, a new master is designated.
  - 8. The map-reduce compatible distributed file system of claim 2, wherein any node that comes back after having been removed from the replication chain is reinserted at the end of the replication chain when the chain still needs another replication chain when the node returns.
  - 9. The distributed file system of claim 2, wherein when the node returns within a first predetermined interval, no new node to replicate the container in question has been designated and the chain still needs a replication chain.
  - 10. The distributed file system of claim 2, wherein when the node has been gone for a second, longer predetermined interval, the CLDB may designate some other node to take a place in the chain.
  - 11. The distributed file system of claim 2, further comprising:
    - a map-reduce compatible shuffle function, wherein each map function writes to the distributed file system and each reduce function reads input from the distributed file system.
  - 12. The distributed file system of claim 2, further comprising:
    - a plurality of volumes configured to facilitate any of control of data placement, creation of mirrors, and retention of control and policy information.
  - 13. The distributed file system of claim 2, further comprising:
    - a plurality of key-value stores configured to relate keys to data for any of directories, container location maps, and offset maps in compressed files.
  - 14. The distributed file system of claim 2, wherein said containers are replicated to other cluster nodes with one container designated as master for each replication chain.
  - 15. The distributed file system of claim 2, wherein said CLDB is maintained by a plurality of redundant servers;
    - and wherein data in the CLDB is itself stored as inodes in well-known containers.
  - 16. The distributed file system of claim 15, wherein said CLDB nodes are configured to maintain a database that contains at least the following information about all of said containers:
    - nodes that have replicas of a container; and
      
      an ordering of a replication chain for each container.
  - 17. The distributed file system of claim 2, wherein container master is configured to control updates to replication chains transactionally.
  - 18. The distributed file system of claim 2, wherein all inode data structures and indirect data b-trees comprise version numbers that facilitate updating container replicas that have missed transactions.
  - 19. The distributed file system of claim 2, wherein data is stored in the distributed file system on multiple block-addressable data stores that comprise block devices that represent any of entire disks, flash memory systems, partitions of either of these, and individual files stored in a conventional file system;
    - wherein each data store supports random reading and writing of relatively small, fixed-size blocks of data.
  - 20. The distributed file system of claim 2, further comprising:
    - a plurality of file identifiers (FID), each FID referring to an inode in a particular container, each FID comprising a container id, an inode number, and an integer chosen to make contents of the FID unique, even if an inode is reused for a different purpose.
  - 21. The distributed file system of claim 2, wherein said distributed file system is configured as a read-write access file system, wherein random updates and reads occur from any node in a cluster and/or from any device that has unfettered access to other devices in the cluster.
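
The replication-chain behaviour described in claims 4 through 10 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the class `ReplicationChain`, its method names, the convention that the first node in the list is the master, and the rule that the next node in the chain is promoted when the master is lost are all hypothetical choices made for the example.

```python
from typing import List, Optional


class ReplicationChain:
    """Ordered replicas of one container; index 0 is treated as the master."""

    def __init__(self, nodes: List[str], desired_replicas: int) -> None:
        self.nodes = list(nodes)
        self.desired_replicas = desired_replicas

    @property
    def master(self) -> Optional[str]:
        return self.nodes[0] if self.nodes else None

    def needs_replica(self) -> bool:
        """True while the chain holds fewer replicas than desired."""
        return len(self.nodes) < self.desired_replicas

    def node_lost(self, node: str) -> None:
        """Remove a failed or partitioned node from the chain.

        If the lost node was the master, the next node becomes master
        (an assumption; the claims only say a new master is designated).
        """
        if node in self.nodes:
            self.nodes.remove(node)

    def node_returned(self, node: str) -> bool:
        """Reinsert a returning node at the end if a slot is still open."""
        if node not in self.nodes and self.needs_replica():
            self.nodes.append(node)
            return True
        return False

    def replace_if_overdue(self, seconds_gone: float,
                           replace_after: float, spare: str) -> bool:
        """After the longer interval, let a spare node take the open slot."""
        if (seconds_gone >= replace_after and self.needs_replica()
                and spare not in self.nodes):
            self.nodes.append(spare)
            return True
        return False


chain = ReplicationChain(["node-a", "node-b", "node-c"], desired_replicas=3)
chain.node_lost("node-a")                    # master lost, node-b takes over
rejoined = chain.node_returned("node-a")     # back in time: appended at the end
chain.node_lost("node-c")
replaced = chain.replace_if_overdue(3600.0, 1800.0, spare="node-d")
too_late = chain.node_returned("node-c")     # slot already filled
```

Appending a returning node at the end of the chain keeps the surviving replicas, which are already up to date, ahead of it in precedence.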

22. A distributed file system, comprising:
- a processor, said processor implementing a plurality of containers in which each container stores file and directory meta-data as well as file content data;
  
  wherein references to file content data are stored on a subset of nodes on which container meta-data and data are stored;
  
  a container location database (CLDB) configured to maintain information about where each of said plurality of containers is located;
  
  a plurality of cluster nodes, each cluster node containing one or more storage pools, each storage pool containing zero or more containers; and
  
  a plurality of inodes for structuring data within said containers;
  
  wherein said CLDB is configured to assign nodes as replicas of data in a container to meet policy constraints;
  
  each inode further comprising a composite data structure that contains attributes that describe various aspects of each object including any of owner, permissions, parent file identifier (FID), object type, and size;
  
  wherein object type comprises any of a local file, chunked file, directory, key-value store, symbolic link, or volume mount point;
  
  wherein said inode further comprises pointers to disk blocks that contain a first set of bytes of data in the object;
  
  wherein each of said pointers comprises an associated copy-on-write bit stored with said pointers;
  
  wherein said inode further comprises references to indirect data which, in the case of local files can also comprise a pointer to a B+ tree that contains the object data, along with a copy-on-write bit for that tree and, in the case of a chunked file, a pointer to a local file, referred to as a FID map, that contains FID's that refer to local files in other containers containing content of the file;
  
  wherein said inode further comprises a cache of a latest version number for any structure referenced from the inode; and
  
  wherein said version number is configured for use in replication and mirroring.
- Dependent claims (23-44):
  - 23. The distributed file system of claim 22, wherein said chunked file comprises a file that is made up of chunks stored in many containers, where each chunk is represented as a local file and references from a chunked file inode lead to an array of references to these local files.
  - 24. The distributed file system of claim 22, wherein said symbolic link is stored as a local file that contains the name of a file and can point to any distributed file system object.
  - 25. The distributed file system of claim 22, wherein a volume mount is stored as a local file that contains a name of a volume to be mounted.
  - 26. The distributed file system of claim 22, wherein container data and metadata are arranged to allow a topological sort to imply update order.
  - 27. The distributed file system of claim 22, wherein said CLDB assigns each container a master node that controls all transactions for that container.
  - 28. The distributed file system of claim 22, wherein said CLDB designates a chain of nodes to hold replicas.
  - 29. The distributed file system of claim 22, wherein when one of the replicas goes down or is separated from the master CLDB node, it is removed from the replication chain.
  - 30. The distributed file system of claim 22, wherein when the master goes down or is separated, a new master is designated.
  - 31. The map-reduce compatible distributed file system of claim 22, wherein any node that comes back after having been removed from the replication chain is reinserted at the end of the replication chain when the chain still needs another replication chain when the node returns.
  - 32. The distributed file system of claim 22, wherein when the node returns within a first predetermined interval, no new node to replicate the container in question has been designated and the chain still needs a replication chain.
  - 33. The distributed file system of claim 22, wherein when the node has been gone for a second, longer predetermined interval, the CLDB may designate some other node to take a place in the chain.
  - 34. The distributed file system of claim 22, further comprising:
    - a map-reduce compatible shuffle function, wherein each map function writes to the distributed file system and each reduce function reads input from the distributed file system.
  - 35. The distributed file system of claim 22, further comprising:
    - a plurality of volumes configured to facilitate any of control of data placement, creation of mirrors, and retention of control and policy information.
  - 36. The distributed file system of claim 22, further comprising:
    - a plurality of key-value stores configured to relate keys to data for any of directories, container location maps, and offset maps in compressed files.
  - 37. The distributed file system of claim 22, wherein said containers are replicated to other cluster nodes with one container designated as master for each replication chain.
  - 38. The distributed file system of claim 22, wherein said CLDB is maintained by a plurality of redundant servers;
    - and wherein data in the CLDB is itself stored as inodes in well-known containers.
  - 39. The distributed file system of claim 38, wherein said CLDB nodes are configured to maintain a database that contains at least the following information about all of said containers:
    - nodes that have replicas of a container; and
      
      an ordering of a replication chain for each container.
  - 40. The distributed file system of claim 22, wherein container master is configured to control updates to replication chains transactionally.
  - 41. The distributed file system of claim 22, wherein all inode data structures and indirect data b-trees comprise version numbers that facilitate updating container replicas that have missed transactions.
  - 42. The distributed file system of claim 22, wherein data is stored in the distributed file system on multiple block-addressable data stores that comprise block devices that represent any of entire disks, flash memory systems, partitions of either of these, and individual files stored in a conventional file system;
    - wherein each data store supports random reading and writing of relatively small, fixed-size blocks of data.
  - 43. The distributed file system of claim 22, further comprising:
    - a plurality of file identifiers (FID), each FID referring to an inode in a particular container, each FID comprising a container id, an inode number, and an integer chosen to make contents of the FID unique, even if an inode is reused for a different purpose.
  - 44. The distributed file system of claim 22, wherein said distributed file system is configured as a read-write access file system, wherein random updates and reads occur from any node in a cluster and/or from any device that has unfettered access to other devices in the cluster.
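
The file identifier recited in claims 20 and 43 combines a container id, an inode number, and an integer that keeps the FID unique when an inode is reused. A small sketch of that idea follows; the names `FID`, `InodeTable`, `allocate`, and `is_current`, and the choice of a per-slot counter as the uniquifying integer, are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FID:
    container_id: int
    inode_number: int
    uniquifier: int  # assumed here to be a counter bumped on each reuse


class InodeTable:
    """Tracks the current uniquifier per inode slot (hypothetical helper)."""

    def __init__(self, container_id: int) -> None:
        self.container_id = container_id
        self._uniquifiers: Dict[int, int] = {}

    def allocate(self, inode_number: int) -> FID:
        """Allocate or reuse an inode slot and return a fresh FID for it."""
        current = self._uniquifiers.get(inode_number, 0) + 1
        self._uniquifiers[inode_number] = current
        return FID(self.container_id, inode_number, current)

    def is_current(self, fid: FID) -> bool:
        """A FID is stale if its inode slot has since been reused."""
        return (fid.container_id == self.container_id
                and self._uniquifiers.get(fid.inode_number) == fid.uniquifier)


table = InodeTable(container_id=2049)
old_fid = table.allocate(17)   # first object stored in inode 17
new_fid = table.allocate(17)   # inode 17 reused for a different object
```

Because the two FIDs differ in their third field, a client holding `old_fid` can be told its reference is stale rather than silently reading the new object.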

45. A distributed file system, comprising:
- a processor, said processor implementing a plurality of containers in which each container stores file and directory meta-data as well as file content data;
  
  wherein references to file content data are stored on a subset of nodes on which container meta-data and data are stored;
  
  a container location database (CLDB) configured to maintain information about where each of said plurality of containers is located;
  
  a plurality of cluster nodes, each cluster node containing one or more storage pools, each storage pool containing zero or more containers;
  
  a plurality of inodes for structuring data within said containers wherein said CLDB is configured to assign nodes as replicas of data in a container to meet policy constraints; and
  
  wherein said distributed file system is configured for stateless access;
  
  a plurality of NFS gateways;
  
  wherein said distributed file system is configured for access via NFS network protocols; and
  
  a coordination server by which said NFS gateways cooperatively decide which of said NFS gateways host which IP addresses;
  
  wherein all file names accessed via the distributed file system start with a common prefix followed by a cluster name and a name of a file within said cluster; and
  
  wherein said NFS gateways are configured to populate a top-level virtual directory associated with said common prefix with virtual files corresponding to each accessible cluster.
- Dependent claims (46-65):
  - 46. The distributed file system of claim 45, all NFS servers can access all files in the distributed file system.
  - 47. The distributed file system of claim 45, wherein container data and metadata are arranged to allow a topological sort to imply update order.
  - 48. The distributed file system of claim 45, wherein said CLDB assigns each container a master node that controls all transactions for that container.
  - 49. The distributed file system of claim 45, wherein said CLDB designates a chain of nodes to hold replicas.
  - 50. The distributed file system of claim 45, wherein when one of the replicas goes down or is separated from the master CLDB node, it is removed from the replication chain.
  - 51. The distributed file system of claim 45, wherein when the master goes down or is separated, a new master is designated.
  - 52. The map-reduce compatible distributed file system of claim 45, wherein any node that comes back after having been removed from the replication chain is reinserted at the end of the replication chain when the chain still needs another replication chain when the node returns.
  - 53. The distributed file system of claim 45, wherein when the node returns within a first predetermined interval, no new node to replicate the container in question has been designated and the chain still needs a replication chain.
  - 54. The distributed file system of claim 45, wherein when the node has been gone for a second, longer predetermined interval, the CLDB may designate some other node to take a place in the chain.
  - 55. The distributed file system of claim 45, further comprising:
    - a map-reduce compatible shuffle function, wherein each map function writes to the distributed file system and each reduce function reads input from the distributed file system.
  - 56. The distributed file system of claim 45, further comprising:
    - a plurality of volumes configured to facilitate any of control of data placement, creation of mirrors, and retention of control and policy information.
  - 57. The distributed file system of claim 45, further comprising:
    - a plurality of key-value stores configured to relate keys to data for any of directories, container location maps, and offset maps in compressed files.
  - 58. The distributed file system of claim 45, wherein said containers are replicated to other cluster nodes with one container designated as master for each replication chain.
  - 59. The distributed file system of claim 58, wherein said CLDB nodes are configured to maintain a database that contains at least the following information about all of said containers:
    - nodes that have replicas of a container; and
      
      an ordering of a replication chain for each container.
  - 60. The distributed file system of claim 45, wherein said CLDB is maintained by a plurality of redundant servers;
    - and wherein data in the CLDB is itself stored as inodes in well-known containers.
  - 61. The distributed file system of claim 45, wherein container master is configured to control updates to replication chains transactionally.
  - 62. The distributed file system of claim 45, wherein all inode data structures and indirect data b-trees comprise version numbers that facilitate updating container replicas that have missed transactions.
  - 63. The distributed file system of claim 45, wherein data is stored in the distributed file system on multiple block-addressable data stores that comprise block devices that represent any of entire disks, flash memory systems, partitions of either of these, and individual files stored in a conventional file system;
    - wherein each data store supports random reading and writing of relatively small, fixed-size blocks of data.
  - 64. The distributed file system of claim 45, further comprising:
    - a plurality of file identifiers (FID), each FID referring to an inode in a particular container, each FID comprising a container id, an inode number, and an integer chosen to make contents of the FID unique, even if an inode is reused for a different purpose.
  - 65. The distributed file system of claim 45, wherein said distributed file system is configured as a read-write access file system, wherein random updates and reads occur from any node in a cluster and/or from any device that has unfettered access to other devices in the cluster.
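
Claims 18, 41, and 62 state that version numbers on inode structures and indirect-data b-trees facilitate updating replicas that have missed transactions. One way to picture this is a version-based delta: the replica reports the last version it saw, and only newer structures are resent. The sketch below assumes a single monotonically increasing version per container and a dictionary as the store; both are simplifications introduced here, and deletions are not modelled.

```python
from typing import Dict, Tuple

# Structure id -> (version, payload). A stand-in for inodes and b-tree nodes.
Store = Dict[str, Tuple[int, bytes]]


def changes_since(master: Store, last_seen_version: int) -> Store:
    """Select the structures whose version is newer than the replica saw."""
    return {key: entry for key, entry in master.items()
            if entry[0] > last_seen_version}


def catch_up(replica: Store, master: Store, last_seen_version: int) -> int:
    """Apply missed updates to the replica; return its new high-water mark."""
    delta = changes_since(master, last_seen_version)
    replica.update(delta)
    return max([last_seen_version] + [entry[0] for entry in delta.values()])


master_store: Store = {
    "inode:17": (5, b"a"),
    "inode:18": (9, b"b"),
    "btree:3": (12, b"c"),
}
# The replica was offline after version 7 and missed two updates.
replica_store: Store = {"inode:17": (5, b"a"), "inode:18": (7, b"old")}
delta = changes_since(master_store, 7)
new_mark = catch_up(replica_store, master_store, 7)
```

Only the two structures changed after version 7 are transferred, so a briefly absent replica does not need a full copy of the container.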

66. A distributed file system comprising:
- a processor, said processor implementing a plurality of storage pools that bind raw block stores together and that provide a storage mechanism for containers and transaction logs;
  
  said processor implementing a plurality of containers configured for any of data replication, relocation, and transactional updates; and
  
  a container location database configured to locate specific containers within a plurality of file servers, and with which precedence among replicas of containers is defined to organize transactional updates of container contents;
  
  a plurality of inodes for structuring data within said containers, each inode further comprising a composite data structure that contains attributes that describe various aspects of each object including any of owner, permissions, parent container id (CID), object type, and size;
  
  wherein object type comprises any of a local file, chunked file, directory, key-value store, symbolic link, or volume mount point;
  
  wherein said inode further comprises pointers to disk blocks that contain a first set of bytes of data in the object;
  
  wherein each of said pointers comprises an associated copy-on-write bit stored with said pointers;
  
  wherein said inode further comprises references to indirect data which, in the case of local files can also comprise a pointer to a B+ tree that contains the object data, along with a copy-on-write bit for that tree and, in the case of a chunked file, a pointer to a local file, referred to as a FID map, that contains FID's that refer to local files in other containers containing content of the file;
  
  wherein said inode further comprises a cache of a latest version number for any structure referenced from the inode; and
  
  wherein said version number is configured for use in replication and mirroring.
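
The inode structure recited in claims 22 and 66 can be sketched as a small data type with a copy-on-write bit on each block pointer. The field names, the `write_block` helper, the dictionary standing in for a disk, and the way the copy-on-write bit is consumed on a write are assumptions for illustration; the claims state only that each pointer carries such a bit and that the inode caches a latest version number.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class ObjectType(Enum):
    LOCAL_FILE = 1
    CHUNKED_FILE = 2
    DIRECTORY = 3
    KEY_VALUE_STORE = 4
    SYMBOLIC_LINK = 5
    VOLUME_MOUNT_POINT = 6


@dataclass
class BlockPointer:
    block: int
    copy_on_write: bool = False  # set when the block must not be overwritten


@dataclass
class Inode:
    owner: str
    permissions: int
    object_type: ObjectType
    size: int = 0
    # Pointers to the blocks holding the first bytes of the object.
    direct: List[BlockPointer] = field(default_factory=list)
    # Cached latest version of any structure referenced from this inode.
    latest_version: int = 0


def write_block(inode: Inode, index: int, data: bytes,
                disk: Dict[int, bytes], next_free: int) -> int:
    """Write one block, redirecting to a fresh block if the pointer is
    marked copy-on-write. Returns the next free block number."""
    pointer = inode.direct[index]
    if pointer.copy_on_write:
        # Leave the old block untouched and point the inode at a new one.
        pointer = BlockPointer(next_free, copy_on_write=False)
        inode.direct[index] = pointer
        next_free += 1
    disk[pointer.block] = data
    inode.latest_version += 1
    return next_free


disk: Dict[int, bytes] = {10: b"old"}
inode = Inode("alice", 0o644, ObjectType.LOCAL_FILE,
              size=3, direct=[BlockPointer(10, copy_on_write=True)])
next_free = write_block(inode, 0, b"new", disk, next_free=100)
```

In this sketch the first write after the bit is set lands in block 100 and leaves block 10 intact, while later writes to the same pointer go in place.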

67. A distributed file system comprising:
- a processor, said processor implementing a plurality of storage pools that bind raw block stores together and that provide a storage mechanism for containers and transaction logs;
  
  said processor implementing a plurality of containers configured for any of data replication, relocation, and transactional updates; and
  
  a container location database configured to locate specific containers within a plurality of file servers, and with which precedence among replicas of containers is defined to organize transactional updates of container contents;
  
  wherein said distributed file system is configured for stateless access;
  
  a plurality of NFS gateways;
  
  wherein said distributed file system is configured for access via NFS network protocols; and
  
  a coordination server by which said NFS gateways cooperatively decide which of said NFS gateways host which IP addresses;
  
  wherein all file names accessed via the distributed file system start with a common prefix followed by a cluster name and a name of a file within said cluster; and
  
  wherein said NFS gateways are configured to populate a top-level virtual directory associated with said common prefix with virtual files corresponding to each accessible cluster.
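
The naming scheme in claims 45 and 67 (a common prefix, then a cluster name, then a file name within that cluster) can be illustrated with a short path-splitting sketch. The prefix value `/dfs`, the function names, and the cluster names are hypothetical; the claims do not specify them.

```python
from typing import List, Tuple

COMMON_PREFIX = "/dfs"  # hypothetical; the claims do not name the prefix


def split_path(path: str, prefix: str = COMMON_PREFIX) -> Tuple[str, str]:
    """Split '<prefix>/<cluster>/<file>' into (cluster, path in cluster)."""
    if path != prefix and not path.startswith(prefix + "/"):
        raise ValueError(f"{path!r} is not under {prefix!r}")
    rest = path[len(prefix):].strip("/")
    if not rest:
        raise ValueError("path names the virtual top-level directory")
    cluster, _, within = rest.partition("/")
    return cluster, "/" + within


def list_top_level(accessible_clusters: List[str]) -> List[str]:
    """Entries a gateway would show in the virtual top-level directory:
    one virtual entry per accessible cluster."""
    return sorted(accessible_clusters)


cluster, within = split_path("/dfs/cluster1/user/alice/data.csv")
entries = list_top_level(["cluster2", "cluster1"])
```

Since the cluster name is part of every path, a single gateway can route requests for several clusters from one mount point.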

Current Assignee: Hewlett Packard Enterprise Development LP (Hewlett-Packard Enterprise Company)
Original Assignee: MapR Technologies, Inc. (Hewlett-Packard Enterprise Company)
Inventors: Srivas, Mandayam C.; Ravindra, Pindikura; Saradhi, Uppaluri Vijaya; Pande, Arvind Arun; Sanapala, Chandra Guru Kiran Babu; Renu, Lohit Vijaya; Kavacheri, Sathya; Hadke, Amit; Vellanki, Vivekanand
Primary Examiner(s): Arjomandi, Noosha
Application Number: US15/135,311
Publication Number: US 20160239514A1
Time in Patent Office: 383 Days

CPC Class Codes

G06F 16/10   File systems; File servers

G06F 16/128   Details of file system snap...

G06F 16/178   Techniques for file synchro...

G06F 16/182   Distributed file systems

G06F 16/1844   Management specifically ada...

G06F 16/1865   Transactional file systems

G06F 16/22   Indexing; Data structures t...

G06F 16/2246   Trees, e.g. B+trees

G06F 16/23   Updating

G06F 16/235   Update request formulation

G06F 16/2365   Ensuring data consistency a...

G06F 16/27   Replication, distribution o...

G06F 16/273   Asynchronous replication or...

G06F 16/275   Synchronous replication

G06F 8/658   Incremental updates; Differ...

H04L 65/102   Gateways arrangements for c...
