Transferring and caching a cloud file in a distributed filesystem
Abstract
The disclosed embodiments provide techniques for transferring and caching a cloud file in a cloud controller. Two or more cloud controllers collectively manage distributed filesystem data that is stored in one or more cloud storage systems; the cloud controllers cache and ensure data consistency for the stored data. During operation, a cloud controller receives a client request for a data block of a target file that is stored in the distributed filesystem but not currently cached in the cloud controller. The cloud controller initiates a request to a cloud storage system for a cloud file containing the requested data block. While receiving the cloud file from the cloud storage system, the cloud controller uses a set of block metadata in the portion of the cloud file that has already been received to determine the portions of the cloud file that should be downloaded to and cached in the cloud controller.
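The step in which block metadata is read from the already-received portion of a cloud file can be illustrated with a minimal sketch. It is not the patented implementation: the on-disk layout (a 4-byte big-endian length prefix followed by a JSON index and then the block payloads) and the names `build_cloud_file`, `try_extract_index` and `receive_until_index` are assumptions made purely for illustration.

```python
import json
import struct

# Hypothetical layout: [4-byte index length][JSON metadata index][block payloads]
HEADER = struct.Struct(">I")


def build_cloud_file(blocks):
    """Pack (file_id, block_no, payload) tuples into the assumed layout."""
    index, offset = [], 0
    for file_id, block_no, payload in blocks:
        index.append({"file": file_id, "block": block_no,
                      "offset": offset, "length": len(payload)})
        offset += len(payload)
    index_bytes = json.dumps(index).encode()
    body = b"".join(payload for _, _, payload in blocks)
    return HEADER.pack(len(index_bytes)) + index_bytes + body


def try_extract_index(received):
    """Return the metadata index, or None if too little has arrived so far."""
    if len(received) < HEADER.size:
        return None
    (index_len,) = HEADER.unpack_from(received)
    end = HEADER.size + index_len
    if len(received) < end:
        return None
    return json.loads(received[HEADER.size:end])


def receive_until_index(chunks):
    """Consume transfer chunks and stop as soon as the index is readable."""
    received = bytearray()
    for chunk in chunks:
        received.extend(chunk)
        index = try_extract_index(bytes(received))
        if index is not None:
            return index, len(received)
    raise ValueError("stream ended before the metadata index was complete")
```

In this sketch the caller learns the offset and length of every block in the cloud file after only the first few chunks have arrived, which is the point at which a controller could decide which remaining portions are worth downloading and caching.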
20 Claims
1. A computer-implemented method for transferring and caching a cloud file in a distributed filesystem, the method comprising:
collectively managing the data of the distributed filesystem using two or more cloud controllers, wherein the cloud controllers cache and ensure data consistency for data stored in a cloud storage system, wherein each cloud controller maintains a metadata hierarchy that reflects the current state of the distributed filesystem, wherein changes to the metadata for the distributed filesystem are communicated to the set of cloud controllers for the distributed filesystem to ensure that the clients of the distributed filesystem share a consistent view of the files in the distributed filesystem, wherein the cloud storage system stores groups of multiple data blocks for the distributed filesystem in cloud files, wherein a cloud file comprises a set of data blocks from multiple distinct distributed filesystem files and a metadata index that describes the set of data blocks stored in the cloud file;
receiving at a cloud controller a request from a client for a data block of a target file in the distributed filesystem, wherein the requested data block is not currently cached in the cloud controller;
initiating a transfer for a cloud file containing the requested data block from the cloud storage system to the cloud controller, wherein the metadata hierarchy facilitates identifying which cloud file contains the requested data block of the target file but the metadata hierarchy does not include a reverse mapping that facilitates quickly determining the location of other data blocks in the cloud file in the metadata hierarchy of the distributed filesystem;
while the cloud file has not yet completely been downloaded from the cloud storage system to the cloud controller, already on the cloud controller extracting the metadata index for the cloud file from an initial portion of the cloud file that has already been downloaded to the cloud controller; and
while portions of the cloud file are still being downloaded to the cloud controller:
using the metadata index and the metadata hierarchy to determine whether other data blocks in the cloud file are likely to be accessed in a substantially similar timeframe as the requested data block; and
downloading from the cloud storage system to the cloud controller a limited subset of data blocks from the cloud file that include (1) the requested data block;
(2) the other data blocks from the cloud file that have been determined to be likely to be accessed; and
(3) any blocks of the cloud file that are needed to decrypt the portions of the cloud file containing (1) and (2), wherein not downloading and caching the entire cloud file reduces bandwidth usage and improves cache access performance for the distributed filesystem.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A non-transitory computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for transferring and caching a cloud file in a distributed filesystem, the method comprising:
collectively managing the data of the distributed filesystem using two or more cloud controllers, wherein the cloud controllers cache and ensure data consistency for data stored in a cloud storage system, wherein each cloud controller maintains a metadata hierarchy that reflects the current state of the distributed filesystem, wherein changes to the metadata for the distributed filesystem are communicated to the set of cloud controllers for the distributed filesystem to ensure that the clients of the distributed filesystem share a consistent view of the files in the distributed filesystem, wherein the cloud storage system stores groups of multiple data blocks for the distributed filesystem in cloud files, wherein a cloud file comprises a set of data blocks from multiple distinct distributed filesystem files and a metadata index that describes the set of data blocks stored in the cloud file;
receiving at a cloud controller a request from a client for a data block of a target file in the distributed filesystem, wherein the requested data block is not currently cached in the cloud controller;
initiating a transfer for a cloud file containing the requested data block from the cloud storage system to the cloud controller, wherein the metadata hierarchy facilitates identifying which cloud file contains the requested data block of the target file but the metadata hierarchy does not include a reverse mapping that facilitates quickly determining the location of other data blocks in the cloud file in the metadata hierarchy of the distributed filesystem;
while the cloud file has not yet completely been downloaded from the cloud storage system to the cloud controller, already on the cloud controller extracting the metadata index for the cloud file from an initial portion of the cloud file that has already been downloaded to the cloud controller; and
while portions of the cloud file are still being downloaded to the cloud controller:
using the metadata index and the metadata hierarchy to determine whether other data blocks in the cloud file are likely to be accessed in a substantially similar timeframe as the requested data block; and
downloading from the cloud storage system to the cloud controller a limited subset of data blocks from the cloud file that include (1) the requested data block;
(2) the other data blocks from the cloud file that have been determined to be likely to be accessed; and
(3) any blocks of the cloud file that are needed to decrypt the portions of the cloud file containing (1) and (2), wherein not downloading and caching the entire cloud file reduces bandwidth usage and improves cache access performance for the distributed filesystem.
Dependent claims: 13, 14, 15, 16, 17, 18, 19
20. A cloud controller that transfers and caches a cloud file in a distributed filesystem, comprising:
a processor;
a storage mechanism that stores metadata for the distributed filesystem; and
a storage management mechanism;
wherein two or more cloud controllers collectively manage the data of the distributed filesystem, wherein the cloud controllers cache and ensure data consistency for data stored in a cloud storage system, wherein each cloud controller maintains a metadata hierarchy that reflects the current state of the distributed filesystem, wherein changes to the metadata for the distributed filesystem are communicated to the set of cloud controllers for the distributed filesystem to ensure that the clients of the distributed filesystem share a consistent view of the files in the distributed filesystem, wherein the cloud storage system stores groups of multiple data blocks for the distributed filesystem in cloud files, wherein a cloud file comprises a set of data blocks from multiple distinct distributed filesystem files and a metadata index that describes the set of data blocks stored in the cloud file;
wherein the cloud controller is configured to receive a request from a client for a data block of a target file in the distributed filesystem, wherein the requested data block is not currently cached in the cloud controller;
wherein the storage management mechanism is configured to initiate a transfer for a cloud file containing the requested data block from the cloud storage system to the cloud controller, wherein the metadata hierarchy facilitates identifying which cloud file contains the requested data block of the target file but the metadata hierarchy does not include a reverse mapping that facilitates quickly determining the location of other data blocks in the cloud file in the metadata hierarchy of the distributed filesystem;
wherein, while the cloud file has not yet completely been downloaded from the cloud storage system to the cloud controller, the cloud controller is configured to already extract the metadata index for the cloud file from an initial portion of the cloud file that has already been downloaded to the cloud controller; and
wherein, while portions of the cloud file are still being transferred to the cloud controller, the storage management mechanism is further configured to:
use the metadata index and the metadata hierarchy to determine whether other data blocks in the cloud file are likely to be accessed in a substantially similar timeframe as the requested data block; and
download from the cloud storage system to the cloud controller a limited subset of data blocks from the cloud file that include (1) the requested data block;
(2) the other data blocks from the cloud file that have been determined to be likely to be accessed; and
(3) any blocks of the cloud file that are needed to decrypt the portions of the cloud file containing (1) and (2), wherein not downloading and caching the entire cloud file reduces bandwidth usage and improves cache access performance for the distributed filesystem.
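The three-part "limited subset" recited in the claims (the requested block, blocks likely to be accessed soon, and blocks needed for decryption) can be sketched as a selection over the metadata index. Everything specific here is a labeled assumption rather than something the claims state: the `Entry` record, the heuristic that other blocks of the same target file are "likely to be accessed", and the rule that decrypting a block requires the block stored immediately before it in the cloud file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One hypothetical metadata-index record for a block in a cloud file."""
    file: str     # distributed-filesystem file the block belongs to
    block: int    # block number within that file
    offset: int   # byte offset of the block inside the cloud file
    length: int   # block length in bytes


def select_positions(index, target_file, requested_block):
    """Pick index positions for parts (1), (2) and (3) of the limited subset."""
    # (1) + (2): assumed heuristic -- every block of the target file is "likely".
    wanted = {pos for pos, e in enumerate(index) if e.file == target_file}
    if not any(index[pos].block == requested_block for pos in wanted):
        raise KeyError("requested block is not in this cloud file")
    # (3): assumed dependency -- decryption needs the immediately preceding block.
    wanted |= {pos - 1 for pos in wanted if pos > 0}
    return sorted(wanted)


def to_byte_ranges(index, positions):
    """Coalesce adjacent blocks into (start, end) byte ranges, end exclusive."""
    ranges = []
    for pos in positions:
        start = index[pos].offset
        end = start + index[pos].length
        if ranges and ranges[-1][1] == start:
            ranges[-1] = (ranges[-1][0], end)   # extend the previous range
        else:
            ranges.append((start, end))
    return ranges
```

The resulting ranges are what a controller would actually fetch; any block outside them is skipped, which is where the bandwidth and cache savings described in the claims would come from under these assumptions.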
Specification