Transferring and caching a cloud file in a distributed filesystem

US 9,852,149 B1
Filed: 02/15/2013
Issued: 12/26/2017
Est. Priority Date: 05/03/2010
Status: Active Grant

Abstract

The disclosed embodiments disclose techniques for transferring and caching a cloud file in a cloud controller. Two or more cloud controllers collectively manage distributed filesystem data that is stored in one or more cloud storage systems; the cloud controllers cache and ensure data consistency for the stored data. During operation, a cloud controller receives a client request for a data block of a target file that is stored in the distributed filesystem but not currently cached in the cloud controller. The cloud controller initiates a request to a cloud storage system for a cloud file containing the requested data block. While receiving the cloud file from the cloud storage system, the cloud controller uses a set of block metadata in the portion of the cloud file that has already been received to determine the portions of the cloud file that should be downloaded to and cached in the cloud controller.
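
The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the cloud file layout (a 4-byte length prefix, a JSON metadata index, then block payloads) and every name here (`build_cloud_file`, `try_parse_index`, `ranges_to_fetch`) are assumptions made for the example. It shows a controller parsing the index from the part of the file that has already arrived and then choosing which byte ranges to keep.

```python
import json
import struct
from typing import Optional

def build_cloud_file(blocks):
    """Hypothetical layout: 4-byte big-endian index length, JSON index, payloads."""
    index, offset = [], 0
    for file_id, payload in blocks:
        index.append({"file_id": file_id, "offset": offset, "size": len(payload)})
        offset += len(payload)
    header = json.dumps(index).encode()
    return struct.pack(">I", len(header)) + header + b"".join(p for _, p in blocks)

def try_parse_index(received: bytes) -> Optional[tuple]:
    """Return (index, data_start) once the received prefix holds the whole index."""
    if len(received) < 4:
        return None
    (n,) = struct.unpack(">I", received[:4])
    if len(received) < 4 + n:
        return None
    return json.loads(received[4:4 + n]), 4 + n

def ranges_to_fetch(index, data_start, wanted_file_ids):
    """Absolute byte ranges [start, end) of blocks that belong to wanted files."""
    return [(data_start + e["offset"], data_start + e["offset"] + e["size"])
            for e in index if e["file_id"] in wanted_file_ids]

cloud_file = build_cloud_file([("f1", b"aaaa"), ("f2", b"bbbbbb"), ("f3", b"cc")])

# Simulate a download arriving in 8-byte chunks; stop as soon as the index parses.
received = b""
parsed = None
for i in range(0, len(cloud_file), 8):
    received += cloud_file[i:i + 8]
    parsed = try_parse_index(received)
    if parsed:
        break

index, data_start = parsed
ranges = ranges_to_fetch(index, data_start, {"f1", "f3"})
fetched = [cloud_file[s:e] for s, e in ranges]  # blocks for f1 and f3 only
```

In this sketch the block for `f2` is never copied into the cache; a real controller would also have to account for encryption and for how the storage service exposes partial reads.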

20 Claims

1. A computer-implemented method for transferring and caching a cloud file in a distributed filesystem, the method comprising:
- collectively managing the data of the distributed filesystem using two or more cloud controllers, wherein the cloud controllers cache and ensure data consistency for data stored in a cloud storage system, wherein each cloud controller maintains a metadata hierarchy that reflects the current state of the distributed filesystem, wherein changes to the metadata for the distributed filesystem are communicated to the set of cloud controllers for the distributed filesystem to ensure that the clients of the distributed filesystem share a consistent view of the files in the distributed filesystem, wherein the cloud storage system stores groups of multiple data blocks for the distributed filesystem in cloud files, wherein a cloud file comprises a set of data blocks from multiple distinct distributed filesystem files and a metadata index that describes the set of data blocks stored in the cloud file;
  
  receiving at a cloud controller a request from a client for a data block of a target file in the distributed filesystem, wherein the requested data block is not currently cached in the cloud controller;
  
  initiating a transfer for a cloud file containing the requested data block from the cloud storage system to the cloud controller, wherein the metadata hierarchy facilitates identifying which cloud file contains the requested data block of the target file but the metadata hierarchy does not include a reverse mapping that facilitates quickly determining the location of other data blocks in the cloud file in the metadata hierarchy of the distributed filesystem;
  
  while the cloud file has not yet completely been downloaded from the cloud storage system to the cloud controller, already on the cloud controller extracting the metadata index for the cloud file from an initial portion of the cloud file that has already been downloaded to the cloud controller; and
  
  while portions of the cloud file are still being downloaded to the cloud controller:
  
  using the metadata index and the metadata hierarchy to determine whether other data blocks in the cloud file are likely to be accessed in a substantially similar timeframe as the requested data block; and
  
  downloading from the cloud storage system to the cloud controller a limited subset of data blocks from the cloud file that include (1) the requested data block;
  
  (2) the other data blocks from the cloud file that have been determined to be likely to be accessed; and
  
  (3) any blocks of the cloud file that are needed to decrypt the portions of the cloud file containing (1) and (2), wherein not downloading and caching the entire cloud file reduces bandwidth usage and improves cache access performance for the distributed filesystem.
  - 2. The computer implemented method of claim 1, wherein the method further comprises:
    - collectively presenting a unified namespace for the distributed filesystem to the clients of the distributed filesystem via the two or more cloud controllers, wherein the clients access the distributed filesystem via the cloud controllers, wherein the file data for the distributed filesystem is stored in the cloud storage system, wherein cloud controllers cache a subset of the file data from the remote cloud storage system that is being actively accessed by each respective cloud controller's clients, wherein new file data received by each cloud controller from its clients is written to the remote cloud storage system; and
      
      wherein the block metadata stored in metadata index for the cloud file includes metadata for each data block that is stored in the cloud file; and
      
      wherein the block metadata includes unique file identifiers associated with each data block that is stored in the cloud file.
  - 3. The computer implemented method of claim 2, wherein the block metadata is located at the beginning of the cloud file and received by the cloud controller prior to the data blocks stored in the cloud file;
    - and wherein the cloud controller is configured to read and analyze the block metadata in parallel with downloading additional portions of the cloud file to determine the set of actions to be taken for the data blocks of the cloud file.
  - 4. The computer implemented method of claim 3, wherein the method further comprises:
    - determining from the metadata hierarchy a set of file identifiers for files and directories that are logically proximate to the target file in the distributed filesystem; and
      
      comparing the file identifiers for the logically proximate files and directories with the file identifiers that are associated with the data blocks and the block metadata that are stored in the metadata index for the cloud file to determine a subset of the data blocks from the cloud file are likely to benefit from being opportunistically cached in the cloud controller;
      
      wherein opportunistic caching techniques leverage spatial and temporal locality to improve access performance for the distributed filesystem.
  - 5. The computer implemented method of claim 4, wherein comparing the two sets of file identifiers further comprises:
    - determining from the comparisons between metadata downloaded in the cloud file and metadata stored in the cloud controller that a portion of the cloud file is composed of data blocks that are not likely to be needed in the cloud controller; and
      
      not downloading the portion of the cloud file containing the unneeded data blocks.
  - 6. The computer implemented method of claim 4, wherein the cloud file is serially encrypted such that all data blocks preceding any needed data blocks need to be downloaded to decrypt the needed data blocks;
    - and wherein comparing the two sets of file identifiers further comprises:
      
      determining an initial portion of the cloud file that is composed of data blocks that are not likely to be needed in the cloud controller;
      
      determining a subsequent portion of the cloud file that is composed of data blocks that are likely to be needed in the cloud controller, wherein the subsequent portion is located later in the cloud file than the initial portion; and
      
      downloading the initial portion of the cloud file to ensure that the subsequent portion of the cloud file can be serially decrypted; and
      
      upon determining that the remaining portion of the cloud file that follows the subsequent portion is not likely to be needed in the cloud controller, terminating the download of the remaining portion of the cloud file.
  - 7. The computer implemented method of claim 4, wherein an entry in the block metadata specifies:
    - a unique filename identifier for an associated data block in the cloud file;
      
      a compression technique used to compress the associated data block;
      
      the logical size of the associated data block;
      
      the physical size of the associated data block;
      
      a checksum for the associated data block;
      
      the checksum technique used to calculate the checksum; and
      
      the type of the checksum.
  - 8. The computer implemented method of claim 4, wherein determining a subset of the data blocks from the cloud file that will be opportunistically cached in the cloud controller comprises using a locality policy that specifies a level of pre-fetching and opportunistic caching for at least one of the cloud controller and the portion of the distributed filesystem that includes the target file.
  - 9. The computer implemented method of claim 4, wherein collectively managing the data of the distributed filesystem comprises:
    - upon receiving in a cloud controller new data from a client, sending from the cloud controller an incremental metadata snapshot for the new data, wherein the incremental metadata snapshot is received by the other cloud controllers of the distributed filesystem;
      
      storing the data for the distributed filesystem in one or more cloud storage systems, wherein the cloud controllers cache and ensure data consistency for data stored in the cloud storage systems; and
      
      sending an incremental data snapshot containing the new data from the cloud controller to the cloud storage system.
  - 10. The computer-implemented method of claim 9, wherein cloud files comprise logical storage volumes in the cloud storage system that store data and metadata for the distributed filesystem;
    - wherein the cloud controller manages the layout of data being written into cloud files to facilitate opportunistic caching for subsequent accesses; and
      
      wherein the cloud storage system is unaware of the organization and structure of the distributed filesystem.
  - 11. The computer-implemented method of claim 10, wherein data in the distributed filesystem is indexed using a global address space;
    - wherein each cloud file is uniquely indexed in the global address space; and
      
      wherein determining that the requested data block is not presently stored in the cloud controller comprises:
      
      accessing in the metadata for the distributed filesystem a metadata entry that is associated with the target file;
      
      determining from the metadata entry that the data block is not presently cached in the cloud controller;
      
      using a global address stored in the metadata entry to identify the cloud file; and
      
      using an offset stored in the metadata entry to determine the location of the data block in the cloud file.
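
The serial-encryption case in claim 6 can be illustrated with a short sketch. It assumes, for illustration only, that the metadata index yields a byte offset and stored size for each block; the `Block` type and the `plan_serial_download` function are hypothetical names, not taken from the patent. Because a serially encrypted file must be read from the start, the plan reduces to a single cutoff: download everything up to the end of the last needed block, then terminate the transfer.

```python
from dataclasses import dataclass

@dataclass
class Block:
    file_id: str  # hypothetical identifier taken from the metadata index
    offset: int   # byte offset of the block within the cloud file
    size: int     # stored size of the block in bytes

def plan_serial_download(blocks, needed_ids):
    """Return how many leading bytes of the cloud file to download.

    Assumes serial encryption: every byte that precedes a needed block must be
    downloaded so that the needed block can be decrypted. Returns 0 when no
    block is needed.
    """
    ends = [b.offset + b.size for b in blocks if b.file_id in needed_ids]
    return max(ends, default=0)

blocks = [
    Block("a", 0, 100),    # initial portion: not needed, but must be downloaded
    Block("b", 100, 50),   # needed
    Block("c", 150, 200),  # needed
    Block("d", 350, 80),   # remaining portion: download is terminated before it
]
cutoff = plan_serial_download(blocks, {"b", "c"})
```

Here `cutoff` is 350: block `a` is transferred only so that `b` and `c` can be decrypted, and the 80 bytes of block `d` are never requested.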

12. A non-transitory computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for transferring and caching a cloud file in a distributed filesystem, the method comprising:
- collectively managing the data of the distributed filesystem using two or more cloud controllers, wherein the cloud controllers cache and ensure data consistency for data stored in a cloud storage system, wherein each cloud controller maintains a metadata hierarchy that reflects the current state of the distributed filesystem, wherein changes to the metadata for the distributed filesystem are communicated to the set of cloud controllers for the distributed filesystem to ensure that the clients of the distributed filesystem share a consistent view of the files in the distributed filesystem, wherein the cloud storage system stores groups of multiple data blocks for the distributed filesystem in cloud files, wherein a cloud file comprises a set of data blocks from multiple distinct distributed filesystem files and a metadata index that describes the set of data blocks stored in the cloud file;
  
  receiving at a cloud controller a request from a client for a data block of a target file in the distributed filesystem, wherein the requested data block is not currently cached in the cloud controller;
  
  initiating a transfer for a cloud file containing the requested data block from the cloud storage system to the cloud controller, wherein the metadata hierarchy facilitates identifying which cloud file contains the requested data block of the target file but the metadata hierarchy does not include a reverse mapping that facilitates quickly determining the location of other data blocks in the cloud file in the metadata hierarchy of the distributed filesystem;
  
  while the cloud file has not yet completely been downloaded from the cloud storage system to the cloud controller, already on the cloud controller extracting the metadata index for the cloud file from an initial portion of the cloud file that has already been downloaded to the cloud controller; and
  
  while portions of the cloud file are still being downloaded to the cloud controller:
  
  using the metadata index and the metadata hierarchy to determine whether other data blocks in the cloud file are likely to be accessed in a substantially similar timeframe as the requested data block; and
  
  downloading from the cloud storage system to the cloud controller a limited subset of data blocks from the cloud file that include (1) the requested data block;
  
  (2) the other data blocks from the cloud file that have been determined to be likely to be accessed; and
  
  (3) any blocks of the cloud file that are needed to decrypt the portions of the cloud file containing (1) and (2), wherein not downloading and caching the entire cloud file reduces bandwidth usage and improves cache access performance for the distributed filesystem.
  - 13. The non-transitory computer-readable storage medium of claim 12, wherein the block metadata in the cloud file includes metadata for each data block that is stored in the cloud file;
    - and wherein the block metadata includes unique file identifiers associated with each data block that is stored in the cloud file.
  - 14. The non-transitory computer-readable storage medium of claim 13, wherein the block metadata is located at the beginning of the cloud file and received by the cloud controller prior to the data blocks stored in the cloud file;
    - and wherein the cloud controller is configured to read and analyze the block metadata in parallel with downloading additional portions of the cloud file to determine a set of actions to be taken for the data blocks of the cloud file.
  - 15. The non-transitory computer-readable storage medium of claim 14, wherein the method further comprises:
    - determining from the metadata hierarchy a set of file identifiers for files and directories that are logically proximate to the target file in the distributed filesystem; and
      
      comparing the file identifiers for the logically proximate files and directories with the file identifiers that are associated with the data blocks stored in the metadata index for the cloud file to determine a subset of the data blocks from the cloud file are likely to benefit from being opportunistically cached in the cloud controller;
      
      wherein opportunistic caching techniques leverage spatial and temporal locality to improve access performance for the distributed filesystem.
  - 16. The non-transitory computer-readable storage medium of claim 15, wherein comparing the two sets of file identifiers further comprises:
    - determining that a portion of the cloud file is composed of data blocks that are not likely to be needed in the cloud controller; and
      
      not downloading the portion of the cloud file containing the unneeded data blocks.
  - 17. The non-transitory computer-readable storage medium of claim 15, wherein the cloud file is serially encrypted such that all data blocks preceding any needed data blocks need to be downloaded to decrypt the needed data blocks;
    - and wherein comparing the two sets of file identifiers further comprises:
      
      determining that an end portion of the cloud file is composed of data blocks that are not likely to be needed in the cloud controller; and
      
      terminating the download of the end portion of the cloud file.
  - 18. The non-transitory computer-readable storage medium of claim 15, wherein an entry in the block metadata specifies:
    - a unique filename identifier for an associated data block in the cloud file;
      
      a compression technique used to compress the associated data block;
      
      the logical size of the associated data block;
      
      the physical size of the associated data block;
      
      a checksum for the associated data block;
      
      the checksum technique used to calculate the checksum; and
      
      the type of the checksum.
  - 19. The non-transitory computer-readable storage medium of claim 15, wherein determining a subset of the data blocks from the cloud file that will be opportunistically cached in the cloud controller comprises using a locality policy that specifies a level of pre-fetching and opportunistic caching for at least one of the cloud controller and the portion of the distributed filesystem that includes the target file.
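
The block metadata entry of claim 18 and the identifier comparison of claim 15 could be modeled along these lines. This is a sketch under stated assumptions: the field names, the use of zlib and CRC-32, the reading of "type of the checksum" as a bit width, and the helper names `make_entry` and `blocks_to_cache` are all illustrative choices, not details from the patent.

```python
import zlib
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockMetadata:
    file_id: str             # unique filename identifier for the block
    compression: str         # compression technique used for the block
    logical_size: int        # size before compression, in bytes
    physical_size: int       # size as stored in the cloud file, in bytes
    checksum: str            # checksum value for the stored block
    checksum_technique: str  # how the checksum was calculated
    checksum_type: str       # type of the checksum (interpretation assumed)

def make_entry(file_id: str, payload: bytes) -> BlockMetadata:
    """Build an index entry for one block (zlib and CRC-32 chosen for the example)."""
    stored = zlib.compress(payload)
    return BlockMetadata(file_id, "zlib", len(payload), len(stored),
                         format(zlib.crc32(stored), "08x"), "crc32", "32-bit")

def blocks_to_cache(index, target_id, proximate_ids):
    """Positions of index entries whose file is the target or logically near it."""
    wanted = set(proximate_ids) | {target_id}
    return [pos for pos, entry in enumerate(index) if entry.file_id in wanted]

index = [
    make_entry("id-17", b"alpha" * 10),  # target file
    make_entry("id-90", b"beta" * 10),   # unrelated file
    make_entry("id-18", b"gamma" * 10),  # file in the same directory as the target
]
# Identifiers near the target would come from the controller's metadata hierarchy.
selected = blocks_to_cache(index, "id-17", {"id-18", "id-19"})
```

Only the first and third entries are selected, so the block for the unrelated file `id-90` would not be opportunistically cached.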

20. A cloud controller that transfers and caches a cloud file in a distributed filesystem, comprising:
- a processor;
  
  a storage mechanism that stores metadata for the distributed filesystem; and
  
  a storage management mechanism;
  
  wherein two or more cloud controllers collectively manage the data of the distributed filesystem, wherein the cloud controllers cache and ensure data consistency for data stored in a cloud storage system, wherein each cloud controller maintains a metadata hierarchy that reflects the current state of the distributed filesystem, wherein changes to the metadata for the distributed filesystem are communicated to the set of cloud controllers for the distributed filesystem to ensure that the clients of the distributed filesystem share a consistent view of the files in the distributed filesystem, wherein the cloud storage system stores groups of multiple data blocks for the distributed filesystem in cloud files, wherein a cloud file comprises a set of data blocks from multiple distinct distributed filesystem files and a metadata index that describes the set of data blocks stored in the cloud file;
  
  wherein the cloud controller is configured to receive a request from a client for a data block of a target file in the distributed filesystem, wherein the requested data block is not currently cached in the cloud controller;
  
  wherein the storage management mechanism is configured to initiate a transfer for a cloud file containing the requested data block from the cloud storage system to the cloud controller, wherein the metadata hierarchy facilitates identifying which cloud file contains the requested data block of the target file but the metadata hierarchy does not include a reverse mapping that facilitates quickly determining the location of other data blocks in the cloud file in the metadata hierarchy of the distributed filesystem;
  
  wherein, while the cloud file has not yet completely been downloaded from the cloud storage system to the cloud controller, the cloud controller is configured to already extract the metadata index for the cloud file from an initial portion of the cloud file that has already been downloaded to the cloud controller; and
  
  wherein, while portions of the cloud file are still being transferred to the cloud controller, the storage management mechanism is further configured to:
  
  use the metadata index and the metadata hierarchy to determine whether other data blocks in the cloud file are likely to be accessed in a substantially similar timeframe as the requested data block; and
  
  download from the cloud storage system to the cloud controller a limited subset of data blocks from the cloud file that include (1) the requested data block;
  
  (2) the other data blocks from the cloud file that have been determined to be likely to be accessed; and
  
  (3) any blocks of the cloud file that are needed to decrypt the portions of the cloud file containing (1) and (2), wherein not downloading and caching the entire cloud file reduces bandwidth usage and improves cache access performance for the distributed filesystem.

Current Assignee: Panzura, Inc.
Original Assignee: Panzura, Inc.
Inventors: Taylor, John Richard; Chou, Randy Yen-pang; Davis, Andrew P.
Primary Examiner(s): Rahman, SM
Application Number: US13/769,213
Time in Patent Office: 1,775 Days
Field of Search: 709219

CPC Class Codes:

G06F 11/1435   using file system or storag...

G06F 11/2089   Redundant storage control f...

G06F 11/2094   Redundant storage or storag...

G06F 16/182   Distributed file systems

G06F 16/1844   Management specifically ada...

G06F 2201/85   Active fault masking withou...
