Performing deduplication in a distributed filesystem
First Claim
1. A computer-implemented method for performing distributed deduplication in a distributed filesystem, the method comprising:
collectively managing the data of the distributed filesystem using two or more cloud controllers, wherein collectively managing the data comprises:
storing the data for the distributed filesystem in one or more cloud storage systems, wherein the cloud controllers cache and ensure data consistency for data stored in the cloud storage systems; and
maintaining in each cloud controller a metadata hierarchy that reflects the current state of the distributed filesystem, wherein changes to the metadata for the distributed filesystem are synchronized across the cloud controllers, wherein synchronizing metadata includes synchronizing deduplication information across the cloud controllers, wherein each cloud controller maintains a deduplication table that tracks deduplicated data for the distributed filesystem; and
receiving at a cloud controller an incremental metadata snapshot from a remote cloud controller, wherein the incremental metadata snapshot comprises deduplication information for a file that was received by the remote cloud controller, wherein the cloud controller, the remote cloud controller and the cloud storage system are all distinct computing devices;
adding the deduplication information from the incremental metadata snapshot to the deduplication table on the cloud controller;
receiving at the cloud controller a client write request that comprises new file data;
using the deduplication table on the cloud controller to determine that one or more data blocks in the new file data have previously been written to the distributed filesystem by the remote cloud controller;
updating the metadata hierarchy and the deduplication table for the cloud controller to link the metadata for these duplicate new data blocks with the location of the deduplicated data blocks in the cloud storage system and cache of the cloud controller; and
distributing a subsequent incremental metadata update from the cloud controller to the other cloud controllers for the distributed filesystem that notifies the other cloud controllers of the new file data and includes deduplication updates related to the new file data that enable the other cloud controllers to update the reference counts and entries in their own respective deduplication tables to reflect the addition of the new file data to the distributed filesystem.
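The steps recited in this claim can be illustrated with a minimal sketch. It is not the patented implementation: the class and method names, the SHA-256 fingerprint, the fixed 4-byte block size, the dictionary-based snapshot format, and the in-memory `cloud` dictionary standing in for a cloud storage system are all assumptions made for illustration. Caching, consistency handling, and reference-count decrements on overwrite are omitted.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # hypothetical, tiny block size so the example is easy to follow


def fingerprint(block: bytes) -> str:
    """Content hash used as the deduplication key (SHA-256 is an assumption)."""
    return hashlib.sha256(block).hexdigest()


@dataclass
class DedupEntry:
    location: str      # hypothetical address of the block in cloud storage
    refcount: int = 0  # number of file references to this block


@dataclass
class CloudController:
    name: str
    dedup_table: dict = field(default_factory=dict)  # fingerprint -> DedupEntry
    metadata: dict = field(default_factory=dict)     # path -> list of fingerprints

    def apply_snapshot(self, snapshot: dict) -> None:
        """Add deduplication information from a remote incremental snapshot."""
        for path, blocks in snapshot["files"].items():
            self.metadata[path] = [fp for fp, _ in blocks]
            for fp, location in blocks:
                entry = self.dedup_table.setdefault(fp, DedupEntry(location))
                entry.refcount += 1

    def write(self, path: str, data: bytes, cloud: dict) -> dict:
        """Handle a client write and return the incremental update to distribute."""
        blocks = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = fingerprint(block)
            entry = self.dedup_table.get(fp)
            if entry is None:
                # Block not seen before: upload it and record its location.
                location = f"{self.name}/{fp[:8]}"
                cloud[location] = block
                entry = self.dedup_table[fp] = DedupEntry(location)
            # Duplicate or not, the new file now references this block.
            entry.refcount += 1
            blocks.append((fp, entry.location))
        self.metadata[path] = [fp for fp, _ in blocks]
        return {"files": {path: blocks}}


cloud = {}  # stand-in for a cloud storage system
a, b = CloudController("A"), CloudController("B")

snapshot = a.write("/x", b"aaaabbbb", cloud)  # A uploads two new blocks
b.apply_snapshot(snapshot)                    # B learns about A's blocks
update = b.write("/y", b"bbbbcccc", cloud)    # "bbbb" is recognized as a duplicate
a.apply_snapshot(update)                      # A updates its reference counts
```

In this run only three blocks reach `cloud`, because controller B finds the fingerprint of `bbbb` in its own table and links `/y` to the block that controller A already stored.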
Abstract
The disclosed embodiments provide techniques for performing deduplication for a distributed filesystem. Two or more cloud controllers collectively manage distributed filesystem data that is stored in one or more cloud storage systems; the cloud controllers cache and ensure data consistency for the stored data. During operation, a cloud controller receives an incremental metadata snapshot that references new data that was added to the distributed filesystem by a remote cloud controller. The cloud controller extracts a set of deduplication information from this incremental metadata snapshot. Upon receiving a subsequent client write request (e.g., a file write that includes one or more data blocks), the cloud controller uses the extracted deduplication information to determine that one or more data blocks in the client write request have already been written to the distributed filesystem.
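As a rough illustration of the check described in the abstract, the sketch below extracts block fingerprints from a snapshot and uses them to classify the blocks of a later write. The snapshot layout (a mapping from path to block hashes), the function names, the 4096-byte block size, and the use of SHA-256 are assumptions, not details taken from the disclosure.

```python
import hashlib


def block_hashes(data: bytes, block_size: int = 4096) -> list[str]:
    """Split data into fixed-size blocks and hash each one."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def extract_dedup_info(snapshot: dict) -> set[str]:
    """Collect every block hash mentioned in a (hypothetical) snapshot."""
    return {h for hashes in snapshot.values() for h in hashes}


def classify_write(data: bytes, known: set[str], block_size: int = 4096):
    """Return (duplicate_indices, new_indices) for the blocks of a write."""
    duplicates, new = [], []
    for index, h in enumerate(block_hashes(data, block_size)):
        (duplicates if h in known else new).append(index)
    return duplicates, new


# A remote controller reported a file made of an "A" block and a "B" block.
remote_file = b"A" * 4096 + b"B" * 4096
snapshot = {"/remote/report.doc": block_hashes(remote_file)}
known = extract_dedup_info(snapshot)

# A local client then writes a "B" block followed by a "C" block.
duplicates, new = classify_write(b"B" * 4096 + b"C" * 4096, known)
```

Here block 0 of the local write matches data already in the filesystem, so only block 1 would need to be uploaded.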
19 Claims
10. A non-transitory computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for performing distributed deduplication in a distributed filesystem, the method comprising:
collectively managing the data of the distributed filesystem using two or more cloud controllers, wherein collectively managing the data comprises:
storing the data for the distributed filesystem in one or more cloud storage systems, wherein the cloud controllers cache and ensure data consistency for data stored in the cloud storage systems; and
maintaining in each cloud controller a metadata hierarchy that reflects the current state of the distributed filesystem, wherein changes to the metadata for the distributed filesystem are synchronized across the cloud controllers, wherein synchronizing metadata includes synchronizing deduplication information across the cloud controllers, wherein each cloud controller maintains a deduplication table that tracks deduplicated data for the distributed filesystem; and
receiving at a cloud controller an incremental metadata snapshot from a remote cloud controller, wherein the incremental metadata snapshot comprises deduplication information for a file that was received by the remote cloud controller, wherein the cloud controller, the remote cloud controller and the cloud storage system are all distinct computing devices;
adding the deduplication information from the incremental metadata snapshot to the deduplication table on the cloud controller;
receiving at the cloud controller a client write request that comprises new file data;
using the deduplication table on the cloud controller to determine that one or more data blocks in the new file data have previously been written to the distributed filesystem by the remote cloud controller;
updating the metadata hierarchy and the deduplication table for the cloud controller to link the metadata for these duplicate new data blocks with the location of the deduplicated data blocks in the cloud storage system and cache of the cloud controller; and
distributing a subsequent incremental metadata update from the cloud controller to the other cloud controllers for the distributed filesystem that notifies the other cloud controllers of the new file data and includes deduplication updates related to the new file data that enable the other cloud controllers to update the reference counts and entries in their own respective deduplication tables to reflect the addition of the new file data to the distributed filesystem.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. A cloud controller that performs distributed deduplication for a distributed filesystem, comprising:
a processor;
a storage mechanism that stores metadata for the distributed filesystem; and
a storage management mechanism;
wherein two or more cloud controllers collectively manage the data of the distributed filesystem, wherein collectively managing the data comprises:
storing the data for the distributed filesystem in one or more cloud storage systems, wherein the storage management mechanisms of the cloud controllers are configured to cache and ensure data consistency for data stored in the cloud storage systems; and
maintaining in each cloud controller a metadata hierarchy that reflects the current state of the distributed filesystem, wherein changes to the metadata for the distributed filesystem are synchronized across the cloud controllers, wherein synchronizing metadata includes synchronizing deduplication information across the cloud controllers, wherein each cloud controller maintains a deduplication table that tracks deduplicated data for the distributed filesystem; and
wherein the storage management mechanism is further configured to:
receive an incremental metadata snapshot distributed by a remote cloud controller, wherein the incremental metadata snapshot comprises deduplication information for a file that was received by the remote cloud controller, wherein the cloud controller, the remote cloud controller and the cloud storage system are all distinct computing devices;
add the set of deduplication information from the incremental metadata snapshot to the deduplication table on the cloud controller;
receive at the cloud controller a client write request that comprises new file data;
use the deduplication table on the cloud controller to determine that one or more data blocks in the client write request have previously been written to the distributed filesystem by the remote cloud controller;
update the metadata hierarchy and the deduplication table for the cloud controller to link the metadata for these duplicate new data blocks with the location of the deduplicated data blocks in the cloud storage system and cache of the cloud controller; and
distribute a subsequent incremental metadata update from the cloud controller to the other cloud controllers for the distributed filesystem that notifies the other cloud controllers of the new file data and includes deduplication updates related to the new file data that enable the other cloud controllers to update the reference counts and entries in their own respective deduplication tables to reflect the addition of the new file data to the distributed filesystem.
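The final element, in which a controller distributes deduplication updates so that its peers can adjust their own tables, might look roughly like the following sketch. The `DedupUpdate` message layout, the `PeerTable` class, and the `distribute` helper are hypothetical names introduced here; the claim does not specify a wire format or a transport.

```python
from dataclasses import dataclass, field


@dataclass
class DedupUpdate:
    origin: str        # name of the controller that handled the write
    new_entries: dict  # fingerprint -> storage location of newly stored blocks
    ref_deltas: dict   # fingerprint -> change in reference count


@dataclass
class PeerTable:
    locations: dict = field(default_factory=dict)  # fingerprint -> location
    refcounts: dict = field(default_factory=dict)  # fingerprint -> count

    def apply(self, update: DedupUpdate) -> None:
        # Register blocks this controller has not seen before.
        for fp, location in update.new_entries.items():
            self.locations.setdefault(fp, location)
        # Adjust reference counts for every block the new file uses.
        for fp, delta in update.ref_deltas.items():
            if fp not in self.locations:
                raise KeyError(f"update references unknown block {fp}")
            self.refcounts[fp] = self.refcounts.get(fp, 0) + delta


def distribute(update: DedupUpdate, peers: dict) -> None:
    """Deliver the update to every controller except its origin."""
    for name, table in peers.items():
        if name != update.origin:
            table.apply(update)


# Three controllers that already agree on one stored block, "h1".
peers = {
    name: PeerTable({"h1": "cloud/obj-1"}, {"h1": 1}) for name in ("A", "B", "C")
}

# Controller B handles a write that reuses "h1" and stores a new block "h2".
update = DedupUpdate(
    origin="B",
    new_entries={"h2": "cloud/obj-2"},
    ref_deltas={"h1": 1, "h2": 1},
)
peers["B"].apply(update)   # B updates its own table first
distribute(update, peers)  # then notifies A and C
```

Sending reference-count deltas rather than absolute counts is a design choice of this sketch. It lets each peer apply the update without knowing the sender's full table, at the cost of requiring that every update be applied exactly once.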
Specification