Memory efficient sanitization of a deduplicated storage system
First Claim
1. A computer-implemented method for sanitizing a storage system, the method comprising:
for each of a plurality of files stored in a file system of the storage system, obtaining a list of fingerprints representing data chunks of the file from a checkpointed on disk fingerprint-to-container (FTC) index, wherein the data chunks are deduplicated data chunks, and wherein at least one data chunk is referenced by multiple files in the file system;
for each of the fingerprints, performing a lookup operation based on the fingerprint in a cache storing a plurality of cache entries, each mapping a fingerprint to a container identifier (ID) storing the corresponding data chunk and a chunk ID indicating a storage location of the data chunk within the container;
identifying a first container ID identifying a first container storing a data chunk corresponding to the fingerprint from a first cache entry matching the fingerprint, determining from the first cache entry a first chunk ID identifying a storage location of the first container in which the data chunk is stored, and in response to determining that the fingerprint is not found in the cache;
looking up the fingerprint in the FTC index to identify the first container ID storing the corresponding data chunk represented by the fingerprint;
reading, into the cache, metadata of the first container having the first container ID; and
looking up the first chunk ID, using the fingerprint, in the metadata of the first container having the first container ID;
populating a bit in a copy bit vector (CBV) based on the first container ID and the first chunk ID, the CBV including a plurality of bits and each storing a bit value indicating whether a data chunk is to be copied, wherein a data chunk with a corresponding bit having a predetermined bit value in the CBV is a live data chunk, wherein a live data chunk is referenced by at least one of the files in the file system;
after all of the bits corresponding to the fingerprints in the plurality of files have been populated in the CBV, copying live data chunks represented by the CBV from the first container to a second container; and
erasing records of the data chunks in the first container after the live data chunks of the first container indicated by the CBV have been copied to the second container to reclaim a storage space associated with the first container, including padding a predetermined data value in the first container, and releasing the first container back to a pool of free containers for future reuse.
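The lookup path recited above (cache first, then the FTC index and the container's metadata on a miss) can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the class name `ChunkLocator`, the plain dictionaries standing in for the on-disk FTC index and container metadata, the unbounded cache, and the choice to treat a chunk ID as the chunk's position in the metadata list are all assumptions made for the example.

```python
class ChunkLocator:
    """Resolve a fingerprint to (container ID, chunk ID), consulting a cache first."""

    def __init__(self, ftc_index, container_metadata):
        self.ftc_index = ftc_index                    # fingerprint -> container ID (stand-in for the on-disk index)
        self.container_metadata = container_metadata  # container ID -> list of fingerprints in storage order
        self.cache = {}                               # fingerprint -> (container ID, chunk ID); no eviction (assumption)
        self.metadata_reads = 0                       # counts how often container metadata had to be read

    def locate(self, fingerprint):
        entry = self.cache.get(fingerprint)
        if entry is not None:
            return entry                              # cache hit: both IDs come from the cache entry
        # Cache miss: the FTC index yields the container ID ...
        container_id = self.ftc_index[fingerprint]
        # ... then that container's metadata is read into the cache ...
        self.metadata_reads += 1
        for chunk_id, fp in enumerate(self.container_metadata[container_id]):
            self.cache[fp] = (container_id, chunk_id)
        # ... and the chunk ID is looked up by fingerprint in that metadata.
        return self.cache[fingerprint]


# Hypothetical data: two containers holding three deduplicated chunks.
ftc_index = {"fp_a": 0, "fp_b": 0, "fp_c": 1}
container_metadata = {0: ["fp_a", "fp_b"], 1: ["fp_c"]}
locator = ChunkLocator(ftc_index, container_metadata)
first = locator.locate("fp_b")   # miss: reads the metadata of container 0
second = locator.locate("fp_a")  # hit: container 0 metadata is already cached
```

Reading a whole container's metadata on a miss means later fingerprints stored in the same container resolve from the cache, so the FTC index is consulted less often.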
Abstract
Techniques for sanitizing a storage system are described herein. In one embodiment, for each file stored in the storage system, a list of fingerprints representing data chunks of the file is obtained. In such an embodiment, for each of the fingerprints, a first container storing a data chunk corresponding to the fingerprint is identified, and a storage location of the first container in which the data chunk is stored is determined. In one embodiment, a bit in a copy bit vector (CBV) is populated based on the identified container and the storage location. In one embodiment, after all of the bits corresponding to the data chunks of the first container have been populated in the CBV, data chunks represented by the CBV are copied from the first container to a second container, and records of the data chunks in the first container are erased.
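The final copy-and-erase phase described in the abstract can be illustrated with a small sketch. It assumes a container is modeled as a list of `bytearray` chunks and the relevant slice of the CBV as a list of booleans; the function name `sanitize_container`, the zero pad byte, and the list used as a free pool are hypothetical choices for the example, since the source only speaks of "a predetermined data value" and "a pool of free containers".

```python
PAD_BYTE = 0x00  # assumed pad value; the source does not name a specific value


def sanitize_container(first, cbv_bits, free_pool):
    """Copy live chunks out of `first`, overwrite it, and release it for reuse.

    first     -- list of bytearray chunks (the first container)
    cbv_bits  -- list of bools, one per chunk; True marks a live chunk
    free_pool -- list that receives the erased container
    Returns the second container, holding only the live chunks.
    """
    # Copy forward: only chunks whose CBV bit is set reach the second container.
    second = [bytes(chunk) for chunk, live in zip(first, cbv_bits) if live]
    # Erase: overwrite every chunk of the first container in place with the pad value.
    for chunk in first:
        chunk[:] = bytes([PAD_BYTE]) * len(chunk)
    # Release the erased container back to the free pool.
    free_pool.append(first)
    return second


# Hypothetical container with one dead chunk between two live ones.
first_container = [bytearray(b"live-1"), bytearray(b"dead"), bytearray(b"live-2")]
pool = []
second_container = sanitize_container(first_container, [True, False, True], pool)
```

The overwrite happens only after the copy completes, matching the ordering in the abstract: live data is preserved in the second container before anything in the first is destroyed.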
21 Claims
1. A computer-implemented method for sanitizing a storage system, the method comprising:
for each of a plurality of files stored in a file system of the storage system, obtaining a list of fingerprints representing data chunks of the file from a checkpointed on disk fingerprint-to-container (FTC) index, wherein the data chunks are deduplicated data chunks, and wherein at least one data chunk is referenced by multiple files in the file system;
for each of the fingerprints, performing a lookup operation based on the fingerprint in a cache storing a plurality of cache entries, each mapping a fingerprint to a container identifier (ID) storing the corresponding data chunk and a chunk ID indicating a storage location of the data chunk within the container;
identifying a first container ID identifying a first container storing a data chunk corresponding to the fingerprint from a first cache entry matching the fingerprint, determining from the first cache entry a first chunk ID identifying a storage location of the first container in which the data chunk is stored, and in response to determining that the fingerprint is not found in the cache;
looking up the fingerprint in the FTC index to identify the first container ID storing the corresponding data chunk represented by the fingerprint;
reading, into the cache, metadata of the first container having the first container ID; and
looking up the first chunk ID, using the fingerprint, in the metadata of the first container having the first container ID;
populating a bit in a copy bit vector (CBV) based on the first container ID and the first chunk ID, the CBV including a plurality of bits and each storing a bit value indicating whether a data chunk is to be copied, wherein a data chunk with a corresponding bit having a predetermined bit value in the CBV is a live data chunk, wherein a live data chunk is referenced by at least one of the files in the file system;
after all of the bits corresponding to the fingerprints in the plurality of files have been populated in the CBV, copying live data chunks represented by the CBV from the first container to a second container; and
erasing records of the data chunks in the first container after the live data chunks of the first container indicated by the CBV have been copied to the second container to reclaim a storage space associated with the first container, including padding a predetermined data value in the first container, and releasing the first container back to a pool of free containers for future reuse.
Dependent claims: 2, 3, 4, 5, 6, 7.
8. A non-transitory computer-readable medium having instructions stored therein, which when executed by a computer, cause the computer to perform operations, the operations comprising:
for each of a plurality of files stored in a file system of the storage system, obtaining a list of fingerprints representing data chunks of the file from a checkpointed on disk fingerprint-to-container (FTC) index, wherein the data chunks are deduplicated data chunks, and wherein at least one data chunk is referenced by multiple files in the file system;
for each of the fingerprints, performing a lookup operation based on the fingerprint in a cache storing a plurality of cache entries, each mapping a fingerprint to a container identifier (ID) storing the corresponding data chunk and a chunk ID indicating a storage location of the data chunk within the container,
identifying a first container ID identifying a first container storing a data chunk corresponding to the fingerprint from a first cache entry matching the fingerprint, determining from the first cache entry a first chunk ID identifying a storage location of the first container in which the data chunk is stored, and in response to determining that the fingerprint is not found in the cache;
looking up the fingerprint in the FTC index to identify the first container ID storing the corresponding data chunk represented by the fingerprint;
reading, into the cache, metadata of the first container having the first container ID; and
looking up the first chunk ID, using the fingerprint, in the metadata of the first container having the first container ID;
populating a bit in a copy bit vector (CBV) based on the first container ID and the first chunk ID, the CBV including a plurality of bits and each storing a bit value indicating whether a data chunk is to be copied, wherein a data chunk with a corresponding bit having a predetermined bit value in the CBV is a live data chunk, wherein a live data chunk is referenced by at least one of the files in the file system;
after all of the bits corresponding to the fingerprints in the plurality of files have been populated in the CBV, copying live data chunks represented by the CBV from the first container to a second container; and
erasing records of the data chunks in the first container after the live data chunks of the first container indicated by the CBV have been copied to the second container to reclaim a storage space associated with the first container, including padding a predetermined data value in the first container, and releasing the first container back to a pool of free containers for future reuse.
Dependent claims: 9, 10, 11, 12, 13, 14.
15. A data processing system, comprising:
a processor; and
a memory to store instructions, which when executed from the memory, cause the processor to perform operations, the operations including
for each of a plurality of files stored in a file system of the storage system, obtaining a list of fingerprints representing data chunks of the file from a checkpointed on disk fingerprint-to-container (FTC) index, wherein the data chunks are deduplicated data chunks, and wherein at least one data chunk is referenced by multiple files in the file system;
for each of the fingerprints, performing a lookup operation based on the fingerprint in a cache storing a plurality of cache entries, each mapping a fingerprint to a container identifier (ID) storing the corresponding data chunk and a chunk ID indicating a storage location of the data chunk within the container,
identifying a first container ID identifying a first container storing a data chunk corresponding to the fingerprint from a first cache entry matching the fingerprint, determining from the first cache entry a first chunk ID identifying a storage location of the first container in which the data chunk is stored, and in response to determining that the fingerprint is not found in the cache;
looking up the fingerprint in the FTC index to identify the first container ID storing the corresponding data chunk represented by the fingerprint;
reading, into the cache, metadata of the first container having the first container ID; and
looking up the first chunk ID, using the fingerprint, in the metadata of the first container having the first container ID;
populating a bit in a copy bit vector (CBV) based on the first container ID and the first chunk ID, the CBV including a plurality of bits and each storing a bit value indicating whether a data chunk is to be copied, wherein a data chunk with a corresponding bit having a predetermined bit value in the CBV is a live data chunk, wherein a live data chunk is referenced by at least one of the files in the file system;
after all of the bits corresponding to the fingerprints in the plurality of files have been populated in the CBV, copying live data chunks represented by the CBV from the first container to a second container; and
erasing records of the data chunks in the first container after the live data chunks of the first container indicated by the CBV have been copied to the second container to reclaim a storage space associated with the first container, including padding a predetermined data value in the first container, and releasing the first container back to a pool of free containers for future reuse.
Dependent claims: 16, 17, 18, 19, 20, 21.
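As an illustration of why a copy bit vector keeps memory use low, the sketch below stores one bit per chunk slot and addresses it by container ID and chunk ID. The claims do not specify how the two IDs map to a bit position, so the fixed `chunks_per_container` bound, the row-major layout, and the class and method names are assumptions made only for this example.

```python
class CopyBitVector:
    """One bit per (container ID, chunk ID) slot; a set bit marks a live chunk."""

    def __init__(self, num_containers, chunks_per_container):
        self.num_containers = num_containers
        self.chunks_per_container = chunks_per_container  # assumed fixed upper bound per container
        total_bits = num_containers * chunks_per_container
        self.bits = bytearray((total_bits + 7) // 8)      # all bits start cleared (not live)

    def _position(self, container_id, chunk_id):
        if not 0 <= container_id < self.num_containers:
            raise IndexError("container_id out of range")
        if not 0 <= chunk_id < self.chunks_per_container:
            raise IndexError("chunk_id out of range")
        return container_id * self.chunks_per_container + chunk_id

    def mark_live(self, container_id, chunk_id):
        pos = self._position(container_id, chunk_id)
        self.bits[pos // 8] |= 1 << (pos % 8)

    def is_live(self, container_id, chunk_id):
        pos = self._position(container_id, chunk_id)
        return bool(self.bits[pos // 8] & (1 << (pos % 8)))


# Hypothetical sizing: 4 containers with up to 1024 chunks each.
cbv = CopyBitVector(num_containers=4, chunks_per_container=1024)
cbv.mark_live(2, 17)  # chunk referenced by one file
cbv.mark_live(2, 17)  # same deduplicated chunk referenced by a second file: no extra state
```

Setting a bit is idempotent, so a chunk shared by many files costs the same single bit as a chunk referenced once, and the vector's size depends on the number of chunk slots rather than on the number of files or references.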
Specification