Global digests caching in a data deduplication system
First Claim
1. A method for utilizing a global digests cache in a deduplication process in a data deduplication system using a processor device in a computing environment, comprising:
partitioning input data into input data chunks, each of the input data chunks having a size of at least 16 Megabytes (MB);
calculating input digest values for each of the input data chunks;
finding positions of similar repository data in a repository of data for each of the input data chunks;
locating and loading repository digests of the similar repository data into the global digests cache, wherein the global digests cache contains, prior to the loading of the repository digests of the similar repository data, at least a plurality of digests previously loaded by a plurality of deduplication processes;
matching input digests of the input data chunks and the repository digests contained in the global digests cache for locating data matches; and
using the positions of the similar repository data to locate and linearly load into the global digests cache, digests and digest block boundaries of the similar repository data in a sequence corresponding to a placement order of calculated values of the digests of the similar repository data, the placement order of the calculated values of the digests of the similar repository data correlative to an order in which the input digest values were individually calculated such that the digests of the similar repository data are each individually stored in the global digests cache based on a calculation time and order of when each of the input digests were first calculated when in un-deduplicated form, thereby storing the digests of the similar repository data in a linear and sequential form independent of a deduplicated form by which data the digests describe is stored, wherein the global digest cache comprises a pool of a plurality of sequential arrays of digest entries of the digests and a hash table for pointing to contents within the plurality of sequential arrays.
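The cache structure recited at the end of the claim (a pool of sequential arrays of digest entries plus a hash table pointing into those arrays) can be illustrated with a minimal Python sketch. This is not the patented implementation: the names `DigestEntry`, `GlobalDigestsCache`, `load_interval`, `lookup` and `extend_match` are hypothetical, and representing a digest block boundary as an offset and length is an assumption made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DigestEntry:
    digest: bytes  # digest value of one repository digest block
    start: int     # block boundary (assumed here to be a byte offset)
    length: int    # block boundary (assumed here to be a size in bytes)


@dataclass
class GlobalDigestsCache:
    # Pool of sequential arrays: one array per loaded interval of similar
    # repository data, with entries kept in their original linear order.
    pool: list = field(default_factory=list)
    # Hash table: digest value -> list of (array index, position in array).
    index: dict = field(default_factory=dict)

    def load_interval(self, entries):
        """Linearly load one interval's digest entries, preserving order."""
        array_id = len(self.pool)
        array = list(entries)
        self.pool.append(array)
        for pos, entry in enumerate(array):
            self.index.setdefault(entry.digest, []).append((array_id, pos))
        return array_id

    def lookup(self, digest):
        """Return the (array index, position) pairs recorded for a digest."""
        return self.index.get(digest, [])

    def extend_match(self, array_id, pos, input_digests, i):
        """Count consecutive matching digests by walking the array forward."""
        array = self.pool[array_id]
        n = 0
        while (pos + n < len(array) and i + n < len(input_digests)
               and array[pos + n].digest == input_digests[i + n]):
            n += 1
        return n


# Toy usage with made-up digest values.
cache = GlobalDigestsCache()
cache.load_interval([
    DigestEntry(b"d1", 0, 4096),
    DigestEntry(b"d2", 4096, 4096),
    DigestEntry(b"d3", 8192, 2048),
])
hits = cache.lookup(b"d2")
run = cache.extend_match(0, 1, [b"x", b"d2", b"d3", b"zz"], 1)
```

In this sketch the hash table provides the first hit for an input digest, and the sequential array then lets a match be extended by a simple forward walk, which is one plausible reason for storing the digests in linear order rather than in their deduplicated layout.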
Abstract
For utilizing a global digests cache in deduplication processing in a data deduplication system using a processor device in a computing environment, input data is partitioned into data chunks and digest values are calculated for each of the data chunks. The positions of similar repository data are found in a repository of data for each of the data chunks. The repository digests of the similar repository data are located and loaded into the global digests cache. The global digests cache contains digests previously loaded by other deduplication processes. The input digests of the input data are matched with the repository digests contained in the global digests cache for locating data matches.
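The flow summarized in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the patent: SHA-1 over fixed-size blocks stands in for the unspecified digest calculation, the cache is reduced to a plain dictionary shared across calls, and `find_similar` and `load_digests` are hypothetical callables standing in for the similarity search and the repository digest lookup. The toy sizes are far smaller than the 16 MB chunks recited in the claims.

```python
import hashlib


def partition(data, chunk_size):
    """Split the input into consecutive chunks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def block_digests(chunk, block_size):
    """Digest each fixed-size block of a chunk (block scheme is assumed)."""
    return [hashlib.sha1(chunk[i:i + block_size]).digest()
            for i in range(0, len(chunk), block_size)]


def deduplicate(data, chunk_size, block_size, find_similar, load_digests, cache):
    """Return (chunk number, block number, repository offset) for each match.

    cache is a shared dict (digest -> repository offset) that may already
    hold digests loaded by other deduplication processes.
    """
    matches = []
    for chunk_no, chunk in enumerate(partition(data, chunk_size)):
        input_digests = block_digests(chunk, block_size)
        # Load digests of similar repository data into the shared cache.
        for position in find_similar(chunk):
            for digest, repo_offset in load_digests(position):
                cache.setdefault(digest, repo_offset)
        # Match input digests against everything now in the cache.
        for block_no, digest in enumerate(input_digests):
            if digest in cache:
                matches.append((chunk_no, block_no, cache[digest]))
    return matches


# Toy repository and stub lookups (all hypothetical).
REPOSITORY = b"AAAABBBBCCCCDDDD"
BLOCK = 4


def find_similar(chunk):
    return [0]  # stub: always report similar data at repository offset 0


def load_digests(position):
    return [(hashlib.sha1(REPOSITORY[i:i + BLOCK]).digest(), i)
            for i in range(position, len(REPOSITORY), BLOCK)]


# One digest is already in the cache, as if loaded by another process.
shared_cache = {hashlib.sha1(b"XXXX").digest(): 100}
matches = deduplicate(b"BBBBXXXXDDDDAAAA", 8, BLOCK,
                      find_similar, load_digests, shared_cache)
```

The block `XXXX` matches only because its digest was present in the cache before this run loaded anything, which is the benefit of sharing one cache among several deduplication processes.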
24 Claims
1. A method for utilizing a global digests cache in a deduplication process in a data deduplication system using a processor device in a computing environment, as recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A system for utilizing a global digests cache in a deduplication process for improving efficiency for internally reordered and high change rate workloads in a data deduplication system of a computing environment, the system comprising:
the data deduplication system;
the global digests cache in association with the data deduplication system;
a repository operating in the data deduplication system in communication with the global digests cache; and
at least one processor device operable in the computing storage environment for controlling the data deduplication system, wherein the at least one processor device:
partitions input data into input data chunks, each of the input data chunks having a size of at least 16 Megabytes (MB),
calculates input digest values for each of the input data chunks,
finds positions of similar repository data in a repository of data for each of the input data chunks,
locates and loads repository digests of the similar repository data into the global digests cache, wherein the global digests cache contains, prior to the loading of the repository digests of the similar repository data, at least a plurality of digests previously loaded by a plurality of deduplication processes,
matches input digests of the input data chunks and the repository digests contained in the global digests cache for locating data matches, and
uses the positions of the similar repository data to locate and linearly load into the global digests cache, digests and digest block boundaries of the similar repository data in a sequence corresponding to a placement order of calculated values of the digests of the similar repository data, the placement order of the calculated values of the digests of the similar repository data correlative to an order in which the input digest values were individually calculated such that the digests of the similar repository data are each individually stored in the global digests cache based on a calculation time and order of when each of the input digests were first calculated when in un-deduplicated form, thereby storing the digests of the similar repository data in a linear and sequential form independent of a deduplicated form by which data the digests describe is stored, wherein the global digest cache comprises a pool of a plurality of sequential arrays of digest entries of the digests and a hash table for pointing to contents within the plurality of sequential arrays.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A computer program product for utilizing a global digests cache in a deduplication process for improving efficiency for internally reordered and high change rate workloads in a data deduplication system using a processor device in a computing environment, the computer program product comprising a non-transitory computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising:
a first executable portion that partitions input data into input data chunks, each of the input data chunks having a size of at least 16 Megabytes (MB);
a second executable portion that calculates input digest values for each of the input data chunks;
a third executable portion that finds positions of similar repository data in a repository of data for each of the input data chunks;
a fourth executable portion that locates and loads repository digests of the similar repository data into the global digests cache, wherein the global digests cache contains, prior to the loading of the repository digests of the similar repository data, at least a plurality of digests previously loaded by a plurality of deduplication processes;
a fifth executable portion that matches input digests of the input data chunks and the repository digests contained in the global digests cache for locating data matches; and
a sixth executable portion that uses the positions of the similar repository data to locate and linearly load into the global digests cache, digests and digest block boundaries of the similar repository data in a sequence corresponding to a placement order of calculated values of the digests of the similar repository data, the placement order of the calculated values of the digests of the similar repository data correlative to an order in which the input digest values were individually calculated such that the digests of the similar repository data are each individually stored in the global digests cache based on a calculation time and order of when each of the input digests were first calculated when in un-deduplicated form, thereby storing the digests of the similar repository data in a linear and sequential form independent of a deduplicated form by which data the digests describe is stored, wherein the global digest cache comprises a pool of a plurality of sequential arrays of digest entries of the digests and a hash table for pointing to contents within the plurality of sequential arrays.
Dependent claims: 18, 19, 20, 21, 22, 23, 24
Specification