Tuning global digests caching in a data deduplication system

US 9,892,048 B2
Filed: 07/15/2013
Issued: 02/13/2018
Est. Priority Date: 07/15/2013
Status: Expired due to Fees

Abstract

Input data is partitioned into data chunks and digest values are calculated for each of the data chunks. The positions of similar repository data are found in a repository of data for each of the data chunks. The repository digests of the similar repository data are located and loaded into the global digests cache. The global digests cache contains digests previously loaded by other deduplication processes. The input digests of the input data are matched with the repository digests contained in the global digests cache for locating data matches. A sample of the repository digests is loaded into a search mechanism within the global digests cache.
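
The matching step in the abstract can be illustrated with a minimal sketch. It assumes fixed-size digest blocks and SHA-1 digests, neither of which is stated in the abstract, and the names `block_digests` and `find_matches` are hypothetical. The claims refer to digest block boundaries, so the actual block segmentation may differ.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for illustration only


def block_digests(data, block_size=BLOCK_SIZE):
    """Return one SHA-1 digest per fixed-size block of data, in order."""
    return [
        hashlib.sha1(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def find_matches(input_digests, cached_digests):
    """Return (input_index, cache_index) pairs for input digests found in the cache."""
    position = {}
    for idx, digest in enumerate(cached_digests):
        position.setdefault(digest, idx)  # keep the first occurrence
    return [
        (i, position[d]) for i, d in enumerate(input_digests) if d in position
    ]


# Toy data: the repository holds blocks A, B, C; the input holds B, X, C.
repository_data = b"A" * 4096 + b"B" * 4096 + b"C" * 4096
input_data = b"B" * 4096 + b"X" * 4096 + b"C" * 4096

cache = block_digests(repository_data)
matches = find_matches(block_digests(input_data), cache)
```

In this toy run the first and third input blocks match repository blocks, while the middle block has no match and would be stored as new data.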

38 Citations

24 Claims

1. A method for tuning the density of a global digests cache in a data deduplication system using a processor device in a computing environment, comprising:
- partitioning input data into input data chunks, each of the input data chunks having a size of at least 16 Megabytes (MB);
  
  wherein input digest values are calculated for each of the input data chunks;
  
  finding positions of similar repository data in a repository of data for each of the input data chunks;
  
  locating and loading repository digests of the similar repository data into the global digests cache, wherein the global digests cache contains, prior to the loading of the repository digests of the similar repository data, at least a plurality of digests previously loaded by a plurality of deduplication operations;
  
  loading a sample of the repository digests into a search mechanism within the global digests cache;
  
  applying the sampling of the repository digests for loading the repository digests into a hash table; and
  
  using the positions of the similar repository data to locate and linearly load into the global digests cache, digests and digest block boundaries of the similar repository data in a sequence corresponding to a placement order of calculated values of the digests of the similar repository data, the placement order of the calculated values of the digests of the similar repository data correlative to an order in which the input digest values were individually calculated such that the digests of the similar repository data are each individually stored in the global digests cache based on a calculation time and order of when each of the input digests were first calculated when in un-deduplicated form, thereby storing the digests of the similar repository data in a linear and sequential form independent of a deduplicated form by which data the digests describe is stored, wherein the global digest cache comprises a pool of a plurality of sequential arrays of digest entries of the digests and a hash table for pointing to contents within the plurality of sequential arrays.
- Dependent claims:
  - 2. The method of claim 1, wherein the global digests cache contains the plurality of digests previously loaded by the plurality of deduplication processes.
  - 3. The method of claim 2, further including reusing at least one of the plurality of sequential arrays of digest entries of the global digests cache according to a least recently used (LRU) policy.
  - 4. The method of claim 3, further including applying the LRU policy on the plurality of sequential arrays of digest entries of the digest entries of the plurality of digests in the global digests cache.
  - 5. The method of claim 4, further including searching for the input digests by considering both the plurality of digests previously loaded by the plurality of deduplication processes and the digests of the similar repository data currently loaded into the global digests cache.
  - 6. The method of claim 1, further including performing one of:
    - calculating similarity values for each of the input data chunks,
    - searching for matching similarity values in a search structure containing the similarity values, and
    - matching the digest values of the input data with the repository digest values of the repository digests loaded into the global digests cache for locating the data matches.
  - 7. The method of claim 1, further including incorporating into the sampling a first digest of each fixed sized sequence of the repository digests.
  - 8. The method of claim 1, further including performing one of:
    - determining a density of the sampling based on deduplication results of each of a plurality of sections of the input data, and
    - tuning the density of the sampling for each of the plurality of sections of the input data in accordance with the deduplication results.
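
One way to picture the cache layout recited in claim 1, together with the sampling of claim 7, is the sketch below. It is an illustration under stated assumptions, not the patented implementation: the class name `GlobalDigestsCache`, the `sample_every` parameter, and the forward and backward extension of a match are all hypothetical choices. Digests are kept in sequential arrays in load order, and only the first digest of each fixed-size run is placed in the hash table.

```python
class GlobalDigestsCache:
    """Pool of sequential digest arrays plus a hash table over a sample of them."""

    def __init__(self, sample_every=4):
        self.arrays = []   # pool of sequential arrays of digest entries
        self.index = {}    # hash table: sampled digest -> (array_id, offset)
        self.sample_every = sample_every  # assumed sampling interval

    def load(self, digests):
        """Load repository digests linearly and index every Nth one."""
        array_id = len(self.arrays)
        self.arrays.append(list(digests))
        for offset in range(0, len(digests), self.sample_every):
            self.index.setdefault(digests[offset], (array_id, offset))
        return array_id

    def match(self, input_digests):
        """Map input positions to (array_id, offset) using sampled anchors."""
        matches = {}
        for i, digest in enumerate(input_digests):
            if i in matches or digest not in self.index:
                continue
            array_id, offset = self.index[digest]
            array = self.arrays[array_id]
            # Extend backward from the anchor through unsampled neighbours.
            j, pos = i, offset
            while (j >= 0 and pos >= 0 and j not in matches
                   and input_digests[j] == array[pos]):
                matches[j] = (array_id, pos)
                j, pos = j - 1, pos - 1
            # Extend forward from the anchor.
            j, pos = i + 1, offset + 1
            while (j < len(input_digests) and pos < len(array)
                   and input_digests[j] == array[pos]):
                matches.setdefault(j, (array_id, pos))
                j, pos = j + 1, pos + 1
        return matches


# Placeholder byte strings stand in for real digest values.
cache = GlobalDigestsCache(sample_every=4)
cache.load([b"d0", b"d1", b"d2", b"d3", b"d4", b"d5", b"d6", b"d7"])
result = cache.match([b"x", b"d3", b"d4", b"d5", b"y"])
```

Here only `d0` and `d4` are in the hash table, yet `d3` and `d5` are still matched because the arrays preserve the original digest order around the sampled anchor.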

9. A system for tuning the density of a global digests cache in a data deduplication system of a computing environment, the system comprising:
- the data deduplication system;
  
  the global digests cache in association with the data deduplication system;
  
  a hash table included in the global digests cache;
  
  a search mechanism located within the global digests cache;
  
  a repository operating in the data deduplication system in communication with the global digests cache; and
  
  at least one processor device operable in the computing storage environment for controlling the data deduplication system, wherein the at least one processor device:
  
  partitions input data into input data chunks, each of the input data chunks having a size of at least 16 Megabytes (MB);
  
  wherein input digest values are calculated for each of the input data chunks,
  
  finds positions of similar repository data in a repository of data for each of the input data chunks,
  
  locates and loads repository digests of the similar repository data into the global digests cache, wherein the global digests cache contains, prior to the loading of the repository digests of the similar repository data, at least a plurality of digests previously loaded by a plurality of deduplication operations,
  
  loads a sample of the repository digests into a search mechanism within the global digests cache,
  
  applies the sampling of the repository digests for loading the repository digests into a hash table, and
  
  uses the positions of the similar repository data to locate and linearly load into the global digests cache, digests and digest block boundaries of the similar repository data in a sequence corresponding to a placement order of calculated values of the digests of the similar repository data, the placement order of the calculated values of the digests of the similar repository data correlative to an order in which the input digest values were individually calculated such that the digests of the similar repository data are each individually stored in the global digests cache based on a calculation time and order of when each of the input digests were first calculated when in un-deduplicated form, thereby storing the digests of the similar repository data in a linear and sequential form independent of a deduplicated form by which data the digests describe is stored, wherein the global digest cache comprises a pool of a plurality of sequential arrays of digest entries of the digests and a hash table for pointing to contents within the plurality of sequential arrays.
- Dependent claims:
  - 10. The system of claim 9, wherein the global digests cache contains the plurality of digests previously loaded by the plurality of deduplication processes.
  - 11. The system of claim 10, wherein the at least one processor device reuses at least one of the plurality of sequential arrays of digest entries of the global digests cache according to a least recently used (LRU) policy.
  - 12. The system of claim 11, wherein the at least one processor device applies the LRU policy on the plurality of sequential arrays of digest entries of the digest entries of the plurality of digests in the global digests cache.
  - 13. The system of claim 12, wherein the at least one processor device searches for the input digests by considering both the plurality of digests previously loaded by the plurality of deduplication processes and the digests of the similar repository data currently loaded into the global digests cache.
  - 14. The system of claim 9, wherein the at least one processor device performs one of:
    - calculating similarity values for each of the input data chunks,
    - searching for matching similarity values in a search structure containing the similarity values, and
    - matching the digest values of the input data with the repository digest values of the repository digests loaded into the global digests cache for locating the data matches.
  - 15. The system of claim 9, wherein the at least one processor device incorporates into the sampling a first digest of each fixed sized sequence of the repository digests.
  - 16. The system of claim 9, wherein the at least one processor device performs one of:
    - determining a density of the sampling based on deduplication results of each of a plurality of sections of the input data, and
    - tuning the density of the sampling for each of the plurality of sections of the input data in accordance with the deduplication results.
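
The density tuning of claim 16 can be sketched as a simple feedback rule. The claims do not say how deduplication results map to a density, so everything in this example is an assumption: the function name `tune_interval`, the thresholds, the halving and doubling steps, and the direction of adjustment (sample more densely when few digests match, more sparsely when many match).

```python
def tune_interval(interval, matched_fraction,
                  low=0.3, high=0.8, min_interval=1, max_interval=64):
    """Return a new sampling interval for the next section of input data.

    A smaller interval means a denser sample in the hash table.
    All thresholds and step sizes are illustrative assumptions.
    """
    if matched_fraction < low:
        # Few matches: sample more densely to find more anchors.
        return max(min_interval, interval // 2)
    if matched_fraction > high:
        # Many matches: a sparser sample is assumed to suffice.
        return min(max_interval, interval * 2)
    return interval


# Hypothetical per-section deduplication results (fraction of matched digests).
section_results = [0.9, 0.5, 0.1, 0.1]

interval = 8
intervals = []
for fraction in section_results:
    interval = tune_interval(interval, fraction)
    intervals.append(interval)
```

Each section's result adjusts the interval used for the next section, which is one plausible reading of tuning the density "for each of the plurality of sections".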

17. A computer program product for utilizing a global digests cache having a hash table in a data deduplication system using a processor device in a computing environment, the computer program product comprising a non-transitory computer readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising:
- a first executable portion that partitions input data into input data chunks, each of the input data chunks having a size of at least 16 Megabytes (MB);
  
  wherein input digest values are calculated for each of the input data chunks;
  
  a second executable portion that finds positions of similar repository data in a repository of data for each of the input data chunks;
  
  a third executable portion that locates and loads repository digests of the similar repository data into the global digests cache, wherein the global digests cache contains, prior to the loading of the repository digests of the similar repository data, at least a plurality of digests previously loaded by a plurality of deduplication operations;
  
  a fourth executable portion that loads a sample of the repository digests into a search mechanism within the global digests cache;
  
  a fifth executable portion that applies the sampling of the repository digests for loading the repository digests into a hash table; and
  
  a sixth executable portion that uses the positions of the similar repository data to locate and linearly load into the global digests cache, digests and digest block boundaries of the similar repository data in a sequence corresponding to a placement order of calculated values of the digests of the similar repository data, the placement order of the calculated values of the digests of the similar repository data correlative to an order in which the input digest values were individually calculated such that the digests of the similar repository data are each individually stored in the global digests cache based on a calculation time and order of when each of the input digests were first calculated when in un-deduplicated form, thereby storing the digests of the similar repository data in a linear and sequential form independent of a deduplicated form by which data the digests describe is stored, the global digest cache comprises a pool of a plurality of sequential arrays of digest entries of the digests and a hash table for pointing to contents within the plurality of sequential arrays.
- Dependent claims:
  - 18. The computer program product of claim 17, wherein the global digests cache contains the plurality of digests previously loaded by the plurality of deduplication processes.
  - 19. The computer program product of claim 18, further including a seventh executable portion that reuses at least one of the plurality of sequential arrays of digest entries of the global digests cache according to a least recently used (LRU) policy.
  - 20. The computer program product of claim 19, further including an eighth executable portion that applies the LRU policy on the plurality of sequential arrays of digest entries of the digest entries of the plurality of digests in the global digests cache.
  - 21. The computer program product of claim 20, further including a ninth executable portion that searches for the input digests by considering both the plurality of digests previously loaded by the plurality of deduplication processes and the digests of the similar repository data currently loaded into the global digests cache.
  - 22. The computer program product of claim 17, further including a seventh executable portion that performs one of:
    - calculating similarity values for each of the input data chunks,
    - searching for matching similarity values in a search structure containing the similarity values, and
    - matching the digest values of the input data with the repository digest values of the repository digests loaded into the global digests cache for locating the data matches.
  - 23. The computer program product of claim 17, further including a seventh executable portion that incorporates into the sampling a first digest of each fixed sized sequence of the repository digests.
  - 24. The computer program product of claim 17, further including a seventh executable portion that performs one of:
    - determining a density of the sampling based on deduplication results of each of a plurality of sections of the input data, and
    - tuning the density of the sampling for each of the plurality of sections of the input data in accordance with the deduplication results.
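
Claims 19 and 20 recite reusing the sequential arrays under a least recently used (LRU) policy. A minimal sketch of that idea follows, assuming a fixed pool capacity and using Python's `collections.OrderedDict` to track recency. The class name `ArrayPool` and its methods are hypothetical and not taken from the patent.

```python
from collections import OrderedDict


class ArrayPool:
    """Fixed-capacity pool of digest arrays, reused in LRU order."""

    def __init__(self, capacity):
        self.capacity = capacity       # assumed fixed number of arrays
        self.arrays = OrderedDict()    # array_id -> digests, least recent first
        self.next_id = 0

    def load(self, digests):
        """Store digests in a new array, evicting the LRU array if full.

        Returns (new_array_id, evicted_array_id or None).
        """
        evicted = None
        if len(self.arrays) >= self.capacity:
            evicted, _ = self.arrays.popitem(last=False)
        array_id = self.next_id
        self.next_id += 1
        self.arrays[array_id] = list(digests)
        return array_id, evicted

    def touch(self, array_id):
        """Mark an array as recently used, for example after a match in it."""
        self.arrays.move_to_end(array_id)


pool = ArrayPool(capacity=2)
first = pool.load([b"a0", b"a1"])
second = pool.load([b"b0", b"b1"])
pool.touch(0)                      # array 0 was just matched against
third = pool.load([b"c0", b"c1"])  # pool is full, so array 1 is reused
```

In a full cache, any hash table entries pointing into the evicted array would also need to be dropped. That bookkeeping is omitted here.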

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Akirav, Shay H., Aronovich, Lior
Primary Examiner(s)
Padmanabhan, Mano
Assistant Examiner(s)
Baughman, William E

Application Number

US13/941,958
Publication Number

US 20150019817A1
Time in Patent Office

1,674 Days
CPC Class Codes

G06F 12/0848   Partitioned cache, e.g. sep...

G06F 12/0875   with dedicated cache, e.g. ...

G06F 16/1752   based on file chunks

G06F 3/0619   in relation to data integri...

G06F 3/0641   De-duplication techniques

G06F 3/067   Distributed or networked st...

Y02D 10/00   Energy efficient computing,...
