Sampling-based deduplication estimation

US 10,198,455 B2
Filed: 01/13/2016
Issued: 02/05/2019
Est. Priority Date: 01/13/2016
Status: Active Grant

Abstract

A method, including partitioning a dataset into a first number of data units, and selecting, based on a sampling ratio, a second number of the data units. A hash value is calculated for each of the selected data units, and a first histogram is computed indicating a first duplication count for each of the calculated hash values. Based on respective frequencies of the calculated hash values, a second histogram is computed indicating an observed frequency for each of the first duplication counts in the first histogram, and based on the sampling ratio and the second histogram, a target function is derived. A third histogram that minimizes the target function is derived, the third histogram including, for the first number of the storage units, second duplication counts and a respective predicted frequency for each of the second duplication counts. Finally, a deduplication ratio is determined based on the third histogram.
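
The first two histograms described in the abstract can be illustrated with a minimal sketch. Everything named here (`sample_units`, `first_histogram`, `second_histogram`, the SHA-256 hash, the fixed seed and the toy dataset) is an assumption made for illustration; the patent does not prescribe a particular hash function or sampling routine.

```python
import hashlib
import random
from collections import Counter


def sample_units(units, sampling_ratio, seed=0):
    """Randomly pick round(len(units) * sampling_ratio) units without replacement."""
    k = round(len(units) * sampling_ratio)
    return random.Random(seed).sample(units, k)


def first_histogram(sample):
    """Map each hash value to its duplication count within the sample."""
    # SHA-256 is an illustrative choice, not taken from the patent.
    return Counter(hashlib.sha256(u).hexdigest() for u in sample)


def second_histogram(first):
    """Map each duplication count to the number of hash values that have it."""
    return Counter(first.values())


# Hypothetical toy dataset: 8 data units, some of them duplicates.
dataset = [b"a", b"a", b"a", b"b", b"b", b"c", b"d", b"e"]
sample = sample_units(dataset, sampling_ratio=1.0)  # ratio 1.0 keeps every unit
h1 = first_histogram(sample)
h2 = second_histogram(h1)
print(dict(h2))
```

With a sampling ratio below 1.0 the second histogram describes only the sample, which is why the method goes on to derive a third histogram for the full dataset.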

21 Claims

1. A method for determining a deduplication rate of logical data units in a dataset, comprising:
- randomly selecting, from a dataset partitioned into a first number of logical data units, a second number of logical data units determined by a sampling ratio, to be included in a sample group;
  
  calculating, by a processor, respective hash values for each of the selected logical data units;
  
  computing a first histogram indicating a duplication count of logical data units in the sample group, using the calculated hash values;
  
  computing, based on respective frequencies of the calculated hash values, a second histogram indicating respective frequencies of the duplication counts in the first histogram;
  
  deriving a third histogram of predicted respective frequencies of duplication counts in the set of logical data units, by performing an optimization method, with a target function that minimizes a distance between the second histogram and the result of applying a sampling transformation with the specified sampling ratio on candidate third histograms; and
  
  determining, based on the third histogram, a deduplication ratio for the dataset.
- Dependent Claims (2, 3, 4, 5, 6, 7, 21)
  - 2. The method according to claim 1, wherein the logical data units comprise first logical data units having a first length, and wherein selecting the second number of the first logical data units comprises selecting a third number of second logical data units from the dataset, the second logical data units having a second length greater than the first length, and extracting, from the retrieved second logical data units, the second number of the first logical data units.
  - 3. The method according to claim 1, wherein the deduplication ratio indicates a first space savings for implementing deduplication on the logical storage units, and comprising estimating a compression ratio for each of the second number of the storage units, and determining a second space savings based on the compression ratios and the deduplication ratio.
  - 4. The method according to claim 3, wherein each of the duplication counts in the second histogram is associated with one or more of the logical data units, and comprising calculating, for each given deduplication count in the second histogram, an average of the compression ratios of the logical data units associated with the given deduplication count, weighting, for each of the deduplication counts in the second histogram, the respective observed frequency according to the respective average compression ratio, and weighting the third histogram based on the averages of the compression ratios.
  - 5. The method according to claim 1, wherein the first histogram comprises multiple entries for a given calculated hash value, each of the multiple entries incorporating, for the logical data unit associated with each given calculated hash value, one or more properties selected from a first group consisting of a length type, a physical location, a virtual location and a timestamp, the length type selected from a second group consisting of a fixed length and a variable length.
  - 6. The method according to claim 1, wherein identifying the third histogram that minimizes the target function comprises computing, on the second histogram, a calculation selected from a group consisting of a quadratic programming computation, a maximum likelihood computation and a linear programming computation.
  - 7. The method according to claim 1, wherein the first histogram indicates duplication counts for a subset of the calculated hash values.
  - 21. The method according to claim 1, and comprising deciding, based on the determined deduplication ratio, whether or not to perform deduplication on the dataset.
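
The derivation of the third histogram in claim 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented procedure: the sampling transformation is assumed to be binomial thinning (each unit kept independently with probability `p`), the distance is assumed to be squared Euclidean, the deduplication ratio is assumed to mean distinct units divided by total units, and a brute-force search over a hypothetical list of candidates stands in for the quadratic programming, maximum likelihood or linear programming computations named in claim 6. All function names are invented for this example.

```python
from math import comb


def sampling_transform(full_hist, p, max_count):
    """Expected sample histogram, assuming each unit is kept with probability p.

    full_hist[j] is the number of distinct hash values with j copies in the
    full dataset; the result maps each sample count i to its expected frequency.
    """
    expected = {i: 0.0 for i in range(1, max_count + 1)}
    for j, freq in full_hist.items():
        for i in range(1, min(j, max_count) + 1):
            # Binomial probability that i of the j copies land in the sample.
            expected[i] += freq * comb(j, i) * p**i * (1 - p) ** (j - i)
    return expected


def distance(observed, expected):
    """Squared Euclidean distance between two histograms."""
    keys = set(observed) | set(expected)
    return sum((observed.get(i, 0) - expected.get(i, 0.0)) ** 2 for i in keys)


def best_candidate(observed, candidates, p):
    """Pick the candidate third histogram whose transform is closest to `observed`."""
    max_count = max(max(c) for c in candidates)
    return min(
        candidates,
        key=lambda c: distance(observed, sampling_transform(c, p, max_count)),
    )


def dedup_ratio(full_hist):
    """Distinct units divided by total units (an assumed convention)."""
    distinct = sum(full_hist.values())
    total = sum(j * f for j, f in full_hist.items())
    return distinct / total


# Hypothetical observed second histogram from a 50% sample.
observed = {1: 50, 2: 25}
candidates = [{1: 100}, {2: 100}]  # all unique vs. every unit stored twice
best = best_candidate(observed, candidates, p=0.5)
print(best, dedup_ratio(best))
```

A real solver would search a continuous space of candidate histograms rather than a short list; the sketch only shows how the target function ties the observed histogram, the sampling ratio and the candidates together.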

8. An apparatus, comprising:
- a storage device configured to store a dataset; and
  
  a processor configured:
  
  to randomly select, from a dataset partitioned into a first number of logical data units, a second number of logical data units determined by a sampling ratio, to be included in a sample group,
  
  to calculate respective hash values for each of the selected logical data units,
  
  to compute a first histogram indicating a duplication count of logical data units in the sample group, using the calculated hash values,
  
  to compute, based on respective frequencies of the calculated hash values, a second histogram indicating respective frequencies of the duplication counts in the first histogram,
  
  to derive a third histogram of predicted respective frequencies of duplication counts in the set of logical data units, by performing an optimization method, with a target function that minimizes a distance between the second histogram and the result of applying a sampling transformation with the specified sampling ratio on candidate third histograms, and
  
  to determine, based on the third histogram, a deduplication ratio for the dataset.
- Dependent Claims (9, 10, 11, 12, 13, 14)
  - 9. The apparatus according to claim 8, wherein the logical data units comprise first logical data units having a first length, and wherein the processor is configured to select the second number of the logical data units by selecting a third number of second logical data units from the dataset, the second logical data units having a second length greater than the first length, and extracting, from the second logical data units, the second number of the first logical data units.
  - 10. The apparatus according to claim 8, wherein the deduplication ratio indicates a first space savings for implementing deduplication on the logical storage units, and wherein the processor is configured to estimate a compression ratio for each of the second number of the storage units, and to determine a second space savings based on the compression ratios and the deduplication ratio.
  - 11. The apparatus according to claim 10 wherein each of the duplication counts in the second histogram is associated with one or more of the logical data units, and wherein the processor is configured to calculate, for each given deduplication count in the second histogram, an average of the compression ratios of the logical data associated with the given deduplication count, to weight, for each of the deduplication counts in the second histogram, the respective observed frequency according to the respective average compression ratio, and to weight the third histogram based on the averages of the compression ratios.
  - 12. The apparatus according to claim 8, wherein the first histogram comprises multiple entries for a given calculated hash value, each of the multiple entries incorporating, for the logical data unit associated with each given calculated hash value, one or more properties selected from a first group consisting of a length type, a physical location, a virtual location and a timestamp, the length type selected from a second group consisting of a fixed length and a variable length.
  - 13. The apparatus according to claim 8, wherein the processor is configured to identify the third histogram that minimizes the target function by computing, on the second histogram, a calculation selected from a group consisting of a quadratic programming computation, a maximum likelihood computation and a linear programming computation.
  - 14. The apparatus according to claim 8, wherein the first histogram indicates duplication counts for a subset of the calculated hash values.

15. A computer program product, the computer program product comprising:
- a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising:
  
  computer readable program code configured to randomly select, from a dataset partitioned into a first number of logical data units, a second number of logical data units determined by a sampling ratio, to be included in a sample group;
  
  computer readable program code configured to calculate respective hash values for each of the selected logical data units;
  
  computer readable program code configured to compute a first histogram indicating a duplication count of logical data units in the sample group, using the calculated hash values;
  
  computer readable program code configured to compute, based on respective frequencies of the calculated hash values, a second histogram indicating respective frequencies of the duplication counts in the first histogram;
  
  computer readable program code configured to derive a third histogram of predicted respective frequencies of duplication counts in the set of logical data units, by performing an optimization method, with a target function that minimizes a distance between the second histogram and the result of applying a sampling transformation with the specified sampling ratio on candidate third histograms; and
  
  computer readable program code configured to determine, based on the third histogram, a deduplication ratio for the dataset.
- Dependent Claims (16, 17, 18, 19, 20)
  - 16. The computer program product according to claim 15, wherein the logical data units comprise first logical data units having a first length, and wherein the computer readable program code is configured to select the second number of the logical data units by selecting a third number of second logical data units from the dataset, the second logical data units having a second length greater than the first length, and extracting, from the second logical data units, the second number of the first logical data units.
  - 17. The computer program product according to claim 15, wherein the deduplication ratio indicates a first space savings for implementing deduplication on the logical storage units, and comprising computer readable program code configured to estimate a compression ratio for each of the second number of the storage units, and to determine a second space savings based on the compression ratios and the deduplication ratio.
  - 18. The computer program product according to claim 17, wherein each of the duplication counts in the second histogram is associated with one or more of the logical data units, and comprising computer readable program code configured to calculate, for each given deduplication count in the second histogram, an average of the compression ratios of the logical data associated with the given deduplication count, to weight, for each of the first deduplication counts in the second histogram, the respective observed frequency according to the respective average compression ratio, to weight the third histogram based on the average of the compression ratios.
  - 19. The computer program product according to claim 15, wherein the first histogram comprises multiple entries for a given calculated hash value, each of the multiple entries incorporating, for the logical data unit associated with each given calculated hash value, one or more properties selected from a first group consisting of a length type, a physical location, a virtual location and a timestamp, the length type selected from a second group consisting of a fixed length and a variable length.
  - 20. The computer program product according to claim 15, wherein the computer readable program code is configured to identify the third histogram that minimizes the target function by computing, on the second histogram, a calculation selected from a group consisting of a quadratic programming computation, a maximum likelihood computation and a linear programming computation.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Harnik, Danny; Chambliss, David; Margalit, Oded; Sotnikov, Dmitry
Primary Examiner(s): Lee, Wilson
Application Number: US14/994,161
Publication Number: US 20170199895A1
Time in Patent Office: 1,119 Days
Field of Search: None
CPC Class Codes:
G06F 16/137 Hash-based content-based in...
G06F 16/1752 based on file chunks
