Sampling-based deduplication estimation
First Claim
1. A method for determining a deduplication rate of logical data units in a dataset, comprising:
randomly selecting, from a dataset partitioned into a first number of logical data units, a second number of logical data units determined by a sampling ratio, to be included in a sample group;
calculating, by a processor, respective hash values for each of the selected logical data units;
computing a first histogram indicating a duplication count of logical data units in the sample group, using the calculated hash values;
computing, based on respective frequencies of the calculated hash values, a second histogram indicating respective frequencies of the duplication counts in the first histogram;
deriving a third histogram of predicted respective frequencies of duplication counts in the set of logical data units, by performing an optimization method, with a target function that minimizes a distance between the second histogram and the result of applying a sampling transformation with the specified sampling ratio on candidate third histograms; and
determining, based on the third histogram, a deduplication ratio for the dataset.
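The sampling, hashing, and first two histogram steps recited above can be illustrated with a minimal Python sketch. The function name `sample_histograms`, the choice of SHA-256, the rounding of the sample size, and the fixed seed are assumptions made for illustration; the claim does not specify any of them.

```python
import hashlib
import random
from collections import Counter


def sample_histograms(units, sampling_ratio, seed=0):
    """Return (first_histogram, second_histogram) for a random sample of units.

    Hypothetical helper: the name, seed, and rounding rule are not from the claim.
    """
    rng = random.Random(seed)
    # Second number of units = sampling ratio times the first number (rounded).
    sample_size = max(1, round(len(units) * sampling_ratio))
    sample = rng.sample(units, sample_size)

    # Hash every selected unit; SHA-256 is an illustrative choice only.
    hashes = [hashlib.sha256(u).hexdigest() for u in sample]

    # First histogram: hash value -> duplication count within the sample.
    first = Counter(hashes)
    # Second histogram: duplication count -> number of hash values with that count.
    second = Counter(first.values())
    return first, second


# Toy dataset: one unit repeated 4 times, one repeated twice, two unique units.
units = [b"A"] * 4 + [b"B"] * 2 + [b"C", b"D"]
first, second = sample_histograms(units, sampling_ratio=1.0)
half_first, half_second = sample_histograms(units, sampling_ratio=0.5)
```

With a sampling ratio of 1.0 the sample is the whole toy dataset, so the second histogram records one hash seen four times, one seen twice, and two seen once.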
Abstract
A method, including partitioning a dataset into a first number of data units, and selecting, based on a sampling ratio, a second number of the data units. A hash value is calculated for each of the selected data units, and a first histogram is computed indicating a first duplication count for each of the calculated hash values. Based on respective frequencies of the calculated hash values, a second histogram is computed indicating an observed frequency for each of the first duplication counts in the first histogram, and based on the sampling ratio and the second histogram, a target function is derived. A third histogram that minimizes the target function is derived, the third histogram including, for the first number of the data units, second duplication counts and a respective predicted frequency for each of the second duplication counts. Finally, a deduplication ratio is determined based on the third histogram.
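The target function and the final ratio can be sketched as follows. This is only one possible reading: the binomial sampling model, the L1 distance, the definition of the deduplication ratio as distinct units over total units, and the brute-force search over a hand-written candidate list (standing in for the optimization method) are all assumptions, as are the function names.

```python
from math import comb


def sampling_transform(candidate, p):
    """Expected sample histogram for a candidate full-dataset histogram.

    candidate maps duplication count j -> number of distinct hash values.
    Assumption: each unit is kept independently with probability p.
    """
    expected = {}
    for j, freq in candidate.items():
        # A hash with j copies shows up i times with binomial probability.
        for i in range(1, j + 1):
            prob = comb(j, i) * p**i * (1 - p) ** (j - i)
            expected[i] = expected.get(i, 0.0) + freq * prob
    return expected


def distance(observed, expected):
    """L1 distance between two histograms (an illustrative choice)."""
    keys = set(observed) | set(expected)
    return sum(abs(observed.get(k, 0) - expected.get(k, 0.0)) for k in keys)


def dedup_ratio(histogram):
    """Distinct units divided by total units (one possible definition)."""
    distinct = sum(histogram.values())
    total = sum(j * f for j, f in histogram.items())
    return distinct / total


def best_candidate(observed, candidates, p):
    """Pick the candidate whose transformed histogram is closest to observed.

    Brute force over a fixed list; a stand-in for a real optimizer.
    """
    return min(candidates, key=lambda c: distance(observed, sampling_transform(c, p)))


# Observed second histogram from a hypothetical 50% sample.
observed = {1: 50, 2: 25}
candidates = [{2: 100}, {1: 100}, {4: 100}]
third = best_candidate(observed, candidates, p=0.5)
ratio = dedup_ratio(third)
```

In this toy case the candidate with 100 hashes of two copies each reproduces the observed histogram exactly under the assumed model, so it is selected and the resulting ratio is 0.5.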
21 Claims
1. A method for determining a deduplication rate of logical data units in a dataset, comprising:
randomly selecting, from a dataset partitioned into a first number of logical data units, a second number of logical data units determined by a sampling ratio, to be included in a sample group;
calculating, by a processor, respective hash values for each of the selected logical data units;
computing a first histogram indicating a duplication count of logical data units in the sample group, using the calculated hash values;
computing, based on respective frequencies of the calculated hash values, a second histogram indicating respective frequencies of the duplication counts in the first histogram;
deriving a third histogram of predicted respective frequencies of duplication counts in the set of logical data units, by performing an optimization method, with a target function that minimizes a distance between the second histogram and the result of applying a sampling transformation with the specified sampling ratio on candidate third histograms; and
determining, based on the third histogram, a deduplication ratio for the dataset.
Dependent claims: 2, 3, 4, 5, 6, 7, 21.
8. An apparatus, comprising:
a storage device configured to store a dataset; and
a processor configured:
to randomly select, from a dataset partitioned into a first number of logical data units, a second number of logical data units determined by a sampling ratio, to be included in a sample group,
to calculate respective hash values for each of the selected logical data units,
to compute a first histogram indicating a duplication count of logical data units in the sample group, using the calculated hash values,
to compute, based on respective frequencies of the calculated hash values, a second histogram indicating respective frequencies of the duplication counts in the first histogram,
to derive a third histogram of predicted respective frequencies of duplication counts in the set of logical data units, by performing an optimization method, with a target function that minimizes a distance between the second histogram and the result of applying a sampling transformation with the specified sampling ratio on candidate third histograms, and
to determine, based on the third histogram, a deduplication ratio for the dataset.
Dependent claims: 9, 10, 11, 12, 13, 14.
15. A computer program product, the computer program product comprising:
a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising:
computer readable program code configured to randomly select, from a dataset partitioned into a first number of logical data units, a second number of logical data units determined by a sampling ratio, to be included in a sample group;
computer readable program code configured to calculate respective hash values for each of the selected logical data units;
computer readable program code configured to compute a first histogram indicating a duplication count of logical data units in the sample group, using the calculated hash values;
computer readable program code configured to compute, based on respective frequencies of the calculated hash values, a second histogram indicating respective frequencies of the duplication counts in the first histogram;
computer readable program code configured to derive a third histogram of predicted respective frequencies of duplication counts in the set of logical data units, by performing an optimization method, with a target function that minimizes a distance between the second histogram and the result of applying a sampling transformation with the specified sampling ratio on candidate third histograms; and
computer readable program code configured to determine, based on the third histogram, a deduplication ratio for the dataset.
Dependent claims: 16, 17, 18, 19, 20.
Specification