
Sampling-based deduplication estimation

  • US 10,198,455 B2
  • Filed: 01/13/2016
  • Issued: 02/05/2019
  • Est. Priority Date: 01/13/2016
  • Status: Active Grant
First Claim

1. A method for determining a deduplication rate of logical data units in a dataset, comprising:

    randomly selecting, from a dataset partitioned into a first number of logical data units, a second number of logical data units determined by a sampling ratio, to be included in a sample group;

    calculating, by a processor, respective hash values for each of the selected logical data units;

    computing a first histogram indicating a duplication count of logical data units in the sample group, using the calculated hash values;

    computing, based on respective frequencies of the calculated hash values, a second histogram indicating respective frequencies of the duplication counts in the first histogram;

    deriving a third histogram of predicted respective frequencies of duplication counts in the set of logical data units, by performing an optimization method, with a target function that minimizes a distance between the second histogram and the result of applying a sampling transformation with the specified sampling ratio on candidate third histograms; and

    determining, based on the third histogram, a deduplication ratio for the dataset.
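
The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation, and it rests on several assumptions that the claim does not state: SHA-256 as the hash function, a binomial model for the sampling transformation, squared Euclidean distance as the target function, projected gradient descent as the optimization method, and the deduplication ratio taken as distinct units divided by total units. All function names and the `max_count` and `iters` parameters are hypothetical.

```python
import hashlib
import random
from collections import Counter
from math import comb


def sample_units(units, ratio, rng):
    """Randomly pick round(ratio * len(units)) units without replacement."""
    k = max(1, round(ratio * len(units)))
    return rng.sample(units, k)


def first_histogram(sample):
    """Map each hash value to its duplication count in the sample."""
    return Counter(hashlib.sha256(u).hexdigest() for u in sample)


def second_histogram(first):
    """Map each duplication count to the number of distinct hashes with it."""
    return Counter(first.values())


def transform_matrix(ratio, max_i, max_j):
    """T[i-1][j-1]: probability that a unit with j copies in the dataset
    appears exactly i times in the sample (assumed binomial model)."""
    return [[comb(j, i) * ratio ** i * (1 - ratio) ** (j - i) if i <= j else 0.0
             for j in range(1, max_j + 1)]
            for i in range(1, max_i + 1)]


def derive_third_histogram(second, ratio, max_count, iters=2000):
    """Find non-negative x minimizing ||y - T x||^2 by projected gradient descent,
    where y is the second histogram and x the candidate third histogram."""
    max_i = max(second)
    max_j = max(max_count, max_i)
    y = [float(second.get(i, 0)) for i in range(1, max_i + 1)]
    T = transform_matrix(ratio, max_i, max_j)
    # Step size 1/L, with L = 2 * (max row sum) * (max column sum) bounding
    # the Lipschitz constant of the gradient.
    row = max(sum(r) for r in T)
    col = max(sum(T[i][j] for i in range(max_i)) for j in range(max_j))
    step = 1.0 / (2.0 * row * col)
    # Start from the sample's own histogram, padded with zeros.
    x = [y[j] if j < max_i else 0.0 for j in range(max_j)]
    for _ in range(iters):
        resid = [y[i] - sum(T[i][j] * x[j] for j in range(max_j))
                 for i in range(max_i)]
        grad = [-2.0 * sum(T[i][j] * resid[i] for i in range(max_i))
                for j in range(max_j)]
        # Project onto the non-negative orthant after each gradient step.
        x = [max(0.0, x[j] - step * grad[j]) for j in range(max_j)]
    return x


def dedup_ratio(third):
    """Assumed definition: distinct units divided by total units."""
    distinct = sum(third)
    total = sum(j * x for j, x in enumerate(third, start=1))
    return distinct / total if total else 1.0


def estimate_dedup_ratio(units, ratio, seed=0, max_count=None, iters=2000):
    """Run the claimed steps end to end for a sampling ratio in (0, 1]."""
    rng = random.Random(seed)
    second = second_histogram(first_histogram(sample_units(units, ratio, rng)))
    if max_count is None:
        # Hypothetical default bound on duplication counts in the full dataset.
        max_count = int(2 * max(second) / ratio) + 1
    return dedup_ratio(derive_third_histogram(second, ratio, max_count, iters))
```

Under these assumptions, a sampling ratio of 1.0 makes the transformation the identity, so `estimate_dedup_ratio` returns the exact fraction of distinct units. For smaller ratios the result is only an estimate, and its quality depends on the chosen distance, the bound `max_count`, and the number of iterations.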
