Identification of high deduplication data

  • US 10,255,290 B2
  • Filed: 04/17/2018
  • Issued: 04/09/2019
  • Est. Priority Date: 11/30/2016
  • Status: Active Grant
First Claim

1. A computer program product for identifying portions of a dataset with high deduplication potential, the computer program product comprising one or more computer readable storage media and program instructions stored on said one or more computer readable storage media, said program instructions comprising instructions to:

    divide a dataset into a plurality of regions, wherein:

    the dataset includes a plurality of logical entities; and

    each logical entity of the plurality of logical entities includes one or more regions of the plurality of regions;

    divide the plurality of regions into a plurality of chunks of fixed size;

    determine a sample size of the plurality of chunks to be sampled for each region of the plurality of regions, wherein the sample size is determined based, at least in part, on:

    an acceptance of a likelihood of identifying at least one collision between a first region corresponding to a first logical entity of the plurality of logical entities and a second region corresponding to a second logical entity of the plurality of logical entities of a first cluster of logical entities, wherein:

    the first cluster of logical entities includes at least the first logical entity and the second logical entity; and

    the likelihood of identifying at least the one collision between the first region corresponding to the first logical entity and the second region corresponding to the second logical entity of the first cluster of logical entities is based, at least in part, on instructions to:

    identify a cluster size for the first cluster of logical entities; and

    determine a degree of similarity between the first logical entity and the second logical entity of the first cluster of logical entities;

    sample the plurality of chunks for each region based on the determined sample size;

    generate a hash value for each chunk of the plurality of chunks sampled;

    store each hash value in an index, wherein storing each hash value comprises instructions to:

    store a first location of the region corresponding to the hash value in the index; and

    store a second location within the region corresponding to the hash value in the index;

    identify a plurality of collisions between the plurality of regions, wherein each collision of the plurality of collisions denotes that two or more regions of the plurality of regions share an identical hash value;

    determine that a region of the plurality of regions of the dataset includes deduplicatable data, wherein the region of the plurality of the dataset includes deduplicatable data if the region shares an identical hash value with another region of the plurality of regions;

    mark a first subset of the plurality of regions based on identifying a series of regions, wherein each region in the series of regions has at least one collision with at least one other region;

    mark a second subset of the plurality of regions based on identifying at least one collision between each region of a number of regions of the series of regions, wherein the number of regions exceeds a given threshold;

    mark a third subset of the plurality of regions based on identifying a number of collisions between each region of a number of regions of the series of regions, wherein:

    the number of collisions exceeds a first given threshold; and

    the number of regions exceeds a second given threshold;

    separate at least one of the first subset, the second subset, and the third subset of the plurality of regions from the dataset based, at least in part, on available computing resources of a first storage system that supports data deduplication;

    migrate at least one of the first subset, the second subset, and the third subset of the plurality of regions separated from the dataset to the first storage system; and

    migrate those regions in the plurality of regions that are unmarked to a second storage system that does not support data deduplication.
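
The chunking, sampling, hashing, indexing, and collision steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the 4-byte chunk size, the use of SHA-256, and the sampling rule (keeping the chunks with the smallest hash values, so that identical content is selected in every region) are all assumptions made for illustration; the claim does not specify them. The sample size is passed in as a parameter rather than derived from cluster size and degree of similarity, and only the simplest marking rule (at least one collision with another region) is shown.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict

CHUNK_SIZE = 4  # bytes; an illustrative value, not taken from the claim


def split_fixed(data: bytes, size: int) -> list[bytes]:
    """Divide a region into chunks of fixed size."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def sample_region(region: bytes, sample_size: int) -> list[tuple[str, int]]:
    """Return (hash, offset) pairs for the sampled chunks of one region.

    Assumption: sampling keeps the chunks with the smallest hash values,
    so identical chunks are chosen consistently across regions.
    """
    hashed = [
        (hashlib.sha256(chunk).hexdigest(), idx * CHUNK_SIZE)
        for idx, chunk in enumerate(split_fixed(region, CHUNK_SIZE))
    ]
    return sorted(hashed)[:sample_size]


def build_index(regions: dict[str, bytes], sample_size: int):
    """Map each hash to (region id, offset within region) entries."""
    index = defaultdict(list)
    for region_id, data in regions.items():
        for digest, offset in sample_region(data, sample_size):
            index[digest].append((region_id, offset))
    return index


def colliding_regions(index) -> dict[str, set[str]]:
    """For each region, list the other regions sharing a hash with it."""
    partners = defaultdict(set)
    for entries in index.values():
        ids = {region_id for region_id, _ in entries}
        if len(ids) > 1:  # the same hash appears in two or more regions
            for region_id in ids:
                partners[region_id] |= ids - {region_id}
    return partners


def mark_regions(regions: dict[str, bytes], sample_size: int):
    """Split region ids into marked (has a collision) and unmarked."""
    partners = colliding_regions(build_index(regions, sample_size))
    marked = set(partners)
    return marked, set(regions) - marked


regions = {
    "a": b"AAAABBBBCCCC",
    "b": b"DDDDBBBBEEEE",  # shares the chunk b"BBBB" with region "a"
    "c": b"FFFFGGGGHHHH",
}
marked, unmarked = mark_regions(regions, sample_size=3)
```

In this sketch the marked regions would be the candidates for the storage system that supports deduplication, and the unmarked regions would go to the system that does not. The stricter second and third marking rules could be layered on top by counting entries in the mapping returned by colliding_regions against thresholds.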
