Identification of high deduplication data

US 10,255,290 B2
Filed: 04/17/2018
Issued: 04/09/2019
Est. Priority Date: 11/30/2016
Status: Active Grant

First Claim

1. A computer program product for identifying portions of a dataset with high deduplication potential, the computer program product comprising one or more computer readable storage media and program instructions stored on said one or more computer readable storage media, said program instructions comprising instructions to:

divide a dataset into a plurality of regions, wherein:

the dataset includes a plurality of logical entities; and

each logical entity of the plurality of logical entities includes one or more regions of the plurality of regions;

divide the plurality of regions into a plurality of chunks of fixed size;

determine a sample size of the plurality of chunks to be sampled for each region of the plurality of regions, wherein the sample size is determined based, at least in part, on:

an acceptance of a likelihood of identifying at least one collision between a first region corresponding to a first logical entity of the plurality of logical entities and a second region corresponding to a second logical entity of the plurality of logical entities of a first cluster of logical entities, wherein:

the first cluster of logical entities includes at least the first logical entity and the second logical entity; and

the likelihood of identifying at least the one collision between the first region corresponding to the first logical entity and the second region corresponding to the second logical entity of the first cluster of logical entities is based, at least in part, on instructions to:

identify a cluster size for the first cluster of logical entities; and

determine a degree of similarity between the first logical entity and the second logical entity of the first cluster of logical entities;

sample the plurality of chunks for each region based on the determined sample size;

generate a hash value for each chunk of the plurality of chunks sampled;

store each hash value in an index, wherein storing each hash value comprises instructions to:

store a first location of the region corresponding to the hash value in the index; and

store a second location within the region corresponding to the hash value in the index;

identify a plurality of collisions between the plurality of regions, wherein each collision of the plurality of collisions denotes that two or more regions of the plurality of regions share an identical hash value;

determine that a region of the plurality of regions of the dataset includes deduplicatable data, wherein the region of the plurality of the dataset includes deduplicatable data if the region shares an identical hash value with another region of the plurality of regions;

mark a first subset of the plurality of regions based on identifying a series of regions, wherein each region in the series of regions has at least one collision with at least one other region;

mark a second subset of the plurality of regions based on identifying at least one collision between each region of a number of regions of the series of regions, wherein the number of regions exceeds a given threshold;

mark a third subset of the plurality of regions based on identifying a number of collisions between each region of a number of regions of the series of regions, wherein:

the number of collisions exceeds a first given threshold; and

the number of regions exceeds a second given threshold;

separate at least one of the first subset, the second subset, and the third subset of the plurality of regions from the dataset based, at least in part, on available computing resources of a first storage system that supports data deduplication;

migrate at least one of the first subset, the second subset, and the third subset of the plurality of regions separated from the dataset to the first storage system; and

migrate those regions in the plurality of regions that are unmarked to a second storage system that does not support data deduplication.
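
The core steps of the claim (fixed-size chunking, sampling, hashing, indexing with locations, and collision detection) can be illustrated with a minimal Python sketch. This is not the patented implementation: the function names, the use of SHA-256, the seeded random sampling, and the tiny in-memory regions are all assumptions made for illustration. Only the first marking rule (a region is marked if it collides with at least one other region) is shown; the threshold-based second and third subsets and the migration step are omitted.

```python
import hashlib
import random
from collections import defaultdict


def split(data: bytes, size: int) -> list[bytes]:
    """Cut data into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def build_index(regions, chunk_size, sample_size, seed=0):
    """Sample chunks from each region and index their hashes.

    Each index entry maps a hash value to a list of
    (region id, byte offset within the region) pairs, mirroring the
    "first location" and "second location" stored in the claim.
    """
    rng = random.Random(seed)  # hypothetical sampling strategy
    index = defaultdict(list)
    for region_id, region in enumerate(regions):
        chunks = split(region, chunk_size)
        count = min(sample_size, len(chunks))
        for pos in rng.sample(range(len(chunks)), count):
            digest = hashlib.sha256(chunks[pos]).hexdigest()
            index[digest].append((region_id, pos * chunk_size))
    return index


def colliding_regions(index):
    """Return ids of regions that share a hash value with another region."""
    marked = set()
    for locations in index.values():
        region_ids = {region_id for region_id, _ in locations}
        if len(region_ids) >= 2:
            marked |= region_ids
    return marked


# Three toy regions; regions 0 and 1 share the chunk b"AAAA".
regions = [b"AAAABBBB", b"CCCCAAAA", b"XXXXYYYY"]
# Sampling every chunk (2 per region) keeps this example deterministic.
index = build_index(regions, chunk_size=4, sample_size=2)
marked = colliding_regions(index)
unmarked = set(range(len(regions))) - marked
```

In this toy run, the marked regions would be candidates for the storage system that supports deduplication, and the unmarked region would go to the one that does not. With a sample size smaller than the number of chunks per region, shared chunks can be missed, which is why the claim ties the sample size to an accepted likelihood of finding a collision.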

Abstract

A computer-implemented method includes dividing a data set into a plurality of regions and dividing the plurality of regions into a plurality of chunks of fixed size. The computer-implemented method further includes determining a sample size of the plurality of chunks to be sampled for each region, wherein the sample size is determined based, at least in part, on an acceptance of a likelihood of identifying at least one collision between two regions corresponding to logical entities of a first cluster of logical entities. The computer-implemented method further includes sampling the plurality of chunks for each region based on the determined sample size. The computer-implemented method further includes generating a hash value for each chunk sampled and storing each hash value in an index. The computer-implemented method further includes identifying one or more collisions between the plurality of regions. A corresponding computer system and computer program product are also disclosed.
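
To make the sample-size step concrete, here is a small Python sketch under a deliberately simplified model. It is an assumption for illustration, not the formula used in the patent: each sampled chunk is treated as an independent trial that collides with a given other entity in the cluster with probability `hit_probability` (a hypothetical stand-in for the degree of similarity), and `cluster_size` is the number of logical entities in the cluster. All names are hypothetical.

```python
import math


def collision_probability(sample_size: int, hit_probability: float,
                          cluster_size: int) -> float:
    """Probability of at least one collision under the toy model.

    Every sampled chunk gets one independent chance to collide with
    each of the (cluster_size - 1) other entities in the cluster.
    """
    trials = sample_size * (cluster_size - 1)
    return 1.0 - (1.0 - hit_probability) ** trials


def min_sample_size(target: float, hit_probability: float,
                    cluster_size: int) -> int:
    """Smallest sample size whose collision probability reaches `target`."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if not 0.0 < hit_probability < 1.0:
        raise ValueError("hit_probability must be strictly between 0 and 1")
    if cluster_size < 2:
        raise ValueError("a cluster needs at least two logical entities")
    # Chance that a single sampled chunk collides with none of the others.
    per_sample_miss = (1.0 - hit_probability) ** (cluster_size - 1)
    return math.ceil(math.log(1.0 - target) / math.log(per_sample_miss))


# Accept a 95% likelihood of seeing at least one collision.
k_pair = min_sample_size(target=0.95, hit_probability=0.1, cluster_size=2)
k_cluster = min_sample_size(target=0.95, hit_probability=0.1, cluster_size=5)
```

Under this model, larger clusters and higher similarity both reduce the number of chunks that need to be sampled per region, which matches the intuition behind basing the sample size on cluster size and degree of similarity.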

18 Citations

1 Claim

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Harnik, Danny; Khaitzin, Ety; Marenkov, Sergey; Sotnikov, Dmitry
Primary Examiner(s): Shanmugasundaram, Kannan
Application Number: US15/954,702
Publication Number: US 20180225300A1
Time in Patent Office: 357 Days
CPC Class Codes:
G06F 16/1748 De-duplication implemented ...
G06F 16/1752 based on file chunks
