Packing deduplicated data into finite-sized containers

US 9,880,771 B2
Filed: 06/19/2012
Issued: 01/30/2018
Est. Priority Date: 06/19/2012
Status: Active Grant

Abstract

Deduplicated data is packed into finite-sized containers. A similarity score is calculated between similarly compared files of the deduplicated data. The similarity score is used for grouping the similarly compared files of the deduplicated data into subsets for destaging each of the subsets from a deduplication system to one of the finite-sized containers.
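
As an illustration of the similarity score mentioned in the abstract, the following minimal sketch scores two files by the fraction of chunk fingerprints they share. The Jaccard-style formula, the function name `similarity_score`, and the representation of a file as a set of fingerprint strings are assumptions made for this example; the listing does not state the formula the patent uses.

```python
def similarity_score(chunks_a, chunks_b):
    """Return the share of distinct chunk fingerprints common to both files.

    Assumption: each file is represented as a set of chunk fingerprints,
    and similarity is the Jaccard index of the two sets.
    """
    union = chunks_a | chunks_b
    if not union:
        return 0.0  # two empty files share nothing measurable
    return len(chunks_a & chunks_b) / len(union)


# Hypothetical fingerprints for two files that share two chunks.
file_a = {"c1", "c2", "c3", "c4"}
file_b = {"c3", "c4", "c5", "c6"}
score = similarity_score(file_a, file_b)  # 2 shared chunks out of 6 distinct
```

A higher score suggests the two files would deduplicate well against each other if destaged to the same container.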

9 Claims

1. A method for rehydrating deduplicated data, by packing the deduplicated data into a plurality of finite-sized containers using a processor device, comprising:
- calculating a similarity score between a plurality of similarly compared files of the deduplicated data, the similarity score indicating an overall deduplication ratio between the similarly compared files of the deduplicated data;
  wherein the similarly compared files are at least 1 Gigabyte (GB) in size, wherein calculating the similarity score further includes calculating an nth percentage threshold of common data intersections shared between the plurality of similarly compared files of the deduplicated data, and wherein a transitive closure between the plurality of similarly compared files of the deduplicated data is determined,
- using the similarity score for grouping the plurality of similarly compared files of the deduplicated data into subsets for destaging each of the subsets from a deduplication system to one of the plurality of finite-sized containers;
  wherein a sum a data space of all of the plurality of the plurality of finite-sized containers is substantially equal to the overall deduplication ratio,
- receiving an indication by a user which of the plurality of similarly compared files are to be grouped into the subsets for destaging each of the subsets from a deduplication system to one of the plurality of finite-sized containers,
- using the transitive closures for assisting with using the similarity score for grouping the plurality of similarly compared files of the deduplicated data into the subsets, and
- calculating a storage metric value by traversing the each of the subsets for determining a required storage space in one of the plurality of finite-sized containers.
- Dependent claims (2, 3):
  - 2. The method of claim 1, further including comparing previously deduplicated data files in a deduplication system with new data files that are to be deduplicated into the deduplication system at ingestion time for creating the plurality of similarly compared files of the deduplicated data.
  - 3. The method of claim 1, further including maintaining in a file similarity index an identify of each of the plurality of similarly compared files and the similarity score calculated for each of the plurality of similarly compared files.
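
The grouping and storage-metric steps of claim 1 can be pictured with a short sketch. It assumes that two files are linked when their share of common chunks meets a threshold, that the transitive closure of those links is computed with a union-find structure, and that the storage metric of a subset is the total size of its distinct chunks. The function names, the threshold semantics, and the data layout (file name mapped to a dictionary of fingerprint and chunk size) are hypothetical and not taken from the patent.

```python
from itertools import combinations


def group_by_transitive_closure(files, threshold):
    """Group files so that any chain of similar pairs ends up in one subset.

    files: dict mapping file name -> dict of {fingerprint: chunk size}.
    threshold: minimum shared fraction of chunks for two files to be linked.
    """
    parent = {name: name for name in files}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(files, 2):
        shared = files[a].keys() & files[b].keys()
        total = files[a].keys() | files[b].keys()
        if total and len(shared) / len(total) >= threshold:
            parent[find(a)] = find(b)  # merge the two subsets

    groups = {}
    for name in files:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


def required_space(subset, files):
    """Traverse a subset and sum the sizes of its distinct chunks."""
    unique = {}
    for name in subset:
        unique.update(files[name])  # shared chunks are counted once
    return sum(unique.values())


# Hypothetical catalogue: A~B and B~C are similar, so A, B, C form one subset.
catalogue = {
    "A": {"c1": 4, "c2": 4, "c3": 4},
    "B": {"c2": 4, "c3": 4, "c4": 4},
    "C": {"c3": 4, "c4": 4, "c5": 4},
    "D": {"x1": 4, "x2": 4},
}
subsets = group_by_transitive_closure(catalogue, threshold=0.5)
sizes = {frozenset(s): required_space(s, catalogue) for s in subsets}
```

In this example A and C share only one chunk, yet both land in the same subset because each is similar to B; that is the effect of taking the transitive closure rather than comparing pairs in isolation.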

4. A system for rehydrating deduplicated data, by packing the deduplicated data into a plurality of finite-sized containers in a computing environment, comprising:
- a processor device, operable in the computing environment, wherein the at least one processor device is adapted for;
  - calculating a similarity score between a plurality of similarly compared files of the deduplicated data, the similarity score indicating an overall deduplication ratio between the similarly compared files of the deduplicated data;
    wherein the similarly compared files are at least 1 Gigabyte (GB) in size, wherein calculating the similarity score further includes calculating an nth percentage threshold of common data intersections shared between the plurality of similarly compared files of the deduplicated data, and wherein a transitive closure between the plurality of similarly compared files of the deduplicated data is determined,
  - using the similarity score for grouping the plurality of similarly compared files of the deduplicated data into subsets for destaging each of the subsets from a deduplication system to one of the plurality of finite-sized containers;
    wherein a sum a data space of all of the plurality of the plurality of finite-sized containers is substantially equal to the overall deduplication ratio,
  - receiving an indication by a user which of the plurality of similarly compared files are to be grouped into the subsets for destaging each of the subsets from a deduplication system to one of the plurality of finite-sized containers,
  - using the transitive closures for assisting with using the similarity score for grouping the plurality of similarly compared files of the deduplicated data into the subsets, and
  - calculating a storage metric value by traversing the each of the subsets for determining a required storage space in one of the plurality of finite-sized containers.
- Dependent claims (5, 6):
  - 5. The system of claim 4, wherein the processor device is further adapted for comparing previously deduplicated data files in a deduplication system with new data files that are to be deduplicated into the deduplication system at ingestion time for creating the plurality of similarly compared files of the deduplicated data.
  - 6. The system of claim 4, wherein the processor device is further adapted for maintaining in a file similarity index an identify of each of the plurality of similarly compared files and the similarity score calculated for each of the plurality of similarly compared files.
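
Claims 5 and 6 describe comparing new files with previously deduplicated files at ingestion time and keeping the results in a file similarity index. One way to picture this is the sketch below, which is an assumption-laden illustration rather than the patented design: the class name `FileSimilarityIndex`, its methods, and the Jaccard-style score are all hypothetical.

```python
class FileSimilarityIndex:
    """Keeps file identities and pairwise similarity scores (hypothetical layout)."""

    def __init__(self):
        self._chunks = {}  # file id -> set of chunk fingerprints
        self._scores = {}  # frozenset({id_a, id_b}) -> similarity score

    def ingest(self, file_id, fingerprints):
        """Compare a new file with every known file, then register it."""
        new = set(fingerprints)
        for other, chunks in self._chunks.items():
            union = new | chunks
            score = len(new & chunks) / len(union) if union else 0.0
            if score > 0:
                # Only pairs with some overlap are recorded in the index.
                self._scores[frozenset((file_id, other))] = score
        self._chunks[file_id] = new

    def score(self, id_a, id_b):
        """Look up the stored score for a pair, or 0.0 if none was recorded."""
        return self._scores.get(frozenset((id_a, id_b)), 0.0)


index = FileSimilarityIndex()
index.ingest("backup-mon", ["c1", "c2", "c3", "c4"])
index.ingest("backup-tue", ["c3", "c4", "c5", "c6"])
index.ingest("photos", ["p1"])
```

Computing scores once at ingestion means a later grouping step can read them from the index instead of rescanning file contents.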

7. A computer program product for rehydrating deduplicated data, by packing the deduplicated data into a plurality of finite-sized containers by a processor device, the computer program product comprising a non-transitory computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising:
- a first executable portion for calculating a similarity score between a plurality of similarly compared files of the deduplicated data, the similarity score indicating an overall deduplication ratio between the similarly compared files of the deduplicated data;
  wherein the similarly compared files are at least 1 Gigabyte (GB) in size, wherein calculating the similarity score further includes calculating an nth percentage threshold of common data intersections shared between the plurality of similarly compared files of the deduplicated data, and wherein a transitive closure between the plurality of similarly compared files of the deduplicated data is determined,
- a second executable portion for using the similarity score for grouping the plurality of similarly compared files of the deduplicated data into subsets for destaging each of the subsets from a deduplication system to one of the plurality of finite-sized containers;
  wherein a sum a data space of all of the plurality of the plurality of finite-sized containers is substantially equal to the overall deduplication ratio,
- a third executable portion for receiving an indication by a user which of the plurality of similarly compared files are to be grouped into the subsets for destaging each of the subsets from a deduplication system to one of the plurality of finite-sized containers,
- a fourth executable portion for using the transitive closures for assisting with using the similarity score for grouping the plurality of similarly compared files of the deduplicated data into the subsets, and
- a fifth executable portion for calculating a storage metric value by traversing the each of the subsets for determining a required storage space in one of the plurality of finite-sized containers.
- Dependent claims (8, 9):
  - 8. The computer program product of claim 7, further including a sixth executable portion for comparing previously deduplicated data files in a deduplication system with new data files that are to be deduplicated into the deduplication system at ingestion time for creating the plurality of similarly compared files of the deduplicated data.
  - 9. The computer program product of claim 7, further including a sixth executable portion for maintaining in a file similarity index an identify of each of the plurality of similarly compared files and the similarity score calculated for each of the plurality of similarly compared files.
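
Once a required storage space is known for each subset, the subsets still have to be assigned to finite-sized containers. The claims do not say how that assignment is made; the sketch below swaps in the well-known first-fit-decreasing bin-packing heuristic purely as an illustration. The function name `pack_subsets`, the subset names, and the sizes are hypothetical.

```python
def pack_subsets(subset_sizes, capacity):
    """Assign each subset to one container using first-fit decreasing.

    subset_sizes: dict mapping subset name -> required storage space.
    capacity: usable space of every container (same unit as the sizes).
    Returns a list of containers, each a list of subset names.
    """
    containers = []  # each entry: {"free": remaining space, "subsets": [...]}
    largest_first = sorted(subset_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in largest_first:
        if size > capacity:
            raise ValueError(f"subset {name!r} does not fit in a single container")
        for container in containers:
            if container["free"] >= size:
                container["subsets"].append(name)
                container["free"] -= size
                break
        else:
            # No existing container has room, so open a new one.
            containers.append({"free": capacity - size, "subsets": [name]})
    return [c["subsets"] for c in containers]


# Hypothetical required space per subset, with containers of 100 units each.
layout = pack_subsets({"s1": 60, "s2": 50, "s3": 40, "s4": 30}, capacity=100)
```

Because each subset goes to exactly one container, files that deduplicate well against each other stay together, and a container can be rehydrated without reading the others.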

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Hirsch, Michael; Krause, Thorsten
Primary Examiner(s): Le, Miranda
Application Number: US13/526,834
Publication Number: US 20130339316A1
Time in Patent Office: 2,051 Days
Field of Search: 707692, 7079992
US Class Current
CPC Class Codes

G06F 3/0608   Saving storage space on sto...

G06F 3/0641   De-duplication techniques

G06F 3/0689   Disk arrays, e.g. RAID, JBOD
