Producing alternative segmentations of data into blocks in a data deduplication system

US 9,922,042 B2
Filed: 07/15/2013
Issued: 03/20/2018
Est. Priority Date: 07/15/2013
Status: Active Grant

Abstract

For producing secondary segmentations of data into blocks and corresponding digests for input data in a data deduplication system using a processor device in a computing environment, digests are calculated for an input data chunk using a primary segmentation into blocks. Secondary segmentations are produced for each of the data mismatches based on reference data, and used to calculate further data matches. The primary segmentation and the corresponding primary digests are stored for the input data chunk.

38 Citations

18 Claims

1. A method for producing a plurality of segmentations of input data into blocks in a data deduplication system using a processor device in a computing environment, comprising:
- calculating digests for an input data chunk using a primary segmentation by using a single linear scan of rolling hash values for calculating both the primary segmentation and similarity search values for the input data chunk, the input data chunk being at least 16 Megabytes (MB) in size;
- obtaining and applying secondary segmentations for each one of a plurality of data mismatches based on reference data;
- storing the primary segmentation and corresponding primary digests for the input data chunk in a sequence corresponding to a placement order of calculated values of the calculated digests associated with the primary digests, the placement order of the calculated values of the calculated digests correlative to an order in which input digest values were calculated such that the primary digests are stored in a linear form independent of a deduplicated form by which data the primary digests describe is stored;
- obtaining the segmentations for each one of the data mismatches by considering input digests included in data matches preceding and following each one of the data mismatches; and
- avoiding storing the secondary segmentations and corresponding secondary digests for the input data chunk.

2. The method of claim 1, further including producing data matches and the data mismatches by searching input digests in a search structure of reference digests.

3. The method of claim 1, further including matching the considered input digests with reference digests to produce alternative digest matches.

4. The method of claim 3, further including defining the alternative digests matches to serve as starting positions for the secondary segmentations which are projected onto the input data.

5. The method of claim 4, further including one of:
- calculating new digest values for the input data based on the secondary segmentations; and
- searching new digest values in a search structure of reference digest values to produce new digests matches.

6. The method of claim 5, further including generating new data matches corresponding to the produced new digests matches.
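
The "single linear scan of rolling hash values" recited in claim 1 can be illustrated with a small sketch. It is not taken from the patent: the polynomial rolling hash, the boundary rule (low bits of the hash equal to zero), the choice of the largest hash values as similarity search values, SHA-1 as the digest, and every constant and name (`scan`, `WINDOW`, `BOUNDARY_MASK`, and so on) are assumptions made for illustration. The claim also requires chunks of at least 16 MB, while the sketch works on any byte string.

```python
import hashlib
import heapq

WINDOW = 16            # rolling-hash window in bytes (assumed)
BASE = 257             # polynomial base (assumed)
MOD = (1 << 61) - 1    # modulus (assumed)
BOUNDARY_MASK = 0x3F   # boundary when the low 6 bits of the hash are zero (assumed)
MIN_BLOCK = 32         # minimum block length in bytes (assumed)
NUM_SIMILARITY = 4     # number of similarity search values kept (assumed)


def scan(data: bytes):
    """Single pass over `data`.

    Returns (blocks, similarity_values). `blocks` is a list of
    (offset, length, digest) tuples in input order; `similarity_values`
    holds the NUM_SIMILARITY largest rolling hash values, largest first.
    """
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    h = 0
    start = 0                          # offset where the current block begins
    blocks = []
    largest = []                       # min-heap of the largest hashes so far
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # take in the new byte
        if i + 1 < WINDOW:
            continue                                 # window not yet full
        # Use 1 of the hash value: candidate similarity search value.
        if len(largest) < NUM_SIMILARITY:
            heapq.heappush(largest, h)
        elif h > largest[0]:
            heapq.heapreplace(largest, h)
        # Use 2 of the same hash value: primary segmentation boundary.
        end = i + 1
        if end - start >= MIN_BLOCK and (h & BOUNDARY_MASK) == 0:
            blocks.append((start, end - start,
                           hashlib.sha1(data[start:end]).hexdigest()))
            start = end
    if start < len(data):              # trailing block without a boundary hit
        blocks.append((start, len(data) - start,
                       hashlib.sha1(data[start:]).hexdigest()))
    return blocks, sorted(largest, reverse=True)
```

Because `blocks` is built in scan order, the digests come out in the same order in which they were calculated, which is the linear placement order the storing step of claim 1 refers to.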

7. A system for producing a plurality of segmentations of input data into blocks in a data deduplication system of a computing environment, the system comprising:
- the data deduplication system;
- a repository operating in the data deduplication system;
- a memory in the data deduplication system;
- a search structure in association with the memory in the data deduplication system; and
- at least one processor device operable in the computing storage environment for controlling the data deduplication system, wherein the at least one processor device:
  - calculates digests for an input data chunk using a primary segmentation by using a single linear scan of rolling hash values for calculating both the primary segmentation and similarity search values for the input data chunk, the input data chunk being at least 16 Megabytes (MB) in size,
  - obtains and applies secondary segmentations for each one of a plurality of data mismatches based on reference data,
  - stores the primary segmentation and corresponding primary digests for the input data chunk in a sequence corresponding to a placement order of calculated values of the calculated digests associated with the primary digests, the placement order of the calculated values of the calculated digests correlative to an order in which input digest values were calculated such that the primary digests are stored in a linear form independent of a deduplicated form by which data the primary digests describe is stored,
  - obtains the segmentations for each one of the data mismatches by considering input digests included in data matches preceding and following each one of the data mismatches, and
  - avoids storing the secondary segmentations and corresponding secondary digests for the input data chunk.

8. The system of claim 7, wherein the at least one processor device produces data matches and the data mismatches by searching input digests in a search structure of reference digests.

9. The system of claim 7, wherein the at least one processor device matches the considered input digests with reference digests to produce alternative digest matches.

10. The system of claim 9, wherein the at least one processor device defines the alternative digests matches to serve as starting positions for the secondary segmentations which are projected onto the input data.

11. The system of claim 10, wherein the at least one processor device performs one of:
- calculating new digest values for the input data based on the secondary segmentations, and
- searching new digest values in a search structure of reference digest values to produce new digests matches.

12. The system of claim 11, wherein the at least one processor device generates new data matches corresponding to the produced new digests matches.
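
Claims 2 and 8 describe producing data matches and data mismatches by searching input digests in a search structure of reference digests, and the independent claims refer to the matches that precede and follow each mismatch. A minimal sketch of that bookkeeping follows. It assumes the search structure behaves like a plain dictionary keyed by digest and treats every digest hit as a match; the names `Run`, `classify`, and `neighbours` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class Run:
    kind: str    # "match" or "mismatch"
    first: int   # index of the first input block in the run
    last: int    # index of the last input block in the run (inclusive)


def classify(input_digests, reference_index):
    """Group consecutive input blocks into match and mismatch runs.

    `reference_index` stands in for the search structure of reference
    digests (assumed here to be a dict keyed by digest).
    """
    runs = []
    for i, d in enumerate(input_digests):
        kind = "match" if d in reference_index else "mismatch"
        if runs and runs[-1].kind == kind:
            runs[-1].last = i          # extend the current run
        else:
            runs.append(Run(kind, i, i))
    return runs


def neighbours(runs, k):
    """Return the runs immediately before and after run k (None at the ends).

    Runs alternate in kind, so the neighbours of a mismatch are matches.
    """
    before = runs[k - 1] if k > 0 else None
    after = runs[k + 1] if k + 1 < len(runs) else None
    return before, after
```

For a mismatch run, the two neighbouring match runs hold the input digests that the claims consider when deriving a secondary segmentation for that mismatch.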

13. A computer program product for producing a plurality of segmentations of input data into blocks in a data deduplication system using a processor device in a computing environment, the computer program product comprising a non-transitory computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising:
- a first executable portion that calculates digests for an input data chunk using a primary segmentation by using a single linear scan of rolling hash values for calculating both the primary segmentation and similarity search values for the input data chunk, the input data chunk being at least 16 Megabytes (MB) in size;
- a second executable portion that obtains and applies secondary segmentations for each one of a plurality of data mismatches based on reference data;
- a third executable portion that stores the primary segmentation and corresponding primary digests for the input data chunk in a sequence corresponding to a placement order of calculated values of the calculated digests associated with the primary digests, the placement order of the calculated values of the calculated digests correlative to an order in which input digest values were calculated such that the primary digests are stored in a linear form independent of a deduplicated form by which data the primary digests describe is stored;
- a fourth executable portion that obtains the segmentations for each one of the data mismatches by considering input digests included in data matches preceding and following each one of the data mismatches; and
- a fifth executable portion that avoids storing the secondary segmentations and corresponding secondary digests for the input data chunk.

14. The computer program product of claim 13, further including a sixth executable portion that produces data matches and the data mismatches by searching input digests in a search structure of reference digests.

15. The computer program product of claim 13, further including a sixth executable portion that matches the considered input digests with reference digests to produce alternative digest matches.

16. The computer program product of claim 15, further including a seventh executable portion that defines the alternative digests matches to serve as starting positions for the secondary segmentations which are projected onto the input data.

17. The computer program product of claim 16, further including an eighth executable portion that performs one of:
- calculating new digest values for the input data based on the secondary segmentations, and
- searching new digest values in a search structure of reference digest values to produce new digests matches.

18. The computer program product of claim 17, further including a ninth executable portion that generates new data matches corresponding to the produced new digests matches.
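
Claims 3 to 6 (and their counterparts 9 to 12 and 15 to 18) describe using alternative digest matches as starting positions for secondary segmentations that are projected onto the input data, calculating new digests from them, and searching those digests to find new matches. One possible reading is sketched below, under stated assumptions: the reference block lengths that follow an anchoring match are laid over the input mismatch, and each projected block is hashed and looked up. The function `project_forward`, its parameters, the dictionary used as search structure, and SHA-1 as digest are all hypothetical. Only the forward direction from the preceding match is shown; a backward projection from the following match would be symmetric.

```python
import hashlib


def digest(block: bytes) -> str:
    """Digest of one block (SHA-1 is an assumption of this sketch)."""
    return hashlib.sha1(block).hexdigest()


def project_forward(data, mis_start, mis_end, ref_lengths, ref_index, anchor):
    """Project reference block lengths onto the mismatch [mis_start, mis_end).

    ref_lengths[i] is the length of reference block i, ref_index maps a
    reference digest to its block number, and `anchor` is the reference
    block matched by the input block that ends at mis_start.
    Returns new matches as (input_offset, length, reference_block).
    """
    matches = []
    pos = mis_start
    for ref_block in range(anchor + 1, len(ref_lengths)):
        length = ref_lengths[ref_block]
        if pos + length > mis_end:
            break                      # projected block would leave the mismatch
        # Secondary digest: computed, looked up, and then discarded.
        found = ref_index.get(digest(data[pos:pos + length]))
        if found is not None:
            matches.append((pos, length, found))
        pos += length
    return matches
```

The secondary digests exist only as temporaries inside the loop and only the resulting matches are returned, which mirrors the claimed step of avoiding storage of the secondary segmentations and their digests. As an example, with reference blocks `aaaa`, `bbbbbb`, `cccc`, `dddd` and input `aaaabbbbbbcXccdddd`, projecting from the match on `aaaa` over the byte range 4 to 14 recovers the `bbbbbb` block and leaves `cXcc` unmatched.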

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Aronovich, Lior
Primary Examiner(s): Bertram, Ryan
Assistant Examiner(s): TA, TRANG KHANH
Application Number: US13/941,982
Publication Number: US 20150019508A1
Time in Patent Office: 1,709 Days
CPC Class Codes: G06F 16/1752 (based on file chunks)
