Reducing digest storage consumption in a data deduplication system

US 9,678,975 B2
Filed: 03/15/2013
Issued: 06/13/2017
Est. Priority Date: 03/15/2013
Status: Active Grant

Abstract

For reducing digests storage consumption in a data deduplication system using a processor device in a computing environment, digest values are calculated for input data. The digest values are used to locate matches with data stored in a repository. The digest values are stored in the repository. The digest values of the data stored in the repository that is determined to be redundant with the input data are removed.
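The lifecycle the abstract describes (calculate, match, store, remove) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the fixed 4-byte digest blocks, SHA-1 as the digest function, and the hypothetical `Repository` class with its `ingest` method.

```python
import hashlib

BLOCK = 4  # hypothetical, tiny digest-block size chosen for readability


def digests(data: bytes):
    """Digest each fixed-size block of data, in order of occurrence."""
    return [hashlib.sha1(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]


class Repository:
    """Toy digest store: one ordered digest list per ingested stream."""

    def __init__(self):
        self.digest_log = {}  # stream id -> digests in sequence of occurrence
        self.next_id = 0

    def ingest(self, data: bytes) -> int:
        """Store digests of the input and drop digests of older data
        that the input makes redundant. Returns the number dropped."""
        new = digests(data)
        new_set = set(new)
        removed = 0
        # Match input digests against stored digests, then remove the
        # stored digests that describe redundant repository data.
        for stream_id, old in list(self.digest_log.items()):
            keep = [d for d in old if d not in new_set]
            removed += len(old) - len(keep)
            self.digest_log[stream_id] = keep
        # Store the input's digests linearly, in order of occurrence.
        self.digest_log[self.next_id] = new
        self.next_id += 1
        return removed

    def digest_count(self) -> int:
        return sum(len(v) for v in self.digest_log.values())


repo = Repository()
repo.ingest(b"aaaabbbbcccc")
dropped = repo.ingest(b"aaaabbbbdddd")
print(dropped, repo.digest_count())
```

In this toy run the second stream shares two blocks with the first, so two older digests are dropped and the store holds four digests instead of six.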

40 Citations

18 Claims

1. A method for reducing digests storage consumption in a data deduplication system using a processor device in a computing environment, comprising:
- calculating digest values for input data by using a single linear scan of rolling hash values for producing both similarity search values and boundaries of digest blocks;
- using the digest values to locate matches with data stored in a repository;
- storing the digest values in the repository;
- removing the digest values of the data stored in the repository that is determined to be redundant with the input data;
- storing the digest values in the repository linearly in a sequence of occurrence of the digest values in the data; and
- storing the digest values in the repository in a form that is independent of the form by which the data that the digest values describe is stored.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The method of claim 1, further including determining the data stored in the repository, which is used to produce matches with the input data, to be redundant with the input data.
  - 3. The method of claim 1, further including matching digest values of input data with digest values stored in the repository to locate matches with the data stored in the repository.
  - 4. The method of claim 1, further including partitioning the input data into data chunks and grouping the data chunks into chunk sets.
  - 5. The method of claim 4, further including storing the digest values in sets corresponding to the chunk sets, where the digest values sets can be efficiently accessed and removed.
  - 6. The method of claim 1, further including removing digests of redundant repository data to make the digests storage consumption correlative to a factored size of the data stored in the repository rather than to a total data size in the repository.
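The first element of claim 1, a single linear scan of rolling hash values that yields both similarity search values and digest block boundaries, might look roughly like the Python sketch below. The claims do not give these details, so all of the following are assumptions: the polynomial rolling hash, the window size, the rule that a boundary falls where the low bits of the hash are zero, the choice of the maximum rolling hash per fixed-size chunk as the similarity value, and SHA-1 as the digest function. The function name `scan` is hypothetical.

```python
import hashlib

WINDOW = 4            # rolling-hash window in bytes (illustrative)
BASE = 257            # polynomial base (illustrative)
MOD = (1 << 31) - 1   # modulus for the rolling hash
BOUNDARY_MASK = 0x7   # assumed rule: boundary where the low 3 hash bits are 0
CHUNK = 32            # assumed size of a similarity unit, in bytes


def scan(data: bytes):
    """One pass over data.

    Returns (blocks, similarity) where blocks is a list of
    (start, end, sha1_hex) tuples and similarity holds one value
    per CHUNK-sized unit of the input.
    """
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    block_start = 0
    blocks = []
    similarity = []
    chunk_max = -1
    for i, b in enumerate(data):
        if i >= WINDOW:
            # Slide the window: drop the oldest byte's contribution.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        if i >= WINDOW - 1:  # the window is full
            # The same hash value feeds both outputs.
            chunk_max = max(chunk_max, h)
            if h & BOUNDARY_MASK == 0:
                block = data[block_start:i + 1]
                blocks.append((block_start, i + 1,
                               hashlib.sha1(block).hexdigest()))
                block_start = i + 1
        if (i + 1) % CHUNK == 0:
            similarity.append(chunk_max)
            chunk_max = -1
    if block_start < len(data):  # trailing block without a boundary
        tail = data[block_start:]
        blocks.append((block_start, len(data),
                       hashlib.sha1(tail).hexdigest()))
    if len(data) % CHUNK != 0 and chunk_max >= 0:
        similarity.append(chunk_max)  # trailing partial chunk
    return blocks, similarity
```

Because each rolling hash value is computed once and used for both the boundary test and the per-chunk maximum, the input is read a single time. The returned blocks are contiguous and in order of occurrence, which matches the linear storage order the claim calls for.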

7. A system for reducing digests storage consumption in a data deduplication system of a computing environment, the system comprising:
- the data deduplication system;
- a repository operating in the data deduplication system;
- at least one processor device operable in the computing storage environment for controlling the data deduplication system, wherein the at least one processor device:
  - calculates digest values for input data by using a single linear scan of rolling hash values for producing both similarity search values and boundaries of digest blocks,
  - uses the digest values to locate matches with data stored in a repository,
  - stores the digest values in the repository,
  - removes the digest values of the data stored in the repository that is determined to be redundant with the input data,
  - stores the digest values in the repository linearly in a sequence of occurrence of the digest values in the data, and
  - stores the digest values in the repository in a form that is independent of the form by which the data that the digest values describe is stored.
- Dependent claims (8, 9, 10, 11, 12):
  - 8. The system of claim 7, wherein the at least one processor device determines the data stored in the repository, which is used to produce matches with the input data, to be redundant with the input data.
  - 9. The system of claim 7, wherein the at least one processor device matches digest values of input data with digest values stored in the repository to locate matches with the data stored in the repository.
  - 10. The system of claim 7, wherein the at least one processor device partitions the input data into data chunks and grouping the data chunks into chunk sets.
  - 11. The system of claim 10, wherein the at least one processor device stores the digest values in sets corresponding to the chunk sets, where the digest values sets can be efficiently accessed and removed.
  - 12. The system of claim 7, wherein the at least one processor device removes digests of redundant repository data to make the digests storage consumption correlative to a factored size of the data stored in the repository rather than to a total data size in the repository.
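Claims 10 and 11 group data chunks into chunk sets and keep digest values in matching sets that can be accessed and removed efficiently. One way to picture this is the Python sketch below. It is only an illustration under stated assumptions: the `DigestStore` class, its method names, and the fixed `CHUNKS_PER_SET` grouping rule are hypothetical and do not come from the patent.

```python
CHUNKS_PER_SET = 2  # hypothetical: how many data chunks form one chunk set


class DigestStore:
    """Toy store that keeps digests in sets mirroring chunk sets.

    It holds only digest values keyed by set id, so its layout does not
    depend on how the described data itself is stored.
    """

    def __init__(self):
        self.sets = {}  # chunk-set id -> digests in order of occurrence

    def add_chunk(self, chunk_index, chunk_digests):
        """Append one chunk's digests to the set of its chunk set."""
        set_id = chunk_index // CHUNKS_PER_SET
        self.sets.setdefault(set_id, []).extend(chunk_digests)
        return set_id

    def remove_set(self, set_id):
        """Drop a whole digest set in one operation; returns digests removed."""
        return len(self.sets.pop(set_id, []))

    def in_order(self):
        """All remaining digests, in sequence of occurrence."""
        return [d for sid in sorted(self.sets) for d in self.sets[sid]]


store = DigestStore()
for index, chunk_digests in enumerate([["d0", "d1"], ["d2"], ["d3"], ["d4", "d5"]]):
    store.add_chunk(index, chunk_digests)
store.remove_set(0)      # chunk set 0 was found redundant with new input
print(store.in_order())
```

Keying digests by chunk set means a redundant region of the repository can have its digests dropped with a single dictionary removal rather than a per-digest search.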

13. A computer program product for reducing digests storage consumption in a data deduplication system using a processor device in a computing environment, the computer program product comprising a computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising:
- a first executable portion that calculates digest values for input data by using a single linear scan of rolling hash values for producing both similarity search values and boundaries of digest blocks;
- a second executable portion that uses the digest values to locate matches with data stored in a repository;
- a third executable portion that stores the digest values in the repository;
- a fourth executable portion that removes the digest values of the data stored in the repository that is determined to be redundant with the input data;
- a fifth executable portion that stores the digest values in the repository linearly in a sequence of occurrence of the digest values in the data; and
- a sixth executable portion that stores the digest values in the repository in a form that is independent of the form by which the data that the digest values describe is stored.
- Dependent claims (14, 15, 16, 17, 18):
  - 14. The computer program product of claim 13, further including a seventh executable portion that determines the data stored in the repository, which is used to produce matches with the input data, to be redundant with the input data.
  - 15. The computer program product of claim 13, further including a seventh executable portion that matches digest values of input data with digest values stored in the repository to locate matches with the data stored in the repository.
  - 16. The computer program product of claim 13, further including a seventh executable portion that partitions the input data into data chunks and grouping the data chunks into chunk sets.
  - 17. The computer program product of claim 16, further including an eighth executable portion that stores the digest values in sets corresponding to the chunk sets, where the digest values sets can be efficiently accessed and removed.
  - 18. The computer program product of claim 13, further including a seventh executable portion that removes digests of redundant repository data to make the digests storage consumption correlative to a factored size of the data stored in the repository rather than to a total data size in the repository.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Aronovich, Lior
Primary Examiner(s): Nguyen, Cam-Linh
Application Number: US13/840,314
Publication Number: US 20140279953A1
Time in Patent Office: 1,551 Days
Field of Search: 707692
US Class Current:
CPC Class Codes:
- G06F 16/137 Hash-based content-based in...
- G06F 16/1752 based on file chunks
