OPTIMIZING HASH TABLE STRUCTURE FOR DIGEST MATCHING IN A DATA DEDUPLICATION SYSTEM

US 20150019507A1
Filed: 07/15/2013
Published: 01/15/2015
Est. Priority Date: 07/15/2013
Status: Active Grant

Abstract

Repository data intervals are determined as similar to an input data interval. Repository digests corresponding to the similar repository data interval are loaded into a sequential representation and into a search structure. Matches of input digests and the repository digests are found using the search structure. Each one of the found matches of the input digests and repository digests are extended using the sequential representation. Data matches are determined between the input data and the repository data using extended matches of digests. A compact index pointing to a position in the sequential representation of digests is incorporated into entries of the search structure.
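
The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `DigestEntry`, `build_structures`, `extend_match`, and `to_data_match` are hypothetical, a Python `dict` stands in for the search structure, and digests are modelled as short strings.

```python
from typing import NamedTuple


class DigestEntry(NamedTuple):
    """Hypothetical digest entry: digest value plus segment position and size."""
    value: str
    position: int
    size: int


def build_structures(repo_digests):
    """Load repository digests into a sequential list and a search structure.

    The search structure maps a digest value to indexes into the list,
    rather than holding copies of the entries themselves.
    """
    sequential = list(repo_digests)
    search = {}
    for index, entry in enumerate(sequential):
        search.setdefault(entry.value, []).append(index)
    return sequential, search


def extend_match(sequential, inputs, repo_idx, input_idx):
    """Extend one digest match backward and forward along both sequences.

    Returns (repo_start, input_start, length), all counted in digests.
    """
    def same(r, i):
        return (sequential[r].value == inputs[i].value
                and sequential[r].size == inputs[i].size)

    start_r, start_i = repo_idx, input_idx
    while start_r > 0 and start_i > 0 and same(start_r - 1, start_i - 1):
        start_r -= 1
        start_i -= 1
    end_r, end_i = repo_idx, input_idx
    while (end_r + 1 < len(sequential) and end_i + 1 < len(inputs)
           and same(end_r + 1, end_i + 1)):
        end_r += 1
        end_i += 1
    return start_r, start_i, end_r - start_r + 1


def to_data_match(sequential, inputs, match):
    """Turn an extended digest match into (repo_offset, input_offset, byte_length)."""
    repo_start, input_start, length = match
    byte_len = sum(e.size for e in sequential[repo_start:repo_start + length])
    return sequential[repo_start].position, inputs[input_start].position, byte_len


if __name__ == "__main__":
    repo = [DigestEntry(v, 4 * k, 4) for k, v in enumerate("abcde")]
    inputs = [DigestEntry(v, 4 * k, 4) for k, v in enumerate("xbcdy")]
    sequential, search = build_structures(repo)
    anchor = search["c"][0]                      # index found via the search structure
    match = extend_match(sequential, inputs, anchor, 2)
    print(match)                                 # (1, 1, 3)
    print(to_data_match(sequential, inputs, match))  # (4, 4, 12)
```

The search structure is only used to find an anchor; the extension step walks the sequential list, which is why the entries need to be kept in their order of occurrence.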

24 Claims

1. A method for optimizing a hash table structure for digest matching in a data deduplication system using a processor device in a computing environment, comprising:
- determining a repository data interval as similar to an input data interval;
  
  loading a plurality of repository digests corresponding to the similar repository data interval into a sequential representation and into a search structure; and
  
  incorporating into entries of the search structure a compact index pointing to a position in the sequential representation of a plurality of digests.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1, further including applying a similarity search process for determining the repository data interval as similar to the input data interval.
  - 3. The method of claim 1, further including defining a digest entry to include a digest value and a segment position in data and a segment size.
  - 4. The method of claim 3, further including performing one of:
    - storing digest entries of the plurality of repository digests of the similar repository data interval in the sequential representation, and
    - avoiding a storing of the digest entries of the plurality of repository digests in the search structure.
  - 5. The method of claim 3, further including performing each of:
    - searching for the plurality of repository digests matching the input digest using the search structure,
    - obtaining from the search structure a plurality of indexes of potential digest matches, and
    - checking that a repository digest entry located at a referenced position in the sequential representation comprises of a digest value and a digest segment size, which match the digest value and the digest segment size of an input digest, for each one of the obtained plurality of indexes.
  - 6. The method of claim 1, further including defining the search structure to be a hash table.
  - 7. The method of claim 3, further including defining the sequential representation storing the plurality of repository digests as a sequential array containing a plurality of digest entries in a sequence of occurrence in the data.
  - 8. The method of claim 1, further including specifying an interval of data by a starting position and a size.
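
Claims 3 through 7 can be read together as one layout: full digest entries live only in a sequential array, and the hash table holds small integers that index into it. The sketch below is one possible illustration under assumptions that do not come from the claims: the class name `CompactDigestTable`, linear probing, the table sizing, the unsigned-int index width, and the sentinel for empty slots are all hypothetical choices.

```python
from array import array
from typing import NamedTuple

# Assumed sentinel marking an unused slot: the largest value an "I" item can hold.
EMPTY = (1 << (8 * array("I").itemsize)) - 1


class DigestEntry(NamedTuple):
    value: bytes   # digest value
    position: int  # segment position in the data
    size: int      # segment size


class CompactDigestTable:
    """Sequential array of entries plus a hash table of compact indexes."""

    def __init__(self, entries):
        # Sequential representation: entries in their order of occurrence.
        self.entries = list(entries)
        # Keep the load factor at or below 0.5 so probing always terminates.
        n_slots = max(8, 2 * len(self.entries))
        # "I" is an unsigned int (4 bytes on most platforms): the compact index.
        self.slots = array("I", [EMPTY] * n_slots)
        for index, entry in enumerate(self.entries):
            slot = hash(entry.value) % n_slots
            while self.slots[slot] != EMPTY:
                slot = (slot + 1) % n_slots  # linear probing
            self.slots[slot] = index

    def lookup(self, value, size):
        """Return indexes whose entry matches both digest value and segment size."""
        n_slots = len(self.slots)
        slot = hash(value) % n_slots
        found = []
        while self.slots[slot] != EMPTY:
            index = self.slots[slot]
            entry = self.entries[index]  # verify against the sequential array
            if entry.value == value and entry.size == size:
                found.append(index)
            slot = (slot + 1) % n_slots
        return found


if __name__ == "__main__":
    table = CompactDigestTable([
        DigestEntry(b"aa", 0, 4),
        DigestEntry(b"bb", 4, 4),
        DigestEntry(b"aa", 8, 8),
    ])
    print(table.lookup(b"aa", 4))  # [0]
    print(table.lookup(b"aa", 8))  # [2]
    print(table.lookup(b"zz", 4))  # []
```

Because a slot holds only an index, every candidate returned by the probe has to be checked against the entry at that position, which corresponds to the verification step in claim 5.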

9. A system for optimizing a hash table structure for digest matching in a data deduplication system of a computing environment, the system comprising:
- the data deduplication system;
  
  the dual data structures in the data deduplication system, wherein the dual data structures include a search structure and a sequential buffer;
  
  a hash table included in the data deduplication system;
  
  a repository operating in the data deduplication system; and
  
  at least one processor device operable in the computing storage environment for controlling the data deduplication system, wherein the at least one processor device:
  
  determines a repository data interval as similar to an input data interval,
  
  loads a plurality of repository digests corresponding to the similar repository data interval into a sequential representation and into the search structure, and
  
  incorporates into entries of the search structure a compact index pointing to a position in the sequential representation of a plurality of digests.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
  - 10. The system of claim 9, wherein the at least one processor device applies a similarity search process for determining the repository data interval as similar to the input data interval.
  - 11. The system of claim 9, wherein the at least one processor defines a digest entry to include a digest value and a segment position in data and a segment size.
  - 12. The system of claim 11, wherein the at least one processor performs one of:
    - storing digest entries of the plurality of repository digests of the similar repository data interval in the sequential representation, and
    - avoiding a storing of the digest entries of the plurality of repository digests in the search structure.
  - 13. The system of claim 11, wherein the at least one processor performs each of:
    - searching for the plurality of repository digests matching the input digest using the search structure,
    - obtaining from the search structure a plurality of indexes of potential digest matches, and
    - checking that a repository digest entry located at a referenced position in the sequential representation comprises of a digest value and a digest segment size, which match the digest value and the digest segment size of an input digest, for each one of the obtained plurality of indexes.
  - 14. The system of claim 9, wherein the at least one processor device defines the search structure to be the hash table.
  - 15. The system of claim 11, wherein the at least one processor defines the sequential representation storing the plurality of repository digests as a sequential array containing a plurality of digest entries in a sequence of occurrence in the data.
  - 16. The system of claim 9, wherein the at least one processor device specifies an interval of data by a starting position and a size.

17. A computer program product for optimizing a hash table structure for digest matching in a data deduplication system using a processor device in a computing environment, the computer program product comprising a computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising:
- a first executable portion that determines a repository data interval as similar to an input data interval;
  
  a second executable portion that loads a plurality of repository digests corresponding to the similar repository data interval into a sequential representation and into a search structure; and
  
  a third executable portion that incorporates into entries of the search structure a compact index pointing to a position in the sequential representation of a plurality of digests.
- Dependent claims (18, 19, 20, 21, 22, 23, 24):
  - 18. The computer program product of claim 17, further including a fourth executable portion that applies a similarity search process for determining the repository data interval as similar to the input data interval.
  - 19. The computer program product of claim 17, further including a fourth executable portion that defines a digest entry to include a digest value and a segment position in data and a segment size.
  - 20. The computer program product of claim 19, further including a fifth executable portion that performs one of:
    - storing digest entries of the plurality of repository digests of the similar repository data interval in the sequential representation, and
    - avoiding a storing of the digest entries of the plurality of repository digests in the search structure.
  - 21. The computer program product of claim 19, further including a fifth executable portion that performs each of:
    - searching for the plurality of repository digests matching the input digest using the search structure,
    - obtaining from the search structure a plurality of indexes of potential digest matches, and
    - checking that a repository digest entry located at a referenced position in the sequential representation comprises of a digest value and a digest segment size, which match the digest value and the digest segment size of an input digest, for each one of the obtained plurality of indexes.
  - 22. The computer program product of claim 17, further including a fourth executable portion that defines the search structure to be the hash table.
  - 23. The computer program product of claim 19, further including a fifth executable portion that defines the sequential representation storing the plurality of repository digests as a sequential array containing a plurality of digest entries in a sequence of occurrence in the data.
  - 24. The computer program product of claim 17, further including a fourth executable portion that specifies an interval of data by a starting position and a size.

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
ARONOVICH, Lior

Granted Patent

US 10,339,109 B2
Field of Search
US Class Current

707/692
CPC Class Codes

G06F 16/1748 De-duplication implemented ...
