Adaptive similarity search resolution in a data deduplication system

US 10,073,853 B2
Filed: 07/17/2013
Issued: 09/11/2018
Est. Priority Date: 07/17/2013
Status: Active Grant

Abstract

For adaptive similarity search resolution in a data deduplication system using a processor device in a computing environment, input data is partitioned into data chunks. Input similarity elements are calculated for an input chunk. The input similarity elements are used to find similar data in a repository of data using a similarity search structure. A resolution level is calculated for storing the input similarity elements. The input similarity elements are stored in the calculated resolution level in the similarity search structure.

43 Citations

18 Claims

1. A method for adaptive similarity search resolution in a data deduplication system using a processor device in a computing environment, comprising:
- partitioning input data into input data chunks, the input data chunks each being at least 4 Megabytes (MB) in size;
  
  calculating input similarity elements for an input chunk;
  
  using the input similarity elements to find similar data in a repository of data using a similarity search structure;
  
  calculating a resolution level for storing the input similarity elements, the resolution level comprising a number of the input similarity elements in relation to a size of the input chunk;
  
  storing the input similarity elements in the calculated resolution level in the similarity search structure;
  
  deduplicating the input chunk with the found similar data in the repository of data using the input similarity units in the calculated resolution level;
  
  calculating the resolution level for storing the input similarity elements based on calculated sets of similarity element matches and on a calculated deduplication ratio, the deduplication ratio defined as a total size of the input data covered by matches with repository data out of the total size of the input data; and
  
  decreasing the resolution level of the stored input similarity elements if an aggregated deduplication ratio is not lower than a predefined threshold and an average size of the calculated sets of similarity element matches is not lower than two and a current resolution level is higher than a lowest resolution level.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The method of claim 1, further including defining the resolution level for storing the input similarity elements to be between a highest resolution level and a lowest resolution level.
  - 3. The method of claim 1, further including performing one of:
    - calculating an average size of the calculated sets of similarity element matches, and using the average size to determine the resolution level for storing the input similarity elements.
  - 4. The method of claim 3, further including defining a set of similarity element matches to include similarity element matches with a similar angle, where an angle of a similarity element match is the difference between a position of the similarity element match in the repository data and a position of the similarity element match in the input data, and where two angles are considered as similar if a difference of the two angles does not exceed a predefined threshold.
  - 5. The method of claim 1 further including performing one of:
    - calculating an aggregated deduplication ratio as a total size of portions of the input data chunks covered by data matches out of a total size of the input data chunks, and using the aggregated deduplication ratio to determine the resolution level for storing the input similarity elements.
  - 6. The method of claim 1, further including increasing a storage resolution level of similarity elements if an aggregated deduplication ratio is lower than a predefined threshold and a current resolution level is lower than a highest resolution level.
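
The decrease condition of claim 1 and the increase condition of claim 6 can be read together as a single update rule. The following is a minimal sketch of that reading, not the patented implementation: the `ResolutionPolicy` class, the integer encoding of levels, the default threshold value, and the step size of one level per update are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ResolutionPolicy:
    """Hypothetical container for the bounds and threshold named in the claims."""
    lowest_level: int = 0          # assumed encoding: smaller number = lower resolution
    highest_level: int = 4         # assumed bound; claim 2 only says a highest level exists
    dedup_threshold: float = 0.5   # assumed value for the "predefined threshold"


def next_resolution_level(current, aggregated_dedup_ratio, avg_match_set_size, policy):
    """Return the resolution level to use for storing similarity elements."""
    # Claim 1: decrease when the aggregated deduplication ratio is not lower than
    # the threshold, the average match-set size is not lower than two, and the
    # current level is above the lowest level.
    if (aggregated_dedup_ratio >= policy.dedup_threshold
            and avg_match_set_size >= 2
            and current > policy.lowest_level):
        return current - 1  # one-level step is an assumption
    # Claim 6: increase when the aggregated deduplication ratio is lower than
    # the threshold and the current level is below the highest level.
    if (aggregated_dedup_ratio < policy.dedup_threshold
            and current < policy.highest_level):
        return current + 1  # one-level step is an assumption
    return current


policy = ResolutionPolicy()
print(next_resolution_level(2, 0.8, 3.0, policy))
print(next_resolution_level(2, 0.2, 3.0, policy))
```

Under these assumptions, a level stays unchanged when deduplication is good but match sets are small (average size below two), since neither condition applies.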

7. A system for adaptive similarity search resolution in a data deduplication system of a computing environment, the system comprising:
- the data deduplication system;
  
  a repository operating in the data deduplication system;
  
  a memory in the data deduplication system;
  
  a similarity search structure in association with the memory in the data deduplication system; and
  
  at least one processor device operable in the computing storage environment for controlling the data deduplication system, wherein the at least one processor device:
  
  partitions input data into input data chunks, the input data chunks each being at least 4 Megabytes (MB) in size,
  
  calculates input similarity elements for an input chunk;
  
  uses the input similarity elements to find similar data in a repository of data using the similarity search structure,
  
  calculates a resolution level for storing the input similarity elements, the resolution level comprising a number of the input similarity elements in relation to a size of the input chunk,
  
  stores the input similarity elements in the calculated resolution level in the similarity search structure,
  
  deduplicates the input chunk with the found similar data in the repository of data using the input similarity units in the calculated resolution level;
  
  calculates the resolution level for storing the input similarity elements based on calculated sets of similarity element matches and on a calculated deduplication ratio, the deduplication ratio defined as a total size of the input data covered by matches with repository data out of the total size of the input data; and
  
  decreases the resolution level of the stored input similarity elements if an aggregated deduplication ratio is not lower than a predefined threshold and an average size of the calculated sets of similarity element matches is not lower than two and a current resolution level is higher than a lowest resolution level.
- Dependent claims (8, 9, 10, 11, 12):
  - 8. The system of claim 7, wherein the at least one processor device defines the resolution level for storing the input similarity elements to be between a highest resolution level and a lowest resolution level.
  - 9. The system of claim 7, wherein the at least one processor device performs one of calculating an average size of the calculated sets of similarity element matches, and using the average size to determine the resolution level for storing the input similarity elements.
  - 10. The system of claim 9, wherein the at least one processor device defines a set of similarity element matches to include similarity element matches with a similar angle, where an angle of a similarity element match is the difference between a position of the similarity element match in the repository data and a position of the similarity element match in the input data, and where two angles are considered as similar if a difference of the two angles does not exceed a predefined threshold.
  - 11. The system of claim 7, wherein the at least one processor device performs one of:
    - calculating an aggregated deduplication ratio as a total size of portions of the input data chunks covered by data matches out of a total size of the input data chunks, and using the aggregated deduplication ratio to determine the resolution level for storing the input similarity elements.
  - 12. The system of claim 7, wherein the at least one processor device increases a storage resolution level of similarity elements if an aggregated deduplication ratio is lower than a predefined threshold and a current resolution level is lower than a highest resolution level.
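
Claims 9 and 10 define the angle of a similarity element match as its repository position minus its input position, and treat two angles as similar when their difference does not exceed a threshold. The sketch below shows one possible way to form match sets from that definition and to compute their average size. The grouping strategy (sorting by angle and comparing each angle against the smallest angle in the current set) is an assumption, as are the function names and the `(input_pos, repo_pos)` tuple layout; the claims do not prescribe a particular grouping procedure.

```python
def group_matches_by_angle(matches, angle_threshold):
    """Group (input_pos, repo_pos) matches into sets with similar angles.

    Angle = repo_pos - input_pos, as defined in claim 10. Within each returned
    set, any two angles differ by at most angle_threshold.
    """
    by_angle = sorted(matches, key=lambda m: m[1] - m[0])
    groups = []
    for input_pos, repo_pos in by_angle:
        angle = repo_pos - input_pos
        # "anchor" is the smallest angle in the current set (assumed strategy).
        if groups and angle - groups[-1]["anchor"] <= angle_threshold:
            groups[-1]["members"].append((input_pos, repo_pos))
        else:
            groups.append({"anchor": angle, "members": [(input_pos, repo_pos)]})
    return [g["members"] for g in groups]


def average_set_size(groups):
    """Average number of matches per set; 0.0 when there are no sets."""
    return sum(len(g) for g in groups) / len(groups) if groups else 0.0


# Hypothetical positions in bytes: three matches share nearly the same angle,
# one match points to an unrelated repository location.
matches = [(0, 1000), (100, 1100), (200, 1204), (300, 9000)]
groups = group_matches_by_angle(matches, angle_threshold=8)
print([len(g) for g in groups], average_set_size(groups))
```

The intuition behind this reading is that matches with a shared angle point to the same contiguous region of repository data, so a larger average set size suggests that fewer stored similarity elements would still locate that region.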

13. A computer program product for adaptive similarity search resolution in a data deduplication system using a processor device in a computing environment, the computer program product comprising a non-transitory computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising:
- an executable portion that partitions input data into input data chunks, the input data chunks each being at least 4 Megabytes (MB) in size;
  
  an executable portion that calculates input similarity elements for an input chunk;
  
  an executable portion that uses the input similarity elements to find similar data in a repository of data using a similarity search structure;
  
  an executable portion that calculates a resolution level for storing the input similarity elements, the resolution level comprising a number of the input similarity elements in relation to a size of the input chunk;
  
  an executable portion that stores the input similarity elements in the calculated resolution level in the similarity search structure;
  
  an executable portion that deduplicates the input chunk with the found similar data in the repository of data using the input similarity units in the calculated resolution level;
  
  an executable portion that calculates the resolution level for storing the input similarity elements based on calculated sets of similarity element matches and on a calculated deduplication ratio, the deduplication ratio defined as a total size of the input data covered by matches with repository data out of the total size of the input data; and
  
  an executable portion that decreases the resolution level of the stored input similarity elements if an aggregated deduplication ratio is not lower than a predefined threshold and an average size of the calculated sets of similarity element matches is not lower than two and a current resolution level is higher than a lowest resolution level.
- Dependent claims (14, 15, 16, 17, 18):
  - 14. The computer program product of claim 13, further including an executable portion that defines the resolution level for storing the input similarity elements to be between a highest resolution level and a lowest resolution level.
  - 15. The computer program product of claim 13, further including an executable portion that performs one of:
    - calculating an average size of the calculated sets of similarity element matches, and using the average size to determine the resolution level for storing the input similarity elements.
  - 16. The computer program product of claim 15, further including an executable portion that defines a set of similarity element matches to include similarity element matches with a similar angle, where an angle of a similarity element match is the difference between a position of the similarity element match in the repository data and a position of the similarity element match in the input data, and where two angles are considered as similar if a difference of the two angles does not exceed a predefined threshold.
  - 17. The computer program product of claim 13, further including an executable portion that performs one of calculating an aggregated deduplication ratio as a total size of portions of the input data chunks covered by data matches out of a total size of the input data chunks, and using the aggregated deduplication ratio to determine the resolution level for storing the input similarity elements.
  - 18. The computer program product of claim 13, further including an executable portion that increases a storage resolution level of similarity elements if an aggregated deduplication ratio is lower than a predefined threshold and a current resolution level is lower than a highest resolution level.
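
Claim 17 defines the aggregated deduplication ratio as the total size of the portions of the input data chunks covered by data matches, out of the total size of the input data chunks. A minimal sketch of that arithmetic follows. Representing matches as half-open byte intervals within a chunk, and merging overlapping intervals so that each byte is counted once, are assumptions made for illustration; the function names are hypothetical.

```python
def covered_bytes(intervals):
    """Total length of the union of half-open [start, end) byte intervals."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged interval, if any, and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching interval: extend the merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def aggregated_dedup_ratio(chunks):
    """chunks: list of (chunk_size_in_bytes, match_intervals) pairs."""
    total_size = sum(size for size, _ in chunks)
    if total_size == 0:
        return 0.0
    total_covered = sum(covered_bytes(intervals) for _, intervals in chunks)
    return total_covered / total_size


MB = 1 << 20
# Two 4 MB chunks: the first has 2 MB covered by overlapping matches,
# the second has no matches at all.
chunks = [(4 * MB, [(0, MB), (MB // 2, 2 * MB)]), (4 * MB, [])]
print(aggregated_dedup_ratio(chunks))  # 0.25
```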

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Aronovich, Lior
Primary Examiner(s): Featherstone, Mark D
Assistant Examiner(s): Gmahl, Navneet
Application Number: US13/941,800
Publication Number: US 20150026135A1
Time in Patent Office: 1,882 Days
CPC Class Codes

G06F 16/137   Hash-based content-based in...

G06F 16/152   using file content signatur...

G06F 16/1752   based on file chunks
