Adaptive similarity search resolution in a data deduplication system
Abstract
For adaptive similarity search resolution in a data deduplication system using a processor device in a computing environment, input data is partitioned into data chunks. Input similarity elements are calculated for an input chunk. The input similarity elements are used to find similar data in a repository of data using a similarity search structure. A resolution level is calculated for storing the input similarity elements. The input similarity elements are stored in the calculated resolution level in the similarity search structure.
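The flow summarized in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the similarity elements are assumed here to be the smallest distinct block hashes of a chunk, the similarity search structure is assumed to be a plain dictionary, and every name (`partition`, `similarity_elements`, `SimilarityIndex`, `ingest`, `BLOCK_SIZE`) is hypothetical. The parameter `elements_per_chunk` stands in for the resolution level, that is, the number of similarity elements kept relative to the chunk size.

```python
import hashlib
from collections import Counter

CHUNK_SIZE = 4 * 1024 * 1024  # assumed chunk size; the claims call for at least 4 MB
BLOCK_SIZE = 4096             # hypothetical block size used when hashing a chunk


def partition(data, chunk_size=CHUNK_SIZE):
    """Split input data into consecutive chunks (tail handling is simplified)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def similarity_elements(chunk, count, block_size=BLOCK_SIZE):
    """Assumed scheme: keep the `count` smallest distinct 64-bit block hashes."""
    hashes = {
        int.from_bytes(hashlib.sha1(chunk[i:i + block_size]).digest()[:8], "big")
        for i in range(0, len(chunk), block_size)
    }
    return sorted(hashes)[:count]


class SimilarityIndex:
    """Toy similarity search structure: similarity element -> chunk id."""

    def __init__(self):
        self._owner = {}

    def find_similar(self, elements):
        """Return the stored chunk id sharing the most elements, or None."""
        votes = Counter(self._owner[e] for e in elements if e in self._owner)
        return votes.most_common(1)[0][0] if votes else None

    def store(self, chunk_id, elements):
        for element in elements:
            self._owner[element] = chunk_id


def ingest(data, index, elements_per_chunk, chunk_size=CHUNK_SIZE,
           block_size=BLOCK_SIZE, first_id=0):
    """Search for similar data per chunk, then store the chunk's elements."""
    results = []
    for offset, chunk in enumerate(partition(data, chunk_size)):
        chunk_id = first_id + offset
        elements = similarity_elements(chunk, elements_per_chunk, block_size)
        results.append((chunk_id, index.find_similar(elements)))
        index.store(chunk_id, elements)
    return results
```

Ingesting the same data twice into one `SimilarityIndex` would report no match for each chunk on the first pass and the earlier chunk id on the second pass. A lower `elements_per_chunk` shrinks the index at the cost of coarser matching.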
18 Claims
1. A method for adaptive similarity search resolution in a data deduplication system using a processor device in a computing environment, comprising:
- partitioning input data into input data chunks, the input data chunks each being at least 4 Megabytes (MB) in size;
- calculating input similarity elements for an input chunk;
- using the input similarity elements to find similar data in a repository of data using a similarity search structure;
- calculating a resolution level for storing the input similarity elements, the resolution level comprising a number of the input similarity elements in relation to a size of the input chunk;
- storing the input similarity elements in the calculated resolution level in the similarity search structure;
- deduplicating the input chunk with the found similar data in the repository of data using the input similarity units in the calculated resolution level;
- calculating the resolution level for storing the input similarity elements based on calculated sets of similarity element matches and on a calculated deduplication ratio, the deduplication ratio defined as a total size of the input data covered by matches with repository data out of the total size of the input data; and
- decreasing the resolution level of the stored input similarity elements if an aggregated deduplication ratio is not lower than a predefined threshold and an average size of the calculated sets of similarity element matches is not lower than two and a current resolution level is higher than a lowest resolution level.
Dependent claims: 2, 3, 4, 5, 6.
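The deduplication ratio and the decrease condition recited in claim 1 can be sketched in Python as follows. This is a hedged illustration only: the claim does not say how levels are encoded, how far a level drops, or over which span the ratio is aggregated, so the sketch assumes integer levels, a single-step decrease, and aggregation over a caller-supplied list of per-chunk statistics. The names `ChunkStats`, `deduplication_ratio`, `average_match_set_size`, and `next_resolution_level` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChunkStats:
    """Per-chunk bookkeeping (hypothetical layout)."""
    input_size: int             # bytes of input data in the chunk
    matched_size: int           # bytes covered by matches with repository data
    match_set_sizes: List[int]  # size of each set of similarity element matches


def deduplication_ratio(stats):
    """Input size covered by repository matches, out of the total input size."""
    total_input = sum(s.input_size for s in stats)
    total_matched = sum(s.matched_size for s in stats)
    return total_matched / total_input if total_input else 0.0


def average_match_set_size(stats):
    """Average size over all sets of similarity element matches."""
    sizes = [n for s in stats for n in s.match_set_sizes]
    return sum(sizes) / len(sizes) if sizes else 0.0


def next_resolution_level(current, stats, ratio_threshold, lowest_level=0):
    """Decrease by one step (an assumption) only when all three conditions hold."""
    if (deduplication_ratio(stats) >= ratio_threshold      # ratio not below threshold
            and average_match_set_size(stats) >= 2         # average set size not below two
            and current > lowest_level):                   # room left to decrease
        return current - 1
    return current
```

For example, with two chunks of 100 bytes each, 90 and 80 bytes matched, and match sets of sizes 3, 2 and 2, the aggregated ratio is 0.85 and the average set size is about 2.33, so a threshold of 0.8 would lower level 3 to level 2, while a threshold of 0.9 would leave it unchanged.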
7. A system for adaptive similarity search resolution in a data deduplication system of a computing environment, the system comprising:
- the data deduplication system;
- a repository operating in the data deduplication system;
- a memory in the data deduplication system;
- a similarity search structure in association with the memory in the data deduplication system; and
- at least one processor device operable in the computing storage environment for controlling the data deduplication system, wherein the at least one processor device:
  - partitions input data into input data chunks, the input data chunks each being at least 4 Megabytes (MB) in size,
  - calculates input similarity elements for an input chunk;
  - uses the input similarity elements to find similar data in a repository of data using the similarity search structure,
  - calculates a resolution level for storing the input similarity elements, the resolution level comprising a number of the input similarity elements in relation to a size of the input chunk,
  - stores the input similarity elements in the calculated resolution level in the similarity search structure,
  - deduplicates the input chunk with the found similar data in the repository of data using the input similarity units in the calculated resolution level;
  - calculates the resolution level for storing the input similarity elements based on calculated sets of similarity element matches and on a calculated deduplication ratio, the deduplication ratio defined as a total size of the input data covered by matches with repository data out of the total size of the input data; and
  - decreases the resolution level of the stored input similarity elements if an aggregated deduplication ratio is not lower than a predefined threshold and an average size of the calculated sets of similarity element matches is not lower than two and a current resolution level is higher than a lowest resolution level.
Dependent claims: 8, 9, 10, 11, 12.
13. A computer program product for adaptive similarity search resolution in a data deduplication system using a processor device in a computing environment, the computer program product comprising a non-transitory computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising:
- an executable portion that partitions input data into input data chunks, the input data chunks each being at least 4 Megabytes (MB) in size;
- an executable portion that calculates input similarity elements for an input chunk;
- an executable portion that uses the input similarity elements to find similar data in a repository of data using a similarity search structure;
- an executable portion that calculates a resolution level for storing the input similarity elements, the resolution level comprising a number of the input similarity elements in relation to a size of the input chunk;
- an executable portion that stores the input similarity elements in the calculated resolution level in the similarity search structure;
- an executable portion that deduplicates the input chunk with the found similar data in the repository of data using the input similarity units in the calculated resolution level;
- an executable portion that calculates the resolution level for storing the input similarity elements based on calculated sets of similarity element matches and on a calculated deduplication ratio, the deduplication ratio defined as a total size of the input data covered by matches with repository data out of the total size of the input data; and
- an executable portion that decreases the resolution level of the stored input similarity elements if an aggregated deduplication ratio is not lower than a predefined threshold and an average size of the calculated sets of similarity element matches is not lower than two and a current resolution level is higher than a lowest resolution level.
Dependent claims: 14, 15, 16, 17, 18.
Specification