Detection and deduplication of backup sets exhibiting poor locality
First Claim
1. A computerized method for storing data comprising:
determining, by a computing device, a first set of summaries of a first data set, each summary of the first set of summaries being indicative of a data pattern in the first data set at an associated location in the first data set;
determining, by the computing device, a second set of summaries of a second data set, each summary of the second set of summaries being indicative of a data pattern in the second data set at an associated location in the second data set;
calculating, by the computing device, a set of comparison metrics, each comparison metric being based on a first subset of summaries from the first set of summaries and a second subset of summaries from the second set of summaries;
calculating, by the computing device, a locality metric based on the set of comparison metrics, the locality metric being indicative of a ratio of data within the first data set which is distributed as redundant data within the second data set with distance greater than a predetermined threshold;
adjusting at least one parameter of a deduplication process based on the locality metric, the at least one parameter including at least one of a detection parameter and a deduplication parameter; and
deduplicating the first data set and the second data set using the deduplication process.
Abstract
Described are computer-based methods and apparatuses, including computer program products, for detection and deduplication of backup sets exhibiting poor locality. A first set of summaries of a first data set is determined, each summary of the first set of summaries being indicative of a data pattern in the first data set. A second set of summaries of a second data set is determined, each summary of the second set of summaries being indicative of a data pattern in the second data set. A set of comparison metrics is calculated, each comparison metric being based on a first subset of summaries from the first set of summaries and a second subset of summaries from the second set of summaries. A locality metric, indicative of whether the first data set and the second data set exhibit poor locality, is calculated based on the set of comparison metrics.
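The flow recited in the first claim (summaries, comparison metrics, locality metric, parameter adjustment, deduplication) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: summaries are SHA-256 digests of fixed-size chunks paired with their byte offsets, each comparison metric is the list of offset displacements for one window of summaries, the locality metric is the fraction of first-set chunks that recur in the second set farther away than a threshold, and the adjusted parameter is a hypothetical search distance used by a toy deduplicator.

```python
import hashlib

def summarize(data, chunk_size):
    """Return one (offset, digest) summary per fixed-size chunk (assumed scheme)."""
    return [(off, hashlib.sha256(data[off:off + chunk_size]).digest())
            for off in range(0, len(data), chunk_size)]

def comparison_metrics(first, second, window):
    """One metric per window of first-set summaries: for each chunk that also
    occurs in the second set, the smallest offset displacement to a match."""
    positions = {}
    for off, digest in second:
        positions.setdefault(digest, []).append(off)
    metrics = []
    for start in range(0, len(first), window):
        metrics.append([
            min(abs(o - off) for o in positions[digest])
            for off, digest in first[start:start + window]
            if digest in positions
        ])
    return metrics

def locality_metric(metrics, total_chunks, threshold):
    """Fraction of first-set chunks that are redundant in the second set
    but displaced by more than `threshold` bytes."""
    far = sum(1 for group in metrics for d in group if d > threshold)
    return far / total_chunks if total_chunks else 0.0

def adjust_search_distance(locality, narrow, wide, cutoff):
    """Hypothetical parameter adjustment: search farther when locality is poor."""
    return wide if locality > cutoff else narrow

def deduplicate(first, second, search_distance):
    """Toy deduplicator: count second-set chunks that must be stored because
    no identical first-set chunk lies within `search_distance` bytes."""
    positions = {}
    for off, digest in first:
        positions.setdefault(digest, []).append(off)
    new_chunks = 0
    for off, digest in second:
        matches = positions.get(digest, [])
        if not any(abs(o - off) <= search_distance for o in matches):
            new_chunks += 1
    return new_chunks

# Same four chunks in both data sets, but in reversed order (poor locality).
first = summarize(b"AAAABBBBCCCCDDDD", 4)
second = summarize(b"DDDDCCCCBBBBAAAA", 4)
metrics = comparison_metrics(first, second, window=2)
locality = locality_metric(metrics, len(first), threshold=4)
distance = adjust_search_distance(locality, narrow=4, wide=16, cutoff=0.25)
stored = deduplicate(first, second, distance)
```

In this toy input, half of the chunks recur more than 4 bytes away, so the sketch widens the search distance and every second-set chunk is found as a duplicate; with the narrow distance, two chunks would have been stored again.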
20 Claims
1. A computerized method for storing data comprising:
determining, by a computing device, a first set of summaries of a first data set, each summary of the first set of summaries being indicative of a data pattern in the first data set at an associated location in the first data set;
determining, by the computing device, a second set of summaries of a second data set, each summary of the second set of summaries being indicative of a data pattern in the second data set at an associated location in the second data set;
calculating, by the computing device, a set of comparison metrics, each comparison metric being based on a first subset of summaries from the first set of summaries and a second subset of summaries from the second set of summaries;
calculating, by the computing device, a locality metric based on the set of comparison metrics, the locality metric being indicative of a ratio of data within the first data set which is distributed as redundant data within the second data set with distance greater than a predetermined threshold;
adjusting at least one parameter of a deduplication process based on the locality metric, the at least one parameter including at least one of a detection parameter and a deduplication parameter; and
deduplicating the first data set and the second data set using the deduplication process.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
17. A computer program product, tangibly embodied in a non-transitory computer readable medium, the computer program product including instructions being configured to cause a data processing apparatus to:
determine a first set of summaries of a first data set, each summary of the first set of summaries being indicative of a data pattern in the first data set at an associated location in the first data set;
determine a second set of summaries of a second data set, each summary of the second set of summaries being indicative of a data pattern in the second data set at an associated location in the second data set;
calculate a set of comparison metrics, each comparison metric being based on a first subset of summaries from the first set of summaries and a second subset of summaries from the second set of summaries;
calculate a locality metric based on the set of comparison metrics, the locality metric being indicative of a ratio of data within the first data set which is distributed as redundant data within the second data set with distance greater than a predetermined threshold;
adjust at least one parameter of a deduplication process based on the locality metric, the at least one parameter including at least one of a detection parameter and a deduplication parameter; and
deduplicate the first data set and the second data set using the deduplication process.
Dependent claims: 18
19. An apparatus comprising a processor and memory configured to:
determine a first set of summaries of a first data set, each summary of the first set of summaries being indicative of a data pattern in the first data set at an associated location in the first data set;
determine a second set of summaries of a second data set, each summary of the second set of summaries being indicative of a data pattern in the second data set at an associated location in the second data set;
calculate a set of comparison metrics, each comparison metric being based on a first subset of summaries from the first set of summaries and a second subset of summaries from the second set of summaries;
calculate a locality metric based on the set of comparison metrics, the locality metric being indicative of a ratio of data within the first data set which is distributed as redundant data within the second data set with distance greater than a predetermined threshold;
adjust at least one parameter of a deduplication process based on the locality metric, the at least one parameter including at least one of a detection parameter and a deduplication parameter; and
deduplicate the first data set and the second data set using the deduplication process.
Dependent claims: 20
Specification