Data deduplication with adaptive erasure code redundancy
First Claim
1. A non-transitory computer-readable storage medium storing computer-executable instructions that when executed by a computer control the computer to perform a method, the method comprising:
- generating, by a processor, a characterization of segments of a plurality of segments with failure probabilities associated with the segments that indicate a likelihood of failure of the segments;
parsing, by a processor, the plurality of segments to generate a plurality of original chunks;
deduplicating, by a processor, the plurality of original chunks to generate a plurality of unique chunks;
grouping, by a processor, the plurality of unique chunks to form grouped chunks;
encoding, by a processor, the grouped chunks to generate a desired number of erasure code symbols that satisfy the failure probabilities associated with the segments; and
selectively storing members of the desired number of erasure code symbols on a number of storage devices.
7 Assignments
0 Petitions
Accused Products
Abstract
Example apparatus and methods combine erasure coding with data deduplication to simultaneously reduce the overall redundancy in data while increasing the redundancy of unique data. In one embodiment, an efficient representation of a data set is produced by deduplication. The efficient representation reduces duplicate data in the data set. Redundancy is then added back into the data set using erasure coding. The redundancy that is added back in adds protection to the unique data associated with the efficient representation. How much redundancy is added back in and what type of redundancy is added back in may be controlled based on an attribute (e.g., value, reference count, symbol size, number of symbols) of the unique data. Decisions concerning how much and what type of redundancy to add back in may be adapted over time based, for example, on observations of the efficiency of the overall system.
-
Citations
20 Claims
-
1. A non-transitory computer-readable storage medium storing computer-executable instructions that when executed by a computer control the computer to perform a method, the method comprising:
-
generating, by a processor, a characterization of segments of a plurality of segments with failure probabilities associated with the segments that indicate a likelihood of failure of the segments; parsing, by a processor, the plurality of segments to generate a plurality of original chunks; deduplicating, by a processor, the plurality of original chunks to generate a plurality of unique chunks; grouping, by a processor, the plurality of unique chunks to form grouped chunks; encoding, by a processor, the grouped chunks to generate a desired number of erasure code symbols that satisfy the failure probabilities associated with the segments; and selectively storing members of the desired number of erasure code symbols on a number of storage devices. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A method comprising:
-
generating, by a processor, a characterization of segments of a plurality of segments with failure probabilities associated with the segments that indicate a likelihood of failure of the segments; parsing, by a processor, the plurality of segments to generate a plurality of original chunks, deduplicating, by a processor, the plurality of original chunks to generate a plurality of unique chunks; calculating, by a processor, attributes of unique chunks of the plurality of unique chunks based, at least in part, on the failure probabilities associated with corresponding segments; grouping, by a processor, the plurality of unique chunks to form a grouped chunk; encoding, by a processor, the grouped chunk to generate a desired number of erasure code symbols based, at least in part, on the attributes being constrained by the failure probabilities; and selectively storing members of the desired number of erasure code symbols on a number of storage devices, where the number of storage devices is based, at least in part, on the attributes for the unique chunks. - View Dependent Claims (9, 10, 11, 12, 13, 14)
-
-
15. An apparatus, comprising:
-
a number of storage devices; and a processor configured to; generate a characterization of segments of a plurality of segments with failure probabilities associated with the segments that indicate a likelihood of failure of segments; parse the plurality of segments to generate a plurality of original chunks, deduplicate the plurality of original chunks to generate a plurality of unique chunks; calculate attributes of unique chunks of the plurality of unique chunks based, at least in part, on the failure probabilities; group the plurality of unique chunks to form at least one grouped chunk; encode the at least one grouped chunk to generate a desired number of erasure code symbols that satisfy the failure probabilities associated with the segments, where the encoding is based, at least in part, on the attributes being constrained by the failure probabilities; and selectively store members of the desired number of erasure code symbols on the number of storage devices, where the number of storage devices is based, at least in part, on the attributes for the unique chunks. - View Dependent Claims (16, 17, 18, 19, 20)
-
Specification