Data deduplication with adaptive erasure code redundancy
First Claim
1. A non-transitory computer-readable storage medium storing computer-executable instructions that when executed by a computer control the computer to perform a method, the method comprising:
- accessing unique data produced by a data deduplication system;
identifying a property of the unique data, andmanipulating, based at least in part on the property, a generator matrix representation of an encoding graph associated with an erasure encoder, where manipulating the generator matrix includes controlling a number of elements in the generator matrix or controlling the value of one or more elements in the generator matrix, where a non-zero entry in the generator matrix represents a node/edge probability distribution;
generating, using the erasure encoder, a set of W erasure code symbols for the unique data, where the erasure code symbols are generated according to an X/Y erasure code policy, W, X, and Y being integers, W being greater than or equal to X, W being less than or equal to Y, and where W, X, or Y depend, at least in part, on the property of the unique data, where W, X, or Y are directly proportional to the property; and
selectively storing members of the set of W erasure code symbols on Z different data storage devices, where Z is a function of a second property of the unique data, where the second property includes a size of the unique data, an age of the unique data, a cost to replace the unique data, or a user-assigned value, where the value of the second property may vary over time, Z being an integer.
8 Assignments
0 Petitions
Accused Products
Abstract
Example apparatus and methods combine erasure coding with data deduplication to simultaneously reduce the overall redundancy in data while increasing the redundancy of unique data. In one embodiment, an efficient representation of a data set is produced by deduplication. The efficient representation reduces duplicate data in the data set. Redundancy is then added back into the data set using erasure coding. The redundancy that is added back in adds protection to the unique data associated with the efficient representation. How much redundancy is added back in and what type of redundancy is added back in may be controlled based on an attribute (e.g., value, reference count, symbol size, number of symbols) of the unique data. Decisions concerning how much and what type of redundancy to add back in may be adapted over time based, for example, on observations of the efficiency of the overall system.
-
Citations
15 Claims
-
1. A non-transitory computer-readable storage medium storing computer-executable instructions that when executed by a computer control the computer to perform a method, the method comprising:
-
accessing unique data produced by a data deduplication system; identifying a property of the unique data, and manipulating, based at least in part on the property, a generator matrix representation of an encoding graph associated with an erasure encoder, where manipulating the generator matrix includes controlling a number of elements in the generator matrix or controlling the value of one or more elements in the generator matrix, where a non-zero entry in the generator matrix represents a node/edge probability distribution; generating, using the erasure encoder, a set of W erasure code symbols for the unique data, where the erasure code symbols are generated according to an X/Y erasure code policy, W, X, and Y being integers, W being greater than or equal to X, W being less than or equal to Y, and where W, X, or Y depend, at least in part, on the property of the unique data, where W, X, or Y are directly proportional to the property; and selectively storing members of the set of W erasure code symbols on Z different data storage devices, where Z is a function of a second property of the unique data, where the second property includes a size of the unique data, an age of the unique data, a cost to replace the unique data, or a user-assigned value, where the value of the second property may vary over time, Z being an integer. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
-
-
11. An apparatus, comprising:
-
a processor; a memory; a set of logics; and an interface that connects the set of logics, the memory, and the processor, the set of logics comprising; a first logic that manipulates a generator matrix representation of an encoding graph associated with an erasure encoder based on a first attribute of a message received from a data deduplication system, where the first attribute describes an importance of the message to the data deduplication system, a size of the message, an amount to be spent protecting the message, or an age of the message; a second logic that produces, using the erasure encoder, a set of N erasure code symbols for the message, where the message has K symbols, where N is a function of the first attribute, N and K being integers, N being greater than K, where a member of the set of N erasure code symbols is a systematic, rateless erasure code; and a third logic that selectively stores members of the set of N erasure code symbols on Z different data storage devices, where Z is a function of a second attribute of the message, Z being an integer. - View Dependent Claims (12, 13, 14)
-
-
15. A method for storing de-duplicated data in a data storage system comprising:
-
accessing a message produced by a data deduplication system; identifying a property of the message; manipulating, based at least in part on the property, a generator matrix representation of an encoding graph associated with an erasure encoder; generating, using the erasure encoder, a set of erasure code symbols for the message, where the erasure code symbols are generated according to a code rate that defines a minimum number of encoded symbols needed to recreate the message and a total number of encoded symbols that can be produced for the message, where the number of erasure code symbols in the set of erasure code symbols is greater than the minimum number of encoded symbols needed to recreate the message, where the number of erasure code symbols in the set of erasure code symbols is less than the total number of encoded symbols that can be produced for the message, and where the number of erasure code symbols in the set of erasure code symbols, the minimum number of encoded symbols needed to recreate the message, or the total number of encoded symbols that can be produced for the message depend, at least in part, on the property of the message.
-
Specification