Joint de-duplication-erasure coded distributed storage
First Claim
1. A non-transitory computer-readable storage device storing computer executable instructions that when executed by a computer control the computer to perform a method for deduplicating and erasure coding a message, the method comprising:
- accessing the message, generating a set of message chunks by chunking the message using a first chunking approach;
generating a set of outer-precoded parity symbols and a set of outer-precoded data symbols from the set of message chunks using an outer precode, where the outer precode is a low density parity check (LDPC) precode that uses an LDPC parity check matrix, where the LDPC parity check matrix has a column weight and a row weight;
selectively adapting the LDPC parity check matrix by;
computing a set of chunk statistics associated with a first set of erasure codes;
generating a worst-case (WC) symbol characterization for a member of the first set of erasure codes based, at least in part, on a WC average number of symbol erasures or a WC average number of symbol errors;
computing an average number of symbol errors for the first set of erasure codes based, at least in part, on a set of deduplication parameters, the set of chunk statistics, and the worst-case symbol characterization;
choosing a column weight or a row weight such that a failure probability Pf is less than a decoding threshold;
upon determining that there has been a change in a chunk pool;
generating a new LDPC parity check matrix based, at least in part, on the column weight and the row weight; and
replacing the LDPC parity check matrix with the new LDPC parity check matrix;
storing the set of outer-precoded parity symbols in a data storage system;
generating a set of unique data symbols by deduplicating the set of outer-precoded data symbols based, at least in part, on a chunk identification (ID) table, where the chunk ID table stores a unique chunk ID associated with a unique data symbol stored in the data storage system, a chunk size associated with the unique data symbol, or a chunk reference count associated with the unique chunk ID, where the chunk ID table is stored in a data storage device with a faster access time than the data storage system;
storing a copy of the set of unique data symbols in the data storage system;
generating a set of inner-precoded data symbols from the set of unique data symbols using an inner-precode;
generating the first set of erasure codes from the set of inner-precoded data symbols using an unequal error protection (UEP) rateless Luby transform (LT) code based, at least in part, on the chunk ID table; and
storing the first set of erasure codes in the data storage system.
8 Assignments
0 Petitions
Accused Products
Abstract
Methods and apparatus deduplicate and erasure code a message in a data storage system. One example apparatus includes a first chunking circuit that generates a set of data chunks from a message, an outer precoding circuit that generates a set of precoded data chunks and a set of parity symbols from the set of data chunks, a second chunking circuit that generates a set of chunked parity symbols from the set of parity symbols, a deduplication circuit that generates a set of deduplicated data chunks by deduplicating the set of precoded chunks or the set of chunked parity symbols, an unequal error protection (UEP) circuit that generates an encoded message from the set of deduplicated data chunks, and a storage circuit that controls the data storage system to store the set of deduplicated data chunks, the set of parity symbols, or the encoded message.
27 Citations
15 Claims
-
1. A non-transitory computer-readable storage device storing computer executable instructions that when executed by a computer control the computer to perform a method for deduplicating and erasure coding a message, the method comprising:
- accessing the message, generating a set of message chunks by chunking the message using a first chunking approach;
generating a set of outer-precoded parity symbols and a set of outer-precoded data symbols from the set of message chunks using an outer precode, where the outer precode is a low density parity check (LDPC) precode that uses an LDPC parity check matrix, where the LDPC parity check matrix has a column weight and a row weight;
selectively adapting the LDPC parity check matrix by;
computing a set of chunk statistics associated with a first set of erasure codes;
generating a worst-case (WC) symbol characterization for a member of the first set of erasure codes based, at least in part, on a WC average number of symbol erasures or a WC average number of symbol errors;
computing an average number of symbol errors for the first set of erasure codes based, at least in part, on a set of deduplication parameters, the set of chunk statistics, and the worst-case symbol characterization;
choosing a column weight or a row weight such that a failure probability Pf is less than a decoding threshold;
upon determining that there has been a change in a chunk pool;
generating a new LDPC parity check matrix based, at least in part, on the column weight and the row weight; and
replacing the LDPC parity check matrix with the new LDPC parity check matrix;
storing the set of outer-precoded parity symbols in a data storage system;
generating a set of unique data symbols by deduplicating the set of outer-precoded data symbols based, at least in part, on a chunk identification (ID) table, where the chunk ID table stores a unique chunk ID associated with a unique data symbol stored in the data storage system, a chunk size associated with the unique data symbol, or a chunk reference count associated with the unique chunk ID, where the chunk ID table is stored in a data storage device with a faster access time than the data storage system;
storing a copy of the set of unique data symbols in the data storage system;
generating a set of inner-precoded data symbols from the set of unique data symbols using an inner-precode;
generating the first set of erasure codes from the set of inner-precoded data symbols using an unequal error protection (UEP) rateless Luby transform (LT) code based, at least in part, on the chunk ID table; and
storing the first set of erasure codes in the data storage system. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
- accessing the message, generating a set of message chunks by chunking the message using a first chunking approach;
Specification