Scalable chunk store for data deduplication
First Claim
1. A method, comprising:
- parsing a data stream into a sequence of data chunks;
determining whether any of the sequence of data chunks are stored in a chunk container that includes a plurality of data chunks;
storing, in a contiguous arrangement and in a same sequence in the chunk container as in the data stream, data chunks of the sequence of data chunks determined to not be stored in the chunk container;
generating a stream map that is a data structure that describes a mapping between a structure of the data stream and an optimized structure of the data chunks stored in the chunk container to enable data chunks referenced in the stream map to be located in the chunk container, the optimized structure including data chunks that have been deduplicated, the stream map including metadata for each data chunk of the sequence; and
including, in the metadata for each of the data chunks stored in the contiguous arrangement, a same locality indicator value that indicates the contiguous arrangement and indicates that each of the data chunks stored in the contiguous arrangement is associated with the generated stream map.
2 Assignments
0 Petitions
Accused Products
Abstract
Data streams may be stored in a chunk store in the form of stream maps and data chunks. Data chunks corresponding to a data stream may be stored in a chunk container, and a stream map corresponding to the data stream may point to the data chunks in the chunk container. Multiple stream maps may be stored in a stream container, and may point to the data chunks in the chunk container in a manner that duplicate data chunks are not present. Techniques are provided herein for localizing the storage of related data chunks in such chunk containers, for locating data chunks stored in chunk containers, for storing data streams in chunk stores in localized manners that enhance locality and decrease defragmentation, and for reorganizing stored data streams in chunks stores.
83 Citations
19 Claims
-
1. A method, comprising:
-
parsing a data stream into a sequence of data chunks; determining whether any of the sequence of data chunks are stored in a chunk container that includes a plurality of data chunks; storing, in a contiguous arrangement and in a same sequence in the chunk container as in the data stream, data chunks of the sequence of data chunks determined to not be stored in the chunk container; generating a stream map that is a data structure that describes a mapping between a structure of the data stream and an optimized structure of the data chunks stored in the chunk container to enable data chunks referenced in the stream map to be located in the chunk container, the optimized structure including data chunks that have been deduplicated, the stream map including metadata for each data chunk of the sequence; and including, in the metadata for each of the data chunks stored in the contiguous arrangement, a same locality indicator value that indicates the contiguous arrangement and indicates that each of the data chunks stored in the contiguous arrangement is associated with the generated stream map. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
-
-
12. A method for storing a data stream, comprising:
-
(a) generating a stream map for the data stream that includes stream metadata; (b) storing an indication of a minimum allowable number of repeating data chunks in a chunk container; (c) accumulating a sequence of data chunks from the data stream; (d) determining whether the accumulated sequence of data chunks is a duplicate of any stored sequence of data chunks, the stored sequence of data chunks being stored contiguously in the chunk container; (e) in response to determining the accumulated sequence of data chunks is a duplicate of a stored sequence of data chunks, determining whether the accumulated sequence of data chunks includes a number of data chunks that is greater than or equal to the stored indication; and (f) storing in the stream metadata pointers to the stored sequence of data chunks in response to determining the accumulated sequence of data chunks to have a number of data chunks that is greater than or equal to the stored indication. - View Dependent Claims (13, 14, 15)
-
-
16. A method, comprising:
-
receiving a portion of a data stream that includes a plurality of data chunks; determining a plurality of data chunk sequences in the plurality of data chunks, each determined data chunk sequence including a sequence of data chunks duplicating a stored sequence of data chunks stored contiguously in a chunk store; segmenting the plurality of data chunks into a number of data chunk sets corresponding to a fragmentation factor, where the data chunks of each determined data chunk sequence are included together in a data chunk set and the fragmentation factor indicates a maximum fragmentation for the segmenting of the plurality of data chunks; storing data chunks of a first group of the data chunk sets as pointers in data stream metadata to existing data chunks without storing data of the data chunks of the first group, the first group including data chunks sets that are sequences of data chunks duplicating sequences in the chunk store; and storing data chunks of a second group of the data chunk sets other than data chunks in the first group of the data chunk sets as new contiguous data chunks in the chunk store, the second group at least including data chunks that are not duplicates of data chunks in the chunk store. - View Dependent Claims (17, 18, 19)
-
Specification