SCALABLE CHUNK STORE FOR DATA DEDUPLICATION
Abstract
Data streams may be stored in a chunk store in the form of stream maps and data chunks. Data chunks corresponding to a data stream may be stored in a chunk container, and a stream map corresponding to the data stream may point to the data chunks in the chunk container. Multiple stream maps may be stored in a stream container, and may point to the data chunks in the chunk container in a manner that duplicate data chunks are not present. Techniques are provided herein for localizing the storage of related data chunks in such chunk containers, for locating data chunks stored in chunk containers, for storing data streams in chunk stores in localized manners that enhance locality and decrease defragmentation, and for reorganizing stored data streams in chunks stores.
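The layout described in the abstract can be illustrated with a minimal sketch. This is not the patent's on-disk format; all names (ChunkStore, stream_container, etc.) are hypothetical, and content hashes are assumed to identify duplicate chunks:

```python
import hashlib

# Illustrative sketch of the chunk-store layout from the abstract:
# a chunk container holds unique chunk payloads, and each data stream's
# stream map holds pointers (here, indices) into the chunk container,
# so duplicate chunks are never stored twice.
class ChunkStore:
    def __init__(self):
        self.chunk_container = []   # unique chunk payloads, in insertion order
        self.index_by_hash = {}     # content hash -> position in chunk_container
        self.stream_container = {}  # stream name -> stream map (list of positions)

    def add_stream(self, name, chunks):
        stream_map = []
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            pos = self.index_by_hash.get(digest)
            if pos is None:  # new chunk: append to the chunk container
                pos = len(self.chunk_container)
                self.chunk_container.append(chunk)
                self.index_by_hash[digest] = pos
            stream_map.append(pos)  # duplicate chunks share one stored copy
        self.stream_container[name] = stream_map

    def read_stream(self, name):
        return b"".join(self.chunk_container[p] for p in self.stream_container[name])
```

Two streams that share chunks then store each shared chunk once, with both stream maps pointing at the same container position.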
20 Claims
1. A method, comprising:

parsing a data stream into a sequence of data chunks;

determining whether any of the sequence of data chunks are stored in a chunk container that includes a plurality of data chunks; and

storing data chunks of the sequence of data chunks determined to not be stored in the chunk container in a contiguous arrangement and in a same sequence in the chunk container as in the data stream.

Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
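The storage step of claim 1 can be sketched as follows. This is a hypothetical helper, not the patented code; it assumes a hash set tracks which chunks are already in the container:

```python
import hashlib

def store_new_chunks(chunk_container, stored_hashes, data_chunks):
    """Sketch of claim 1: chunks not already in the chunk container are
    appended contiguously, in the same sequence as in the data stream;
    chunks already present are skipped."""
    for chunk in data_chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in stored_hashes:    # not yet in the chunk container
            chunk_container.append(chunk)  # contiguous append, stream order
            stored_hashes.add(digest)
```

Because new chunks are only ever appended, related chunks of one stream end up adjacent in the container, which is the locality property the abstract emphasizes.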
12. A method for storing a data stream, comprising:

(a) storing an indication of a minimum allowable number of repeating data chunks;

(b) accumulating a sequence of data chunks from the data stream;

(c) determining whether the accumulated sequence of data chunks is a duplicate of any stored sequence of data chunks, the stored sequence of data chunks being stored contiguously in the chunk container;

(d) if the accumulated sequence of data chunks is determined to be a duplicate of a stored sequence of data chunks, determining whether the accumulated sequence includes a number of data chunks that is greater than or equal to the stored indication; and

(e) storing in the stream metadata pointers to the stored sequence of data chunks if the accumulated sequence of data chunks is determined to have a number of data chunks that is greater than or equal to the stored indication.

Dependent claims: 13, 14
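Steps (c)-(e) of claim 12 amount to a threshold check before referencing a stored run. A minimal sketch, with hypothetical names and an illustrative threshold value (the patent does not prescribe one):

```python
MIN_REPEATING_CHUNKS = 3  # (a) stored indication; illustrative value

def reference_if_long_enough(accumulated, stored_sequences, stream_metadata):
    """Sketch of steps (c)-(e) of claim 12. stored_sequences maps a tuple
    of chunk hashes to the contiguous offsets of that sequence in the
    chunk container. Returns True if pointers were stored, False if the
    caller should store the chunks as new data instead."""
    offsets = stored_sequences.get(tuple(accumulated))  # (c) duplicate check
    if offsets is not None and len(accumulated) >= MIN_REPEATING_CHUNKS:
        stream_metadata.extend(offsets)  # (e) store pointers only, no data
        return True
    return False
```

The threshold trades deduplication ratio for locality: very short duplicate runs are re-stored as new contiguous data rather than scattering the stream's pointers across the container.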
15. A method, comprising:

receiving a portion of a data stream that includes a plurality of data chunks;

determining a plurality of data chunk sequences in the plurality of data chunks, each determined data chunk sequence including a sequence of data chunks duplicating a stored sequence of data chunks stored contiguously in a chunk store;

segmenting the plurality of data chunks into a number of data chunk sets, where the data chunks of each determined data chunk sequence are included together in a data chunk set;

storing data chunks of a first group of the data chunk sets as pointers in data stream metadata to existing data chunks without storing data of the data chunks of the first group, the first group including data chunk sets that are sequences of data chunks duplicating sequences in the chunk store; and

storing data chunks of a second group of the data chunk sets other than data chunks in the first group of the data chunk sets as new contiguous data chunks in the chunk store, the second group at least including data chunks that are not duplicates of data chunks in the chunk store.

Dependent claims: 16
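The segmentation in claim 15 can be sketched as one pass over the chunks, assuming the duplicate runs have already been matched against the chunk store. All names are hypothetical; `duplicate_spans` stands in for whatever index lookup produced the matches:

```python
def segment_and_store(chunks, duplicate_spans, chunk_container, stream_metadata):
    """Sketch of claim 15. duplicate_spans lists (start, end, offsets) for
    runs chunks[start:end] that duplicate a sequence stored contiguously in
    the chunk container (first group: stored as pointers only); chunks
    outside those spans (second group) are stored as new contiguous data."""
    i = 0
    for start, end, offsets in sorted(duplicate_spans):
        for chunk in chunks[i:start]:  # second group: new data before the span
            stream_metadata.append(len(chunk_container))
            chunk_container.append(chunk)
        stream_metadata.extend(offsets)  # first group: pointers, no data stored
        i = end
    for chunk in chunks[i:]:  # trailing second-group chunks
        stream_metadata.append(len(chunk_container))
        chunk_container.append(chunk)
```

Either way, the stream map records one pointer per chunk in stream order, so the stream can be reassembled by a single walk over its metadata.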
17. A method of reorganizing stored data streams, comprising:

prioritizing a plurality of data streams stored as data chunks in a chunk store; and

determining a reorganization of the stored data chunks of the plurality of data streams according to said prioritizing, said reorganizing including:

selecting a de-duplicated data stream;

relocating one or more data chunks of the selected data stream to decrease fragmentation of the selected data stream by displacing at least one data chunk of a data stream having a lower priority than the selected data stream; and

updating pointers in data stream metadata to the relocated one or more data chunks.

Dependent claims: 18, 19, 20
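One way to realize the relocation in claim 17 is to swap the selected stream's chunks into a contiguous run, then patch every stream map. This is a simplified sketch under stated assumptions (each pointer appears once per map; all names are hypothetical), not the patented algorithm:

```python
def defragment_stream(selected, streams, chunk_container):
    """Sketch of claim 17: relocate the selected de-duplicated stream's
    chunks to the front of the chunk container so they are contiguous,
    displacing chunks of lower-priority streams by swapping, then update
    pointers in every stream's metadata to the relocated chunks."""
    # per the claim, the displaced streams have lower priority than the selected one
    assert all(streams[s]["priority"] <= streams[selected]["priority"]
               for s in streams)
    for target in range(len(streams[selected]["map"])):
        src = streams[selected]["map"][target]  # re-read: earlier swaps update it
        if src == target:
            continue
        chunk_container[target], chunk_container[src] = (
            chunk_container[src], chunk_container[target])
        for meta in streams.values():  # fix all pointers affected by the swap
            meta["map"] = [src if p == target else target if p == src else p
                           for p in meta["map"]]
```

After the pass, the selected stream reads back from consecutive container positions (fewer seeks), while lower-priority streams remain correct but possibly more fragmented.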
Specification