Scalable chunk store for data deduplication

US 10,394,757 B2
Filed: 11/18/2010
Issued: 08/27/2019
Est. Priority Date: 11/18/2010
Status: Active Grant

Abstract

Data streams may be stored in a chunk store in the form of stream maps and data chunks. Data chunks corresponding to a data stream may be stored in a chunk container, and a stream map corresponding to the data stream may point to the data chunks in the chunk container. Multiple stream maps may be stored in a stream container, and may point to the data chunks in the chunk container in a manner that duplicate data chunks are not present. Techniques are provided herein for localizing the storage of related data chunks in such chunk containers, for locating data chunks stored in chunk containers, for storing data streams in chunk stores in localized manners that enhance locality and decrease fragmentation, and for reorganizing stored data streams in chunk stores.
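
The relationship between a chunk container and its stream maps can be illustrated with a minimal Python sketch. It is not the patented implementation: fixed-size chunking, SHA-256 as the duplicate test, and every name used (`ChunkMeta`, `ChunkContainer`, `store_stream`, `read_stream`) are assumptions made for illustration only.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ChunkMeta:
    """Per-chunk stream map entry (field names are hypothetical)."""
    stream_offset: int      # offset of the chunk in the original data stream
    container_offset: int   # location of the chunk's bytes in the container
    length: int
    locality: int           # locality indicator shared by one contiguous run


@dataclass
class ChunkContainer:
    data: bytearray = field(default_factory=bytearray)
    index: dict = field(default_factory=dict)  # digest -> (offset, length, locality)
    next_locality: int = 0


def store_stream(container: ChunkContainer, stream: bytes, chunk_size: int = 4):
    """Store only unseen chunks, appended contiguously in stream order."""
    locality = container.next_locality  # one new value per stored stream
    container.next_locality += 1
    stream_map = []
    for pos in range(0, len(stream), chunk_size):  # assumption: fixed-size chunks
        chunk = stream[pos:pos + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        hit = container.index.get(digest)
        if hit is None:
            # New chunk: append it and tag it with this stream's locality value.
            hit = (len(container.data), len(chunk), locality)
            container.data += chunk
            container.index[digest] = hit
        # Duplicate chunks keep the locality value of the stream that stored them.
        offset, length, chunk_locality = hit
        stream_map.append(ChunkMeta(pos, offset, length, chunk_locality))
    return stream_map


def read_stream(container: ChunkContainer, stream_map) -> bytes:
    """Rebuild a stream by following its stream map into the container."""
    return b"".join(
        bytes(container.data[m.container_offset:m.container_offset + m.length])
        for m in stream_map
    )
```

Storing `b"AAAABBBBCCCC"` and then `b"BBBBDDDDEEEE"` in the same container appends only the `DDDD` and `EEEE` chunks for the second stream; its stream map points back at the existing `BBBB` chunk.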

83 Citations

19 Claims

1. A method, comprising:
- parsing a data stream into a sequence of data chunks;
  
  determining whether any of the sequence of data chunks are stored in a chunk container that includes a plurality of data chunks;
  
  storing, in a contiguous arrangement and in a same sequence in the chunk container as in the data stream, data chunks of the sequence of data chunks determined to not be stored in the chunk container;
  
  generating a stream map that is a data structure that describes a mapping between a structure of the data stream and an optimized structure of the data chunks stored in the chunk container to enable data chunks referenced in the stream map to be located in the chunk container, the optimized structure including data chunks that have been deduplicated, the stream map including metadata for each data chunk of the sequence; and
  
  including, in the metadata for each of the data chunks stored in the contiguous arrangement, a same locality indicator value that indicates the contiguous arrangement and indicates that each of the data chunks stored in the contiguous arrangement is associated with the generated stream map.
- - 2. The method of claim 1, further comprising:
    - generating the metadata for each data chunk of the sequence of data chunks, the metadata for a data chunk of the sequence of data chunks including an offset for the data chunk in the data stream, a pointer to a location in the chunk container for the data chunk, and the locality indicator for the data chunk.
  - 3. The method of claim 2, further comprising:
    - persisting the stream map in a chunk store that includes the chunk container.
  - 4. The method of claim 1, further comprising:
    - parsing a second data stream into a second sequence of data chunks;
      
      determining that a first set of data chunks of the second sequence of data chunks includes one or more data chunks that are duplicates of data chunks already stored in the chunk container and that a second set of data chunks of the second sequence of data chunks is not stored in the chunk container;
      
      storing the second set of data chunks in the chunk container in a contiguous arrangement following the stored data chunks of the first sequence of data chunks and in a same sequence as in the second data stream; and
      
      storing a pointer for each of the first set of data chunks to the corresponding data chunk already stored in the chunk container.
  - 5. The method of claim 4, wherein each data chunk of the second data stream has associated the metadata including an offset for the data chunk in the second data stream, a pointer to a location in the chunk container for the data chunk, and the locality indicator for the data chunk, wherein each data chunk in the first sequence of data chunks has a first value for the locality indicator, the method further comprising:
    - assigning the first value to the locality indicator for each data chunk of the first set of data chunks;
      
      selecting a new locality indicator value associated with the second data stream; and
      
      assigning the new locality indicator value to the locality indicator for each data chunk in the second set of data chunks.
  - 6. The method of claim 1, further comprising:
    - in response to a request for a data stream, performing a first seek to locate a first data chunk of a first set of data chunks of the requested data stream in the chunk container, sequentially reading the first set of data chunks from the chunk container, performing a second seek to locate a first data chunk of a second set of data chunks of the requested data stream in the chunk container, and sequentially reading the second set of data chunks from the chunk container.
  - 7. The method of claim 1, further comprising:
    - generating a redirection table associated with the chunk container that stores information regarding data chunk location changes.
  - 8. The method of claim 7, further comprising:
    - receiving a request for a data chunk, the request including an identifier for the data chunk, the data chunk identifier including a chunk container identifier, a local identifier, a chunk container generation value, and a first chunk offset value;
      
      determining that a generation indication for the chunk container matching the chunk container identifier received in the request does not match the chunk container generation value received in the request;
      
      searching the redirection table for an entry that includes a match for the local identifier, the entry including a second chunk offset value that is different from the first chunk offset value; and
      
      retrieving the data chunk from the chunk container at the second chunk offset value.
  - 9. The method of claim 8, wherein the generation indication for the chunk container and the chunk container identifier are included in a header for the chunk container, the method further comprising:
    - modifying the contents of the chunk container;
      
      adding one or more entries to the redirection table that indicate changed chunk offset values for one or more data chunks of the chunk container due to said modifying; and
      
      increasing the generation indication in the chunk container header due to said modifying.
  - 10. The method of claim 8, further comprising:
    - replacing the first chunk offset value with the second chunk offset value in a stream map associated with the data stream; and
      
      deleting the entry from the redirection table.
  - 11. The method of claim 7, wherein the information regarding data chunk location changes maps an immutable per-container chunk identifier to a new offset value.
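
The redirection-table lookup recited in claims 7 through 11 can be sketched roughly as follows. This is an illustrative reading, not the patented implementation: the class and method names (`ChunkId`, `Container`, `move_chunk`, `read`, `refresh`) are hypothetical, a dictionary stands in for the container file, and falling back to the original offset when no redirection entry exists is an assumption.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ChunkId:
    """Chunk identifier as carried in a request (names are hypothetical)."""
    container_id: int
    local_id: int     # immutable per-container chunk identifier
    generation: int   # container generation when the identifier was issued
    offset: int       # chunk offset at that generation


@dataclass
class Container:
    container_id: int
    generation: int = 0                              # kept in the container header
    chunks: dict = field(default_factory=dict)       # offset -> bytes (file stand-in)
    redirection: dict = field(default_factory=dict)  # local_id -> new offset

    def move_chunk(self, local_id: int, old_offset: int, new_offset: int) -> None:
        """Relocate a chunk, record the change, and bump the generation."""
        self.chunks[new_offset] = self.chunks.pop(old_offset)
        self.redirection[local_id] = new_offset
        self.generation += 1

    def _current_offset(self, cid: ChunkId) -> int:
        if cid.container_id != self.container_id:
            raise ValueError("identifier refers to a different container")
        if cid.generation == self.generation:
            return cid.offset  # identifier is up to date; no table lookup needed
        # Stale identifier: consult the redirection table by local identifier.
        # Assumption: a chunk without an entry has not moved.
        return self.redirection.get(cid.local_id, cid.offset)

    def read(self, cid: ChunkId) -> bytes:
        return self.chunks[self._current_offset(cid)]

    def refresh(self, cid: ChunkId) -> ChunkId:
        """Return an identifier updated to the current generation and offset."""
        return replace(cid, generation=self.generation,
                       offset=self._current_offset(cid))
```

A stream map holding a stale identifier can still read its chunk after `move_chunk`, and can swap in the result of `refresh` so that later reads skip the table. Claim 10 additionally deletes the table entry after the stream map is updated; the sketch omits that step.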

12. A method for storing a data stream, comprising:
- (a) generating a stream map for the data stream that includes stream metadata;
  
  (b) storing an indication of a minimum allowable number of repeating data chunks in a chunk container;
  
  (c) accumulating a sequence of data chunks from the data stream;
  
  (d) determining whether the accumulated sequence of data chunks is a duplicate of any stored sequence of data chunks, the stored sequence of data chunks being stored contiguously in the chunk container;
  
  (e) in response to determining the accumulated sequence of data chunks is a duplicate of a stored sequence of data chunks, determining whether the accumulated sequence of data chunks includes a number of data chunks that is greater than or equal to the stored indication; and
  
  (f) storing in the stream metadata pointers to the stored sequence of data chunks in response to determining the accumulated sequence of data chunks to have a number of data chunks that is greater than or equal to the stored indication.
- - 13. The method of claim 12, further comprising:
    - (g) in response to determining the accumulated sequence of data chunks is not a duplicate of any stored sequence of data chunks having a number of data chunks greater than or equal to the stored indication, storing a first data chunk of the accumulated sequence in the chunk container, removing the first data chunk from the accumulated sequence of data chunks, and accumulating at least one additional data chunk in the accumulated sequence of data chunks to generate an updated accumulated sequence of data chunks.
  - 14. The method of claim 13, further comprising:
    - repeating (b)-(g) until each data chunk of the data stream is stored according to (f) or (g).
  - 15. The method of claim 12, further comprising:
    - completing generation of the stream map; and
      
      storing the stream map in a stream container.
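
The minimum-run rule of claims 12 through 14 can be approximated with the simplified greedy sketch below. It is an assumption-laden illustration rather than the claimed procedure: the container is modeled as a Python list of chunks, the stream map as a list of container positions, and the function name `store_with_min_run` is hypothetical. A run of duplicate chunks is stored as pointers only when it is at least `min_run` chunks long; shorter duplicate runs are stored again so the stream stays contiguous.

```python
def store_with_min_run(container: list, chunks: list, min_run: int) -> list:
    """Return a stream map (list of container positions) for `chunks`."""
    # Index every stored chunk by the positions where it occurs.
    index = {}
    for pos, chunk in enumerate(container):
        index.setdefault(chunk, []).append(pos)

    stream_map = []
    i = 0
    while i < len(chunks):
        # Find the longest contiguous stored run matching chunks[i:].
        best_pos, best_len = -1, 0
        for pos in index.get(chunks[i], []):
            n = 0
            while (i + n < len(chunks) and pos + n < len(container)
                   and container[pos + n] == chunks[i + n]):
                n += 1
            if n > best_len:
                best_pos, best_len = pos, n

        if best_len >= min_run:
            # Long enough: reference the stored run instead of storing it again.
            stream_map.extend(range(best_pos, best_pos + best_len))
            i += best_len
        else:
            # Too short, or not a duplicate: store this chunk and advance by one.
            container.append(chunks[i])
            index.setdefault(chunks[i], []).append(len(container) - 1)
            stream_map.append(len(container) - 1)
            i += 1
    return stream_map
```

With a container holding `["a", "b", "c", "d"]`, the stream `["x", "b", "c", "d", "y", "a", "z"]`, and `min_run=2`, the run `b, c, d` is stored as pointers, while the lone duplicate `a` is written again next to `y` and `z`.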

16. A method, comprising:
- receiving a portion of a data stream that includes a plurality of data chunks;
  
  determining a plurality of data chunk sequences in the plurality of data chunks, each determined data chunk sequence including a sequence of data chunks duplicating a stored sequence of data chunks stored contiguously in a chunk store;
  
  segmenting the plurality of data chunks into a number of data chunk sets corresponding to a fragmentation factor, where the data chunks of each determined data chunk sequence are included together in a data chunk set and the fragmentation factor indicates a maximum fragmentation for the segmenting of the plurality of data chunks;
  
  storing data chunks of a first group of the data chunk sets as pointers in data stream metadata to existing data chunks without storing data of the data chunks of the first group, the first group including data chunk sets that are sequences of data chunks duplicating sequences in the chunk store; and
  
  storing data chunks of a second group of the data chunk sets other than data chunks in the first group of the data chunk sets as new contiguous data chunks in the chunk store, the second group at least including data chunks that are not duplicates of data chunks in the chunk store.
- - 17. The method of claim 16, wherein said segmenting comprises:
    - segmenting the plurality of data chunks into a number of data chunk sets less than or equal to the fragmentation factor.
  - 18. The method of claim 16, further comprising:
    - storing duplicate data chunks of the second group of the data chunk sets in the chunk store.
  - 19. The method of claim 16, further comprising:
    - storing the data chunks of the second group of data chunks as second pointers in the data stream metadata.
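
One possible reading of the fragmentation factor in claims 16 through 18 is sketched below. The claims do not say how the chunk sets are chosen, so the policy here is purely an assumption: when a stream portion would split into more sets than the factor allows, the shortest duplicate run is demoted to newly stored data (stored again as duplicate chunks, as in claim 18) and merged with its neighbors. The segment representation `(is_duplicate_run, chunk_count)` and both function names are hypothetical.

```python
def merge_adjacent(segments):
    """Merge neighboring 'new data' segments; duplicate runs stay separate."""
    merged = []
    for is_dup, count in segments:
        if merged and not is_dup and not merged[-1][0]:
            merged[-1] = (False, merged[-1][1] + count)
        else:
            merged.append((is_dup, count))
    return merged


def limit_fragmentation(segments, factor):
    """Reduce a list of (is_dup, count) segments to at most `factor` sets."""
    segs = merge_adjacent(segments)
    while len(segs) > factor:
        dup_indexes = [i for i, (is_dup, _) in enumerate(segs) if is_dup]
        if not dup_indexes:
            break  # only new data remains; nothing left to demote
        # Assumed policy: give up the shortest duplicate run first, since
        # pointing at it saves the least space per extra seek.
        shortest = min(dup_indexes, key=lambda i: segs[i][1])
        segs[shortest] = (False, segs[shortest][1])
        segs = merge_adjacent(segs)
    return segs
```

For example, the five segments `[(False, 3), (True, 2), (False, 1), (True, 8), (False, 4)]` with a factor of 3 collapse to `[(False, 6), (True, 8), (False, 4)]`: the two-chunk duplicate run is stored again, and only the eight-chunk run remains a pointer set.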

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors: Cheung, Chun Ho (Ian); Oltean, Paul Adrian; Kalach, Ran; Gupta, Abhishek; Benton, James Robert; Desai, Ronakkumar
Primary Examiner(s): Singh, Amresh
Application Number: US12/949,391
Publication Number: US 20120131025A1
Time in Patent Office: 3,204 Days
Field of Search: 707755, 707802, 707803
CPC Class Codes:
G06F 16/122 using management policies b...
G06F 16/1752 based on file chunks
