SCALABLE CHUNK STORE FOR DATA DEDUPLICATION
Abstract
Data streams may be stored in a chunk store in the form of stream maps and data chunks. Data chunks corresponding to a data stream may be stored in a chunk container, and a stream map corresponding to the data stream may point to the data chunks in the chunk container. Multiple stream maps may be stored in a stream container, and may point to the data chunks in the chunk container in a manner that duplicate data chunks are not present. Techniques are provided herein for localizing the storage of related data chunks in such chunk containers, for locating data chunks stored in chunk containers, for storing data streams in chunk stores in localized manners that enhance locality and decrease defragmentation, and for reorganizing stored data streams in chunks stores.
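The layout described in the abstract can be illustrated with a minimal sketch. This is not the patent's on-disk format; all names (ChunkStore, stream_container, etc.) are hypothetical, and content hashes are assumed to identify duplicate chunks:

```python
import hashlib

# Illustrative sketch of the chunk-store layout from the abstract:
# a chunk container holds unique chunk payloads, and each data stream's
# stream map holds pointers (here, indices) into the chunk container,
# so duplicate chunks are never stored twice.
class ChunkStore:
    def __init__(self):
        self.chunk_container = []   # unique chunk payloads, in insertion order
        self.index_by_hash = {}     # content hash -> position in chunk_container
        self.stream_container = {}  # stream name -> stream map (list of positions)

    def add_stream(self, name, chunks):
        stream_map = []
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            pos = self.index_by_hash.get(digest)
            if pos is None:  # new chunk: append to the chunk container
                pos = len(self.chunk_container)
                self.chunk_container.append(chunk)
                self.index_by_hash[digest] = pos
            stream_map.append(pos)  # duplicate chunks share one stored copy
        self.stream_container[name] = stream_map

    def read_stream(self, name):
        return b"".join(self.chunk_container[p] for p in self.stream_container[name])
```

Two streams that share chunks then store each shared chunk once, with both stream maps pointing at the same container position.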
20 Claims
1. A method, comprising:

parsing a data stream into a sequence of data chunks;

determining whether any of the sequence of data chunks are stored in a chunk container that includes a plurality of data chunks; and

storing data chunks of the sequence of data chunks determined to not be stored in the chunk container in a contiguous arrangement and in a same sequence in the chunk container as in the data stream.

Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
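The storage step of claim 1 can be sketched as follows. This is a hypothetical helper, not the patented code; it assumes a hash set tracks which chunks are already in the container:

```python
import hashlib

def store_new_chunks(chunk_container, stored_hashes, data_chunks):
    """Sketch of claim 1: chunks not already in the chunk container are
    appended contiguously, in the same sequence as in the data stream;
    chunks already present are skipped."""
    for chunk in data_chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in stored_hashes:    # not yet in the chunk container
            chunk_container.append(chunk)  # contiguous append, stream order
            stored_hashes.add(digest)
```

Because new chunks are only ever appended, related chunks of one stream end up adjacent in the container, which is the locality property the abstract emphasizes.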
12. A method for storing a data stream, comprising:

(a) storing an indication of a minimum allowable number of repeating data chunks;

(b) accumulating a sequence of data chunks from the data stream;

(c) determining whether the accumulated sequence of data chunks is a duplicate of any stored sequence of data chunks, the stored sequence of data chunks being stored contiguously in the chunk container;

(d) if the accumulated sequence of data chunks is determined to be a duplicate of a stored sequence of data chunks, determining whether the accumulated sequence includes a number of data chunks that is greater than or equal to the stored indication; and

(e) storing in the stream metadata pointers to the stored sequence of data chunks if the accumulated sequence of data chunks is determined to have a number of data chunks that is greater than or equal to the stored indication.

Dependent claims: 13, 14
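Steps (c)-(e) of claim 12 amount to a threshold check before referencing a stored run. A minimal sketch, with hypothetical names and an illustrative threshold value (the patent does not prescribe one):

```python
MIN_REPEATING_CHUNKS = 3  # (a) stored indication; illustrative value

def reference_if_long_enough(accumulated, stored_sequences, stream_metadata):
    """Sketch of steps (c)-(e) of claim 12. stored_sequences maps a tuple
    of chunk hashes to the contiguous offsets of that sequence in the
    chunk container. Returns True if pointers were stored, False if the
    caller should store the chunks as new data instead."""
    offsets = stored_sequences.get(tuple(accumulated))  # (c) duplicate check
    if offsets is not None and len(accumulated) >= MIN_REPEATING_CHUNKS:
        stream_metadata.extend(offsets)  # (e) store pointers only, no data
        return True
    return False
```

The threshold trades deduplication ratio for locality: very short duplicate runs are re-stored as new contiguous data rather than scattering the stream's pointers across the container.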
15. A method, comprising:

receiving a portion of a data stream that includes a plurality of data chunks;

determining a plurality of data chunk sequences in the plurality of data chunks, each determined data chunk sequence including a sequence of data chunks duplicating a stored sequence of data chunks stored contiguously in a chunk store;

segmenting the plurality of data chunks into a number of data chunk sets, where the data chunks of each determined data chunk sequence are included together in a data chunk set;

storing data chunks of a first group of the data chunk sets as pointers in data stream metadata to existing data chunks without storing data of the data chunks of the first group, the first group including data chunk sets that are sequences of data chunks duplicating sequences in the chunk store; and

storing data chunks of a second group of the data chunk sets other than data chunks in the first group of the data chunk sets as new contiguous data chunks in the chunk store, the second group at least including data chunks that are not duplicates of data chunks in the chunk store.

Dependent claims: 16
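The segmentation in claim 15 can be sketched as one pass over the chunks, assuming the duplicate runs have already been matched against the chunk store. All names are hypothetical; `duplicate_spans` stands in for whatever index lookup produced the matches:

```python
def segment_and_store(chunks, duplicate_spans, chunk_container, stream_metadata):
    """Sketch of claim 15. duplicate_spans lists (start, end, offsets) for
    runs chunks[start:end] that duplicate a sequence stored contiguously in
    the chunk container (first group: stored as pointers only); chunks
    outside those spans (second group) are stored as new contiguous data."""
    i = 0
    for start, end, offsets in sorted(duplicate_spans):
        for chunk in chunks[i:start]:  # second group: new data before the span
            stream_metadata.append(len(chunk_container))
            chunk_container.append(chunk)
        stream_metadata.extend(offsets)  # first group: pointers, no data stored
        i = end
    for chunk in chunks[i:]:  # trailing second-group chunks
        stream_metadata.append(len(chunk_container))
        chunk_container.append(chunk)
```

Either way, the stream map records one pointer per chunk in stream order, so the stream can be reassembled by a single walk over its metadata.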
17. A method of reorganizing stored data streams, comprising:

prioritizing a plurality of data streams stored as data chunks in a chunk store; and

determining a reorganization of the stored data chunks of the plurality of data streams according to said prioritizing, said reorganizing including:

selecting a de-duplicated data stream;

relocating one or more data chunks of the selected data stream to decrease fragmentation of the selected data stream by displacing at least one data chunk of a data stream having a lower priority than the selected data stream; and

updating pointers in data stream metadata to the relocated one or more data chunks.

Dependent claims: 18, 19, 20
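One way to realize the relocation in claim 17 is to swap the selected stream's chunks into a contiguous run, then patch every stream map. This is a simplified sketch under stated assumptions (each pointer appears once per map; all names are hypothetical), not the patented algorithm:

```python
def defragment_stream(selected, streams, chunk_container):
    """Sketch of claim 17: relocate the selected de-duplicated stream's
    chunks to the front of the chunk container so they are contiguous,
    displacing chunks of lower-priority streams by swapping, then update
    pointers in every stream's metadata to the relocated chunks."""
    # per the claim, the displaced streams have lower priority than the selected one
    assert all(streams[s]["priority"] <= streams[selected]["priority"]
               for s in streams)
    for target in range(len(streams[selected]["map"])):
        src = streams[selected]["map"][target]  # re-read: earlier swaps update it
        if src == target:
            continue
        chunk_container[target], chunk_container[src] = (
            chunk_container[src], chunk_container[target])
        for meta in streams.values():  # fix all pointers affected by the swap
            meta["map"] = [src if p == target else target if p == src else p
                           for p in meta["map"]]
```

After the pass, the selected stream reads back from consecutive container positions (fewer seeks), while lower-priority streams remain correct but possibly more fragmented.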
Specification