System and method for balancing compression and read performance in a storage system

US 10,216,754 B1
Filed: 09/26/2013
Issued: 02/26/2019
Est. Priority Date: 09/26/2013
Status: Active Grant

Abstract

Techniques for balancing data compression and read performance of data chunks of a storage system are described herein. According to one embodiment, similar data chunks are identified based on sketches of a plurality of data chunks stored in the storage system. A first portion of the similar data chunks as a first group is associated with a first storage area. The first storage area is associated with one or more data chunks that are dissimilar to the first group but are likely accessed together. The first group of the similar data chunks and its associated dissimilar data chunks are compressed and stored in the first storage area.
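
The trade-off in the abstract can be illustrated with a minimal sketch. It assumes `zlib` as the compressor and uses hypothetical helper names (`build_region`, `read_chunk`) and synthetic data; none of these come from the patent. Similar chunks and a dissimilar but co-accessed chunk are compressed as one region, and a read decompresses the whole region.

```python
import random
import zlib


def build_region(chunks):
    """Compress an ordered list of (chunk_id, data) pairs as one region.

    Returns the compressed blob and an index mapping
    chunk_id -> (offset, length) in the uncompressed region.
    """
    index, offset = {}, 0
    for chunk_id, data in chunks:
        index[chunk_id] = (offset, len(data))
        offset += len(data)
    blob = zlib.compress(b"".join(data for _, data in chunks))
    return blob, index


def read_chunk(blob, index, chunk_id):
    """Read one chunk by decompressing its entire region."""
    offset, length = index[chunk_id]
    return zlib.decompress(blob)[offset:offset + length]


# Hypothetical data: three versions of one chunk, plus an unrelated chunk
# that is assumed to be read together with them.
rng = random.Random(0)
base = bytes(rng.getrandbits(8) for _ in range(4096))
v2 = base[:100] + b"edit" + base[104:]
v3 = base[:2000] + b"another edit" + base[2012:]
neighbor = bytes(rng.getrandbits(8) for _ in range(4096))

group = [("v1", base), ("v2", v2), ("v3", v3), ("neighbor", neighbor)]
region, index = build_region(group)
separate = sum(len(zlib.compress(data)) for _, data in group)
```

In this sketch the grouped region is smaller than the four chunks compressed separately, because the compressor can reference the earlier versions, while a single region load serves all four chunks.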

26 Claims

1. A computer-implemented method for balancing data compression and read performance of data chunks of a storage system, the method comprising:
- identifying similar data chunks based on sketches of a plurality of data chunks stored in the storage system;
  
  ordering the similar data chunks of the storage system to be positioned close to each other by
    - scanning a metadata to retrieve chunk identifiers (IDs) and sketches of the plurality of data chunks, wherein each sketch includes a plurality of super features, each super feature being based on hashing one or more concatenated maximum hashes or minimum hashes of sub-regions of the corresponding data chunk,
    - storing the chunk IDs and sketches in a data structure, wherein the data structure includes a plurality of entries, each corresponding to one of the sketches and its respective chunk ID, and
    - sorting the entries of the data structure based on the sketches of the plurality of data chunks of the storage system, including
      - determining that a first sketch of the sketches includes a first feature and a second feature,
      - sorting the entries of the data structure based on the first feature,
      - identifying a subset of the entries of the data structure that are associated with the first feature, and
      - sorting the subset of the entries of the data structure based on the second feature,
  
  wherein the similar data chunks of the storage system are rearranged based on the sorted entries such that similar data chunks of the storage system are positioned close to each other;
  
  associating a first portion of the similar data chunks as a first group with a first storage container;
  
  associating with the first storage container one or more data chunks that are dissimilar to the first group but are likely accessed together;
  
  compressing the first group of the similar data chunks and its associated dissimilar data chunks in a first compression region of the first storage container, wherein the first storage container contains a plurality of compression regions, each compression region storing a plurality of data chunks and is represented by a region sketch that is generated based on sketches of the plurality of data chunks stored therein for purposes of identifying similar data chunks, wherein the region sketch is generated by one or more selected super features for the container, wherein the one or more selected super features includes: a maximum chunk super feature, or a minimum chunk super feature; and
  
  storing the first storage container in a persistent storage device of the storage system that stores a plurality of storage containers, wherein a data chunk stored in the persistent storage device is accessed by loading an entire compression region of a container associated with the data chunk into a memory, such that a number of input and output (IO) transactions is reduced.
- Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 26
  - 2. The method of claim 1, further comprising:
    - associating a second portion of the similar data chunks as a second group with a second storage area;
      
      associating with the second storage area one or more data chunks that are dissimilar to the second group but are likely accessed together; and
      
      compressing and storing the second group of the similar data chunks and its associated dissimilar data chunks in the second storage area.
  - 3. The method of claim 1, wherein a number of similar data chunks associated with the first storage container is limited to a predetermined minimum or maximum threshold.
  - 4. The method of claim 1, wherein the dissimilar data chunks are located near one or more of the similar data chunks in one or more files.
  - 5. The method of claim 1, wherein the dissimilar data chunks were accessed within a predetermined period of time in which the similar data chunks were accessed.
  - 6. The method of claim 1, wherein the similar data chunks are identified from data chunks associated with one or more files that have not been accessed for a predetermined period of time.
  - 7. The method of claim 6, wherein data chunks that have been recently accessed are not reorganized based on their similarity.
  - 8. The method of claim 1, wherein the dissimilar chunks include a second group of similar data chunks that is not similar to the first group of similar data chunks.
  - 9. The method of claim 8, wherein the similar data chunks of the first group represents different versions of a first data chunk, and wherein the similar data chunks of the second group represents different versions of a second data chunk.
  - 10. The method of claim 1, further comprising:
    - determining that a third data chunk compressed and stored in a third storage area and a fourth data chunk compressed and stored in a fourth storage area are accessed frequently; and
      
      reorganizing data chunks stored in the third and fourth storage areas, such that the third data chunk and the fourth data chunk are compressed and stored together regardless whether they are similar.
  - 26. The method of claim 1, further comprising reorganizing newly received data chunks online by
    - identifying newly received data chunks; and
      
      incorporating newly received data chunks with existing similar data chunks based on their respective sketches prior to compressing the newly received data chunks for storage.
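
The sketch computation and two-level sort recited in claim 1 can be illustrated as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the salted BLAKE2b hash, the 32-byte sliding window, the choice of two super features built from two maxima each, and the function names `feature`, `sketch`, and `sort_entries` are all hypothetical.

```python
import hashlib
from itertools import groupby


def feature(chunk: bytes, index: int, window: int = 32) -> int:
    """Maximum salted hash over all sliding-window sub-regions of a chunk."""
    salt = index.to_bytes(2, "big")  # a different salt per feature (assumption)
    best = 0
    for start in range(max(1, len(chunk) - window + 1)):
        digest = hashlib.blake2b(
            chunk[start:start + window], digest_size=8, salt=salt
        ).digest()
        best = max(best, int.from_bytes(digest, "big"))
    return best


def sketch(chunk: bytes, n_super: int = 2, per_super: int = 2) -> tuple:
    """Each super feature hashes a concatenation of several maximum hashes."""
    supers = []
    for s in range(n_super):
        maxima = b"".join(
            feature(chunk, s * per_super + k).to_bytes(8, "big")
            for k in range(per_super)
        )
        digest = hashlib.blake2b(maxima, digest_size=8).digest()
        supers.append(int.from_bytes(digest, "big"))
    return tuple(supers)


def sort_entries(entries):
    """Sort (chunk_id, sketch) entries by first feature, then sort each
    subset sharing a first feature by the second feature."""
    by_first = sorted(entries, key=lambda e: e[1][0])
    ordered = []
    for _, subset in groupby(by_first, key=lambda e: e[1][0]):
        ordered.extend(sorted(subset, key=lambda e: e[1][1]))
    return ordered


# Hand-made sketches so the ordering is easy to follow.
entries = [("c", (5, 9)), ("a", (2, 7)), ("b", (5, 1)), ("d", (2, 3))]
ordered_ids = [chunk_id for chunk_id, _ in sort_entries(entries)]
```

With the hand-made entries, chunks sharing a first feature end up adjacent (`d`, `a`, then `b`, `c`), which is the order in which the chunks would be rearranged.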

11. A non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations for balancing data compression and read performance of data chunks of a storage system, the operations comprising:
- identifying similar data chunks based on sketches of a plurality of data chunks stored in the storage system;
  
  ordering the similar data chunks of the storage system to be positioned close to each other by
    - scanning a metadata to retrieve chunk identifiers (IDs) and sketches of the plurality of data chunks, wherein each sketch includes a plurality of super features, each super feature being based on hashing one or more concatenated maximum hashes or minimum hashes of sub-regions of the corresponding data chunk,
    - storing the chunk IDs and sketches in a data structure, wherein the data structure includes a plurality of entries, each corresponding to one of the sketches and its respective chunk ID, and
    - sorting the entries of the data structure based on the sketches of the plurality of data chunks of the storage system, including
      - determining that a first sketch of the sketches includes a first feature and a second feature,
      - sorting the entries of the data structure based on the first feature,
      - identifying a subset of the entries of the data structure that are associated with the first feature, and
      - sorting the subset of the entries of the data structure based on the second feature,
  
  wherein the similar data chunks of the storage system are rearranged based on the sorted entries such that similar data chunks of the storage system are positioned close to each other;
  
  associating a first portion of the similar data chunks as a first group with a first storage container;
  
  associating with the first storage container one or more data chunks that are dissimilar to the first group but are likely accessed together;
  
  compressing the first group of the similar data chunks and its associated dissimilar data chunks in a first compression region of the first storage container, wherein the first storage container contains a plurality of compression regions, each compression region storing a plurality of data chunks and is represented by a region sketch that is generated based on sketches of the plurality of data chunks stored therein for purposes of identifying similar data chunks, wherein the region sketch is generated by one or more selected super features for the container, wherein the one or more selected super features includes: a maximum chunk super feature, or a minimum chunk super feature; and
  
  storing the first storage container in a persistent storage device of the storage system that stores a plurality of storage containers, wherein a data chunk stored in the persistent storage device is accessed by loading an entire compression region of a container associated with the data chunk into a memory, such that a number of input and output (IO) transactions is reduced.
- Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
  - 12. The non-transitory machine-readable medium of claim 11, wherein the operations further comprise:
    - associating a second portion of the similar data chunks as a second group with a second storage container;
      
      associating with the second storage container one or more data chunks that are dissimilar to the second group but are likely accessed together; and
      
      compressing and storing the second group of the similar data chunks and its associated dissimilar data chunks in the second storage container.
  - 13. The non-transitory machine-readable medium of claim 11, wherein a number of similar data chunks associated with the first storage container is limited to a predetermined minimum or maximum threshold.
  - 14. The non-transitory machine-readable medium of claim 11, wherein the dissimilar data chunks are located near one or more of the similar data chunks in one or more files.
  - 15. The non-transitory machine-readable medium of claim 11, wherein the dissimilar data chunks were accessed within a predetermined period of time in which the similar data chunks were accessed.
  - 16. The non-transitory machine-readable medium of claim 11, wherein the similar data chunks are identified from data chunks associated with one or more files that have not been accessed for a predetermined period of time.
  - 17. The non-transitory machine-readable medium of claim 16, wherein data chunks that have been recently accessed are not reorganized based on their similarity.
  - 18. The non-transitory machine-readable medium of claim 11, wherein the dissimilar chunks include a second group of similar data chunks that is not similar to the first group of similar data chunks.
  - 19. The non-transitory machine-readable medium of claim 18, wherein the similar data chunks of the first group represents different versions of a first data chunk, and wherein the similar data chunks of the second group represents different versions of a second data chunk.
  - 20. The non-transitory machine-readable medium of claim 11, wherein operations further comprise:
    - determining that a third data chunk compressed and stored in a third storage container and a fourth data chunk compressed and stored in a fourth storage container are accessed frequently; and
      
      reorganizing data chunks stored in the third and fourth storage containers, such that the third data chunk and the fourth data chunk are compressed and stored together regardless whether they are similar.
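
The region sketch recited in the independent claims can be read as a per-position selection over the chunk sketches in a compression region. The following is a minimal sketch of that reading; taking the maximum (or minimum) at each super-feature position, the matching rule in `candidate_regions`, and both function names are assumptions rather than details confirmed by the claims.

```python
def region_sketch(chunk_sketches, select=max):
    """Summarize a compression region by selecting, for each super-feature
    position, the maximum (or minimum) value among its chunk sketches."""
    if not chunk_sketches:
        raise ValueError("a region needs at least one chunk sketch")
    return tuple(select(column) for column in zip(*chunk_sketches))


def candidate_regions(chunk_sketch, region_sketches):
    """Return ids of regions whose sketch matches the chunk's sketch in at
    least one super-feature position (hypothetical matching rule)."""
    return [
        region_id
        for region_id, sketch in region_sketches.items()
        if any(a == b for a, b in zip(chunk_sketch, sketch))
    ]


chunk_sketches = [(3, 9), (7, 2), (5, 5)]
max_sketch = region_sketch(chunk_sketches)        # (7, 9)
min_sketch = region_sketch(chunk_sketches, min)   # (3, 2)
matches = candidate_regions((7, 1), {"r1": max_sketch, "r2": (4, 4)})
```

Keeping one small sketch per region, rather than one per chunk, is one way a new chunk could be matched against existing regions without scanning every stored chunk sketch.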

21. A data processing system, comprising:
- a processor; and
  
  a memory coupled to the processor for storing instructions, which when executed from the memory, cause the processor to perform operations, the operations including
    - identifying similar data chunks based on sketches of a plurality of data chunks stored in the storage system;
    - ordering the similar data chunks of the storage system to be positioned close to each other by
      - scanning a metadata to retrieve chunk identifiers (IDs) and sketches of the plurality of data chunks, wherein each sketch includes a plurality of super features, each super feature being based on hashing one or more concatenated maximum hashes or minimum hashes of sub-regions of the corresponding data chunk,
      - storing the chunk IDs and sketches in a data structure, wherein the data structure includes a plurality of entries, each corresponding to one of the sketches and its respective chunk ID, and
      - sorting the entries of the data structure based on the sketches of the plurality of data chunks of the storage system, including
        - determining that a first sketch of the sketches includes a first feature and a second feature,
        - sorting the entries of the data structure based on the first feature,
        - identifying a subset of the entries of the data structure that are associated with the first feature, and
        - sorting the subset of the entries of the data structure based on the second feature,
      
      wherein the similar data chunks of the storage system are rearranged based on the sorted entries such that similar data chunks of the storage system are positioned close to each other,
    - associating a first portion of the similar data chunks as a first group with a first storage container,
    - associating with the first storage container one or more data chunks that are dissimilar to the first group but are likely accessed together,
    - compressing the first group of the similar data chunks and its associated dissimilar data chunks in a first compression region of the first storage container, wherein the first storage container contains a plurality of compression regions, each compression region storing a plurality of data chunks and is represented by a region sketch that is generated based on sketches of the plurality of data chunks stored therein for purposes of identifying similar data chunks, wherein the region sketch is generated by one or more selected super features for the container, wherein the one or more selected super features includes: a maximum chunk super feature, or a minimum chunk super feature, and
    - storing the first storage container in a persistent storage device of the storage system that stores a plurality of storage containers, wherein a data chunk stored in the persistent storage device is accessed by loading an entire compression region of a container associated with the data chunk into a memory, such that a number of input and output (IO) transactions is reduced.
- Dependent claims: 22, 23, 24, 25
  - 22. The system of claim 21, wherein the operations further comprise:
    - associating a second portion of the similar data chunks as a second group with a second storage container;
      
      associating with the second storage container one or more data chunks that are dissimilar to the second group but are likely accessed together; and
      
      compressing and storing the second group of the similar data chunks and its associated dissimilar data chunks in the second storage container.
  - 23. The system of claim 21, wherein a number of similar data chunks associated with the first storage container is limited to a predetermined minimum or maximum threshold.
  - 24. The system of claim 21, wherein the dissimilar data chunks are located near one or more of the similar data chunks in one or more files.
  - 25. The system of claim 21, wherein the dissimilar data chunks were accessed within a predetermined period of time in which the similar data chunks were accessed.
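
Two of the dependent limitations, the cap on similar chunks per container and the access-time window for dissimilar chunks, lend themselves to a short illustration. The following is a minimal sketch that assumes access times are available as numeric timestamps in seconds; the function names and the sample values are hypothetical.

```python
def co_accessed(similar_access_times, other_access_times, window_seconds):
    """Return ids of dissimilar chunks whose access time falls within
    window_seconds of any access to a similar chunk."""
    return [
        chunk_id
        for chunk_id, accessed_at in other_access_times.items()
        if any(abs(accessed_at - t) <= window_seconds
               for t in similar_access_times)
    ]


def cap_groups(chunk_ids, max_per_container):
    """Split an ordered run of similar chunks so that no container
    receives more than max_per_container of them."""
    return [
        chunk_ids[i:i + max_per_container]
        for i in range(0, len(chunk_ids), max_per_container)
    ]


# Similar chunks were accessed at t=100 s and t=160 s.
neighbors = co_accessed([100.0, 160.0],
                        {"x": 130.0, "y": 400.0, "z": 165.0},
                        window_seconds=30)
groups = cap_groups(["a", "b", "c", "d", "e"], max_per_container=2)
```

Here `x` and `z` qualify as co-accessed neighbors while `y` does not, and the five similar chunks are spread over three containers.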

Current Assignee
EMC IP Holding Company LLC (Dell Technologies Inc.)
Original Assignee
EMC IP Holding Company LLC (Dell Technologies Inc.)
Inventors
Douglis, Frederick; Shilane, Philip; Wallace, Grant
Primary Examiner(s)
Hu, Jensen

Application Number
US14/038,632
Time in Patent Office
1,979 Days
Field of Search
707/693
CPC Class Codes
G06F 16/1744 using compression, e.g. spa...
