System and method for balancing compression and read performance in a storage system
First Claim
1. A computer-implemented method for balancing data compression and read performance of data chunks of a storage system, the method comprising:
- identifying similar data chunks based on sketches of a plurality of data chunks stored in the storage system;
ordering the similar data chunks of the storage system to be positioned close to each other byscanning a metadata to retrieve chunk identifiers (IDs) and sketches of the plurality of data chunks, wherein each sketch includes a plurality of super features, each super feature being based on hashing one or more concatenated maximum hashes or minimum hashes of sub-regions of the corresponding data chunk,storing the chunk IDs and sketches in a data structure, wherein the data structure includes a plurality of entries, each corresponding to one of the sketches and its respective chunk ID, andsorting the entries of the data structure based on the sketches of the plurality of data chunks of the storage system, includingdetermining that a first sketch of the sketches includes a first feature and a second feature,sorting the entries of the data structure based on the first feature,identifying a subset of the entries of the data structure that are associated with the first feature, andsorting the subset of the entries of the data structure based on the second feature, wherein the similar data chunks of the storage system are rearranged based on the sorted entries such that similar data chunks of the storage system are positioned close to each other;
associating a first portion of the similar data chunks as a first group with a first storage container;
associating with the first storage container one or more data chunks that are dissimilar to the first group but are likely accessed together;
compressing the first group of the similar data chunks and its associated dissimilar data chunks in a first compression region of the first storage container, wherein the first storage container contains a plurality of compression regions, each compression region storing a plurality of data chunks and is represented by a region sketch that is generated based on sketches of the plurality of data chunks stored therein for purposes of identifying similar data chunks, wherein the region sketch is generated by one or more selected super features for the container, wherein the one or more selected super features includes;
a maximum chunk super feature, or a minimum chunk super feature; and
storing the first storage container in a persistent storage device of the storage system that stores a plurality of storage containers, wherein a data chunk stored in the persistent storage device is accessed by loading an entire compression region of a container associated with the data chunk into a memory, such that a number of input and output (TO) transactions is reduced.
11 Assignments
0 Petitions
Accused Products
Abstract
Techniques for balancing data compression and read performance of data chunks of a storage system are described herein. According to one embodiment, similar data chunks are identified based on sketches of a plurality of data chunks stored in the storage system. A first portion of the similar data chunks as a first group is associated with a first storage area. The first storage area is associated with one or more data chunks that are dissimilar to the first group but are likely accessed together. The first group of the similar data chunks and its associated dissimilar data chunks are compressed and stored in the first storage area.
-
Citations
26 Claims
-
1. A computer-implemented method for balancing data compression and read performance of data chunks of a storage system, the method comprising:
-
identifying similar data chunks based on sketches of a plurality of data chunks stored in the storage system; ordering the similar data chunks of the storage system to be positioned close to each other by scanning a metadata to retrieve chunk identifiers (IDs) and sketches of the plurality of data chunks, wherein each sketch includes a plurality of super features, each super feature being based on hashing one or more concatenated maximum hashes or minimum hashes of sub-regions of the corresponding data chunk, storing the chunk IDs and sketches in a data structure, wherein the data structure includes a plurality of entries, each corresponding to one of the sketches and its respective chunk ID, and sorting the entries of the data structure based on the sketches of the plurality of data chunks of the storage system, including determining that a first sketch of the sketches includes a first feature and a second feature, sorting the entries of the data structure based on the first feature, identifying a subset of the entries of the data structure that are associated with the first feature, and sorting the subset of the entries of the data structure based on the second feature, wherein the similar data chunks of the storage system are rearranged based on the sorted entries such that similar data chunks of the storage system are positioned close to each other; associating a first portion of the similar data chunks as a first group with a first storage container; associating with the first storage container one or more data chunks that are dissimilar to the first group but are likely accessed together; compressing the first group of the similar data chunks and its associated dissimilar data chunks in a first compression region of the first storage container, wherein the first storage container contains a plurality of compression regions, each compression region storing a plurality of data chunks and is represented by a region sketch that is generated based on sketches of the plurality of data chunks stored therein for purposes of identifying similar data chunks, wherein the region sketch is generated by one or more selected super features for the container, wherein the one or more selected super features includes;
a maximum chunk super feature, or a minimum chunk super feature; andstoring the first storage container in a persistent storage device of the storage system that stores a plurality of storage containers, wherein a data chunk stored in the persistent storage device is accessed by loading an entire compression region of a container associated with the data chunk into a memory, such that a number of input and output (TO) transactions is reduced. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 26)
-
-
11. A non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations for balancing data compression and read performance of data chunks of a storage system, the operations comprising:
-
identifying similar data chunks based on sketches of a plurality of data chunks stored in the storage system; ordering the similar data chunks of the storage system to be positioned close to each other by scanning a metadata to retrieve chunk identifiers (IDs) and sketches of the plurality of data chunks, wherein each sketch includes a plurality of super features, each super feature being based on hashing one or more concatenated maximum hashes or minimum hashes of sub-regions of the corresponding data chunk, storing the chunk IDs and sketches in a data structure, wherein the data structure includes a plurality of entries, each corresponding to one of the sketches and its respective chunk ID, and sorting the entries of the data structure based on the sketches of the plurality of data chunks of the storage system, including determining that a first sketch of the sketches includes a first feature and a second feature, sorting the entries of the data structure based on the first feature, identifying a subset of the entries of the data structure that are associated with the first feature, and sorting the subset of the entries of the data structure based on the second feature, wherein the similar data chunks of the storage system are rearranged based on the sorted entries such that similar data chunks of the storage system are positioned close to each other; associating a first portion of the similar data chunks as a first group with a first storage container; associating with the first storage container one or more data chunks that are dissimilar to the first group but are likely accessed together; compressing the first group of the similar data chunks and its associated dissimilar data chunks in a first compression region of the first storage container, wherein the first storage container contains a plurality of compression regions, each compression region storing a plurality of data chunks and is represented by a region sketch that is generated based on sketches of the plurality of data chunks stored therein for purposes of identifying similar data chunks, wherein the region sketch is generated by one or more selected super features for the container, wherein the one or more selected super features includes;
a maximum chunk super feature, or a minimum chunk super feature; andstoring the first storage container in a persistent storage device of the storage system that stores a plurality of storage containers, wherein a data chunk stored in the persistent storage device is accessed by loading an entire compression region of a container associated with the data chunk into a memory, such that a number of input and output (TO) transactions is reduced. - View Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
-
-
21. A data processing system, comprising:
-
a processor; and a memory coupled to the processor for storing instructions, which when executed by from the memory, cause the processor to perform operations, the operations including identifying similar data chunks based on sketches of a plurality of data chunks stored in the storage system; ordering the similar data chunks of the storage system to be positioned close to each other by scanning a metadata to retrieve chunk identifiers (IDs) and sketches of the plurality of data chunks, wherein each sketch includes a plurality of super features, each super feature being based on hashing one or more concatenated maximum hashes or minimum hashes of sub-regions of the corresponding data chunk, storing the chunk IDs and sketches in a data structure, wherein the data structure includes a plurality of entries, each corresponding to one of the sketches and its respective chunk ID, and sorting the entries of the data structure based on the sketches of the plurality of data chunks of the storage system, including determining that a first sketch of the sketches includes a first feature and a second feature, sorting the entries of the data structure based on the first feature, identifying a subset of the entries of the data structure that are associated with the first feature, and sorting the subset of the entries of the data structure based on the second feature, wherein the similar data chunks of the storage system are rearranged based on the sorted entries such that similar data chunks of the storage system are positioned close to each other, associating a first portion of the similar data chunks as a first group with a first storage container, associating with the first storage container one or more data chunks that are dissimilar to the first group but are likely accessed together, compressing the first group of the similar data chunks and its associated dissimilar data chunks in a first compression region of the first storage container, wherein the first storage container contains a plurality of compression regions, each compression region storing a plurality of data chunks and is represented by a region sketch that is generated based on sketches of the plurality of data chunks stored therein for purposes of identifying similar data chunks, wherein the region sketch is generated by one or more selected super features for the container, wherein the one or more selected super features includes;
a maximum chunk super feature, or a minimum chunk super feature, andstoring the first storage container in a persistent storage device of the storage system that stores a plurality of storage containers, wherein a data chunk stored in the persistent storage device is accessed by loading an entire compression region of a container associated with the data chunk into a memory, such that a number of input and output (TO) transactions is reduced. - View Dependent Claims (22, 23, 24, 25)
-
Specification