×

System and method for balancing compression and read performance in a storage system

  • US 10,216,754 B1
  • Filed: 09/26/2013
  • Issued: 02/26/2019
  • Est. Priority Date: 09/26/2013
  • Status: Active Grant
First Claim
Patent Images

1. A computer-implemented method for balancing data compression and read performance of data chunks of a storage system, the method comprising:

  • identifying similar data chunks based on sketches of a plurality of data chunks stored in the storage system;

    ordering the similar data chunks of the storage system to be positioned close to each other byscanning a metadata to retrieve chunk identifiers (IDs) and sketches of the plurality of data chunks, wherein each sketch includes a plurality of super features, each super feature being based on hashing one or more concatenated maximum hashes or minimum hashes of sub-regions of the corresponding data chunk,storing the chunk IDs and sketches in a data structure, wherein the data structure includes a plurality of entries, each corresponding to one of the sketches and its respective chunk ID, andsorting the entries of the data structure based on the sketches of the plurality of data chunks of the storage system, includingdetermining that a first sketch of the sketches includes a first feature and a second feature,sorting the entries of the data structure based on the first feature,identifying a subset of the entries of the data structure that are associated with the first feature, andsorting the subset of the entries of the data structure based on the second feature, wherein the similar data chunks of the storage system are rearranged based on the sorted entries such that similar data chunks of the storage system are positioned close to each other;

    associating a first portion of the similar data chunks as a first group with a first storage container;

    associating with the first storage container one or more data chunks that are dissimilar to the first group but are likely accessed together;

    compressing the first group of the similar data chunks and its associated dissimilar data chunks in a first compression region of the first storage container, wherein the first storage container contains a plurality of compression regions, each compression region storing a plurality of data chunks and is represented by a region sketch that is generated based on sketches of the plurality of data chunks stored therein for purposes of identifying similar data chunks, wherein the region sketch is generated by one or more selected super features for the container, wherein the one or more selected super features includes;

    a maximum chunk super feature, or a minimum chunk super feature; and

    storing the first storage container in a persistent storage device of the storage system that stores a plurality of storage containers, wherein a data chunk stored in the persistent storage device is accessed by loading an entire compression region of a container associated with the data chunk into a memory, such that a number of input and output (TO) transactions is reduced.

View all claims
  • 11 Assignments
Timeline View
Assignment View
    ×
    ×