
Deduplication using sub-chunk fingerprints

  • US 10,135,462 B1
  • Filed: 06/13/2012
  • Issued: 11/20/2018
  • Est. Priority Date: 06/13/2012
  • Status: Active Grant
First Claim

1. A computer-implemented method for storing sub-chunks in a data storage system, the method comprising:

    selecting a data chunk comprising sub-chunks, the selected data chunk having a set of fingerprints and a set of sketches corresponding to the sub-chunks of the selected data chunk;

    generating a sketch for the selected data chunk;

    searching a set of candidate data chunks using the sketch of the selected data chunk to identify a base data chunk;

    ranking the set of candidate data chunks with at least a minimum degree of similarity by location status data, wherein the location status data indicates a location and status of the base data chunk as in any one of a compressed in a cache status, a decompressed in a cache status, or a compressed in a data storage status, and wherein ranking the set of candidate data chunks using location status data for each candidate prefers a compressed in a cache status over a compressed in a data storage status;

    loading a set of fingerprints and a set of sketches corresponding to sub-chunks of a similar data chunk based on the ranking;

    for each sub-chunk of the selected data chunk;

    searching the set of fingerprints of the similar data chunk to find a match to the fingerprint of the sub-chunk;

    encoding the sub-chunk as a reference to a sub-chunk of the similar data chunk, in response to determining that the fingerprint of the sub-chunk is identical to the fingerprint of the sub-chunk of the similar data chunk;

    in response to determining that the fingerprint of the sub-chunk is not identical to a fingerprint in the set of fingerprints of the similar chunk;

    searching the set of sketches of the similar data chunk corresponding to the sub-chunks of the similar data chunk to find a sketch that is similar to the sketch of the sub-chunk; and

    delta-encoding the sub-chunk as a reference to a sub-chunk of the similar data chunk and delta-encoding metadata, in response to determining that the sketch of the sub-chunk is similar to the sketch of the sub-chunk of the similar data chunk;

    otherwise, storing the sub-chunk in an unencoded form.
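
The per-sub-chunk decision recited in the claim (exact fingerprint match, then sketch match, then unencoded storage) can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name here (`SubChunk`, `Candidate`, `LocationStatus`, `rank_candidates`, `encode_chunk`, `make_delta`, `apply_delta`) is hypothetical. The sketch is modeled as an opaque integer where equal values count as "similar", the fingerprint is assumed to be SHA-1, and the delta format is a toy built on `difflib`. The claim only requires that compressed-in-cache be preferred over compressed-in-storage, so ranking decompressed-in-cache first is an added assumption. The chunk-level sketch search that produces the candidate set is omitted.

```python
import difflib
import hashlib
from dataclasses import dataclass
from enum import IntEnum


class LocationStatus(IntEnum):
    # Lower value = preferred. Only "compressed in cache beats compressed in
    # storage" comes from the claim; the position of the first entry is assumed.
    DECOMPRESSED_IN_CACHE = 0
    COMPRESSED_IN_CACHE = 1
    COMPRESSED_IN_STORAGE = 2


@dataclass(frozen=True)
class SubChunk:
    data: bytes
    sketch: int  # hypothetical resemblance value; equal sketches mean "similar"

    @property
    def fingerprint(self) -> str:
        # Assumed fingerprint function; the claim does not name a hash.
        return hashlib.sha1(self.data).hexdigest()


@dataclass
class Candidate:
    name: str
    status: LocationStatus
    sub_chunks: list


def rank_candidates(candidates):
    """Order candidates that already passed the similarity threshold by location status."""
    return sorted(candidates, key=lambda c: c.status)


def make_delta(base: bytes, target: bytes):
    """Toy delta: copy ranges from the base plus literal inserts."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:  # "replace" or "insert": emit the new bytes literally
            ops.append(("insert", target[j1:j2]))
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target sub-chunk from the base sub-chunk and a toy delta."""
    out = bytearray()
    for op in ops:
        out += base[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)


def encode_chunk(sub_chunks, similar_chunk):
    """Encode each sub-chunk against the loaded fingerprints and sketches."""
    by_fingerprint = {sc.fingerprint: i for i, sc in enumerate(similar_chunk.sub_chunks)}
    by_sketch = {sc.sketch: i for i, sc in enumerate(similar_chunk.sub_chunks)}
    encoded = []
    for sc in sub_chunks:
        if sc.fingerprint in by_fingerprint:
            # Identical fingerprint: store only a reference.
            encoded.append(("ref", by_fingerprint[sc.fingerprint]))
        elif sc.sketch in by_sketch:
            # No exact match but a similar sketch: reference plus delta.
            i = by_sketch[sc.sketch]
            delta = make_delta(similar_chunk.sub_chunks[i].data, sc.data)
            encoded.append(("delta", i, delta))
        else:
            # Neither matched: keep the sub-chunk unencoded.
            encoded.append(("raw", sc.data))
    return encoded


if __name__ == "__main__":
    cached = Candidate("cached", LocationStatus.COMPRESSED_IN_CACHE,
                       [SubChunk(b"alpha", 1), SubChunk(b"bravo-v1", 2)])
    on_disk = Candidate("on_disk", LocationStatus.COMPRESSED_IN_STORAGE, [])
    best = rank_candidates([on_disk, cached])[0]
    incoming = [SubChunk(b"alpha", 1), SubChunk(b"bravo-v2", 2), SubChunk(b"charlie", 3)]
    for entry in encode_chunk(incoming, best):
        print(entry[0])
```

In this illustration the three incoming sub-chunks take the three different paths: the first is stored as a reference, the second as a reference plus delta, and the third unencoded. A production system would replace the integer sketch with a real resemblance feature and the `difflib` delta with a compact binary delta format.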
