Deduplication using sub-chunk fingerprints
First Claim
Patent Images
1. A computer-implemented method for storing sub-chunks in a data storage system, the method comprising:
- selecting a data chunk comprising sub-chunks, the selected data chunk having a set of fingerprints and a set of sketches corresponding to the sub-chunks of the selected data chunk;
generating a sketch for the selected data chunk;
searching a set of candidate data chunks using the sketch of the selected data chunk to identify a base data chunk;
ranking the set of candidate data chunks with at least a minimum degree of similarity by location status data, wherein the location status data indicates a location and status of the base data chunk as in any one of a compressed in a cache status, a decompressed in a cache status, or a compressed in a data storage status, and wherein ranking the set of candidate data chunks using location status data for each candidate prefers a compressed in a cache status over a compressed in a data storage status;
loading a set of fingerprints and a set of sketches corresponding to sub-chunks of a similar data chunk based on the ranking;
for each sub-chunk of the selected data chunk;
searching the set of fingerprints of the similar data chunk to find a match to the fingerprint of the sub-chunk;
encoding the sub-chunk as a reference to a sub-chunk of the similar data chunk, in response to determining that the fingerprint of the sub-chunk is identical to the fingerprint of the sub-chunk of the similar data chunk;
in response to determining that the fingerprint of the sub-chunk is not identical to a fingerprint in the set of fingerprints of the similar chunk;
searching the set of sketches of the similar data chunk corresponding to the sub-chunks of the similar data chunk to find a sketch that is similar to the sketch of the sub-chunk; and
delta-encoding the sub-chunk as a reference to a sub-chunk of the similar data chunk and delta-encoding metadata, in response to determining that the sketch of the sub-chunk is similar to the sketch of the sub-chunk of the similar data chunk;
otherwise, storing the sub-chunk in an unencoded form.
9 Assignments
0 Petitions
Accused Products
Abstract
A computer-implemented method and system for deduplicating sub-chunks in a data storage system selects a data chunk to deduplicate and generates a sketch for the selected data chunk. A similar data chunk is searched for using the sketch. A set of fingerprints corresponding to sub-chunks of the similar data chunk is loaded. The set of fingerprints for the similar data chunk is compared to a set of fingerprints of the selected data chunk and the selected chunk is encoded as a set of references to identical sub-chunks of the similar data chunk and at least one unmatched sub-chunk.
-
Citations
25 Claims
-
1. A computer-implemented method for storing sub-chunks in a data storage system, the method comprising:
-
selecting a data chunk comprising sub-chunks, the selected data chunk having a set of fingerprints and a set of sketches corresponding to the sub-chunks of the selected data chunk; generating a sketch for the selected data chunk; searching a set of candidate data chunks using the sketch of the selected data chunk to identify a base data chunk; ranking the set of candidate data chunks with at least a minimum degree of similarity by location status data, wherein the location status data indicates a location and status of the base data chunk as in any one of a compressed in a cache status, a decompressed in a cache status, or a compressed in a data storage status, and wherein ranking the set of candidate data chunks using location status data for each candidate prefers a compressed in a cache status over a compressed in a data storage status; loading a set of fingerprints and a set of sketches corresponding to sub-chunks of a similar data chunk based on the ranking; for each sub-chunk of the selected data chunk; searching the set of fingerprints of the similar data chunk to find a match to the fingerprint of the sub-chunk; encoding the sub-chunk as a reference to a sub-chunk of the similar data chunk, in response to determining that the fingerprint of the sub-chunk is identical to the fingerprint of the sub-chunk of the similar data chunk; in response to determining that the fingerprint of the sub-chunk is not identical to a fingerprint in the set of fingerprints of the similar chunk; searching the set of sketches of the similar data chunk corresponding to the sub-chunks of the similar data chunk to find a sketch that is similar to the sketch of the sub-chunk; and delta-encoding the sub-chunk as a reference to a sub-chunk of the similar data chunk and delta-encoding metadata, in response to determining that the sketch of the sub-chunk is similar to the sketch of the sub-chunk of the similar data chunk; otherwise, storing the sub-chunk in an unencoded form. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 23, 24, 25)
-
-
10. A non-transitory computer-readable storage medium having instructions stored therein, which when executed by a computer, cause the computer to perform a method, the method for storing sub-chunks in a data storage system, the method comprising:
-
selecting a data chunk comprising sub-chunks, the selected data chunk having a set of fingerprints and a set of sketches corresponding to the sub-chunks of the selected data chunk; generating a sketch for the selected data chunk; searching a set of candidate data chunks using the sketch of the selected data chunk to identify a base data chunk; ranking the set of candidate data chunks with at least a minimum degree of similarity by location status data, wherein the location status data indicates a location and status of the base data chunk as in any one of a compressed in a cache status, a decompressed in a cache status, or a compressed in a data storage status, and wherein ranking the set of candidate data chunks using location status data for each candidate prefers a compressed in a cache status over a compressed in a data storage status; loading a set of fingerprints and a set of sketches corresponding to sub-chunks of a similar data chunk based on the ranking; for each sub-chunk of the selected data chunk; searching the set of fingerprints for the similar data chunk to find a match to the fingerprint of the sub-chunk; encoding the sub-chunk as a reference to a sub-chunk of the similar data chunk, in response to determining that the fingerprint of the sub-chunk is identical to the fingerprint of the sub-chunk of the similar data chunk; in response to determining that the fingerprint of the sub-chunk is not identical to a fingerprint in the set of fingerprints of the similar chunk; searching the set of sketches of the similar data chunk to find a sketch that is similar to the sketch of the sub-chunk of the selected data chunk; and delta-encoding the sub-chunk as a reference to the sub-chunk of the similar data chunk and delta-encoding metadata, in response to determining that the sketch of the sub-chunk is similar to the sketch of the sub-chunk of the similar data chunk; otherwise, storing the sub-chunk in an unencoded form. - View Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18)
-
-
19. A system, comprising:
-
a processor; a memory; and a storage module loaded in the memory and executed by the processor to perform operations, the operations including selecting a data chunk comprising sub-chunks, the selected data chunk having a set of fingerprints and a set of sketches corresponding to the sub-chunks of the selected data chunk, the data chunk selected from a first set of data chunks, generating a sketch for the data chunk of the selected data chunk, searching a set of candidate data chunks using the sketch of the selected data chunk to identify a base data chunk, ranking the set of candidate data chunks with at least a minimum degree of similarity by location status data, wherein the location status data indicates a location and status of the base data chunk as in any one of a compressed in a cache status, a decompressed in a cache status, or a compressed in a data storage status, and wherein ranking the set of candidate data chunks using location status data for each candidate prefers a compressed in a cache status over a compressed in a data storage status, loading a set of fingerprints and a set of sketches corresponding to sub-chunks of a similar data chunk based on the ranking, for each sub-chunk of the selected chunk; searching the set of fingerprints for the similar data chunk to find a match to the fingerprint of the sub-chunk; encoding the sub-chunk as a references to a sub-chunk of the similar data chunk, in response determining that the fingerprint of the sub-chunk is identical to the fingerprint of the sub-chunk of the similar data chunk, in response to determining that the fingerprint of the sub-chunk is not identical to a fingerprint in the set of fingerprints of the similar chunk; searching the set of sketches of the similar data chunk to find a sketch that is most similar to the sketch of the sub-chunk; and delta-encoding the sub-chunk as a reference to a sub-chunk of the similar data chunk and delta-encoding metadata, in response to determining that the sketch of the sub-chunk is similar to the sketch of the sub-chunk of the similar data chunk; otherwise, storing the sub-chunk in an unencoded form. - View Dependent Claims (20, 21, 22)
-
Specification