Deduplication featuring variable-size duplicate data detection and fixed-size data segment sharing
First Claim
1. A method comprising:
- receiving a data stream;
- dividing the data stream into a plurality of variable-sized sections;
- calculating a fingerprint for a first variable-sized section of the plurality of variable-sized sections, wherein the fingerprint is calculated as a function of all data within the first variable-sized section;
- determining whether the fingerprint matches a stored fingerprint;
- in response to determining that the fingerprint matches the stored fingerprint, identifying a fixed-length data segment in storage that contains a copy of data in the first variable-sized section, wherein the fixed-length data segment comprises the copy, and additional data that is not found in the first variable-sized section;
- replacing the first variable-sized section with a plurality of references including a reference to the fixed-length data segment; and
- updating a reference file to identify the fixed-length data segment, a first portion of data in the fixed-length data segment that corresponds to the copy, and a second portion of data in the fixed-length data segment that corresponds to the additional data.
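The last two steps of the claim, replacing a matched section with references and recording which portion of each fixed-length segment is the copy and which is additional data, can be pictured with a minimal Python sketch. Everything named here (SEGMENT_SIZE, SegmentReference, references_for_section, and the idea that a stored section is described by a byte offset and length in the store) is a hypothetical illustration, not the layout of the reference file described in the patent.

```python
from dataclasses import dataclass

SEGMENT_SIZE = 4096  # assumed fixed segment length in bytes (illustrative only)


@dataclass(frozen=True)
class SegmentReference:
    segment_id: int    # which fixed-length segment in storage
    copy_offset: int   # where the matching data starts inside that segment
    copy_length: int   # how many bytes of the segment belong to the section
    additional: tuple  # (start, end) ranges of the segment NOT in the section


def references_for_section(store_offset, length, segment_size=SEGMENT_SIZE):
    """Map a stored section (a byte range in the store) onto fixed-length segments.

    Returns one reference per segment the section touches; a section that
    crosses a segment boundary therefore yields a plurality of references.
    """
    refs = []
    end = store_offset + length
    pos = store_offset
    while pos < end:
        segment_id = pos // segment_size
        seg_start = segment_id * segment_size
        copy_end = min(end, seg_start + segment_size)
        copy_offset = pos - seg_start
        copy_length = copy_end - pos
        # Whatever lies before or after the copy in this segment is "additional data".
        candidates = ((0, copy_offset), (copy_offset + copy_length, segment_size))
        additional = tuple(r for r in candidates if r[0] < r[1])
        refs.append(SegmentReference(segment_id, copy_offset, copy_length, additional))
        pos = copy_end
    return refs
```

For example, under these assumptions a 300-byte section stored at offset 4000 would be described by two references: the last 96 bytes of segment 0 and the first 204 bytes of segment 1, with the remaining bytes of each segment marked as additional data.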
Abstract
A hybrid deduplication system operates to detect variable-sized deduplication matches, while performing the storage deduplication on fixed-size segments of data. The hybrid deduplication system calculates unique identifiers for variable-sized sections of data within a data stream being written to a deduplicated data store. The hybrid deduplication system then compares those newly-calculated identifiers to identifiers of variable-sized sections of data that have already been stored within the deduplicated data store. If a match is found, the hybrid deduplication system identifies the location of each of the fixed-size data segment(s), already stored in the deduplicated data store, that include the identified variable-sized section of data. Instead of writing the sections that match already-existing sections to the deduplicated data store, the hybrid deduplication system simply causes the creation of a reference to the identified storage locations, indicating that the data stream being written includes the data in these pre-existing storage locations.
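The flow described in the abstract can be sketched end to end in Python. This is an illustration under stated assumptions, not the patented implementation: the boundary test (a CRC-32 of a short trailing window), the SHA-256 fingerprint, the tiny SEGMENT_SIZE, and the names split_variable and HybridStore are all hypothetical choices, and the store is modeled as one contiguous byte array viewed in fixed-size slots.

```python
import hashlib
import zlib

SEGMENT_SIZE = 64            # assumed fixed segment length (tiny, for illustration)
WINDOW = 16                  # assumed number of trailing bytes used for the boundary test
MASK = 0x1F                  # assumed: cut when the low 5 bits of the window hash are zero
MIN_SIZE, MAX_SIZE = 16, 256  # assumed bounds on section size


def split_variable(data: bytes):
    """Content-defined split: cut where a hash of the last WINDOW bytes hits a pattern."""
    sections, start = [], 0
    for i in range(len(data)):
        size = i - start + 1
        if size < MIN_SIZE:
            continue
        window_hash = zlib.crc32(data[i + 1 - WINDOW:i + 1])
        if (window_hash & MASK) == 0 or size >= MAX_SIZE:
            sections.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        sections.append(data[start:])
    return sections


class HybridStore:
    def __init__(self):
        self.data = bytearray()  # deduplicated store, viewed as fixed-size segments
        self.index = {}          # fingerprint -> (offset, length) of a stored section

    def write(self, stream: bytes):
        """Return the stream as (segment_id, offset_in_segment, length) references."""
        refs = []
        for section in split_variable(stream):
            fp = hashlib.sha256(section).digest()  # depends on all data in the section
            if fp not in self.index:
                # No match: store the new section and remember where it lives.
                self.index[fp] = (len(self.data), len(section))
                self.data += section
            offset, length = self.index[fp]
            refs.extend(self._segment_refs(offset, length))
        return refs

    @staticmethod
    def _segment_refs(offset, length):
        """Express a byte range of the store as references into fixed-size segments."""
        refs, end = [], offset + length
        while offset < end:
            seg = offset // SEGMENT_SIZE
            take = min(end, (seg + 1) * SEGMENT_SIZE) - offset
            refs.append((seg, offset - seg * SEGMENT_SIZE, take))
            offset += take
        return refs

    def read(self, refs):
        """Rebuild a stream from its references."""
        out = bytearray()
        for seg, off, length in refs:
            base = seg * SEGMENT_SIZE + off
            out += self.data[base:base + length]
        return bytes(out)
```

Writing the same stream twice with this sketch leaves the store unchanged on the second write, because every section's fingerprint is already in the index and only references are produced. A production system would likely use a true rolling hash rather than rehashing the window at every byte; the simpler form is used here to keep the example short.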
9 Claims
1. A method comprising:
- receiving a data stream;
- dividing the data stream into a plurality of variable-sized sections;
- calculating a fingerprint for a first variable-sized section of the plurality of variable-sized sections, wherein the fingerprint is calculated as a function of all data within the first variable-sized section;
- determining whether the fingerprint matches a stored fingerprint;
- in response to determining that the fingerprint matches the stored fingerprint, identifying a fixed-length data segment in storage that contains a copy of data in the first variable-sized section, wherein the fixed-length data segment comprises the copy, and additional data that is not found in the first variable-sized section;
- replacing the first variable-sized section with a plurality of references including a reference to the fixed-length data segment; and
- updating a reference file to identify the fixed-length data segment, a first portion of data in the fixed-length data segment that corresponds to the copy, and a second portion of data in the fixed-length data segment that corresponds to the additional data.
Dependent claims: 2, 3
4. A non-transitory computer readable storage medium comprising program instructions executable by one or more processors to perform a method comprising:
- receiving a data stream;
- dividing the data stream into a plurality of variable-sized sections;
- calculating a fingerprint for a first variable-sized section of the plurality of variable-sized sections, wherein the fingerprint is calculated as a function of all data within the first variable-sized section;
- determining whether the fingerprint matches a stored fingerprint;
- in response to determining that the fingerprint matches the stored fingerprint, identifying a fixed-length data segment in storage that contains a copy of data in the first variable-sized section, wherein the fixed-length data segment comprises the copy, and additional data that is not found in the first variable-sized section;
- replacing the first variable-sized section with a plurality of references including a reference to the fixed-length data segment; and
- updating a reference file to identify the fixed-length data segment, a first portion of data in the fixed-length data segment that corresponds to the copy, and a second portion of data in the fixed-length data segment that corresponds to the additional data.
Dependent claims: 5, 6
7. A system comprising:
- one or more processors; and
- a memory storing program instructions executable by the one or more processors to perform a method comprising:
  - receiving a data stream;
  - dividing the data stream into a plurality of variable-sized sections;
  - calculating a fingerprint for a first variable-sized section of the plurality of variable-sized sections, wherein the fingerprint is calculated as a function of all data within the first variable-sized section;
  - determining whether the fingerprint matches a stored fingerprint;
  - in response to determining that the fingerprint matches the stored fingerprint, identifying a fixed-length data segment in storage that contains a copy of data in the first variable-sized section, wherein the fixed-length data segment comprises the copy, and additional data that is not found in the first variable-sized section;
  - replacing the first variable-sized section with a plurality of references including a reference to the fixed-length data segment; and
  - updating a reference file to identify the fixed-length data segment, a first portion of data in the fixed-length data segment that corresponds to the copy, and a second portion of data in the fixed-length data segment that corresponds to the additional data.
Dependent claims: 8, 9
Specification