Deduplication featuring variable-size duplicate data detection and fixed-size data segment sharing
Abstract
A hybrid deduplication system operates to detect variable-sized deduplication matches, while performing the storage deduplication on fixed-size segments of data. The hybrid deduplication system calculates unique identifiers for variable-sized sections of data within a data stream being written to a deduplicated data store. The hybrid deduplication system then compares those newly-calculated identifiers to identifiers of variable-sized sections of data that have already been stored within the deduplicated data store. If a match is found, the hybrid deduplication system identifies the location of each of the fixed-size data segment(s), already stored in the deduplicated data store, that include the identified variable-sized section of data. Instead of writing the sections that match already-existing sections to the deduplicated data store, the hybrid deduplication system simply causes the creation of a reference to the identified storage locations, indicating that the data stream being written includes the data in these pre-existing storage locations.
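The flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class `HybridStore`, the helper `split_sections`, the newline-based boundary rule, and the use of SHA-256 as the identifier are all hypothetical choices made for the example. The backing store is modeled as one contiguous byte array that would be viewed as fixed-size segments, and each reference is an (offset, length) pair into it.

```python
import hashlib

def split_sections(data: bytes, delimiter: int = 0x0A, max_len: int = 64):
    """Toy variable-size sectioning (assumption): cut after each delimiter
    byte, or once a section reaches max_len bytes."""
    sections, start = [], 0
    for i, b in enumerate(data):
        if b == delimiter or i - start + 1 >= max_len:
            sections.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        sections.append(data[start:])  # trailing bytes with no delimiter
    return sections

class HybridStore:
    """Hypothetical deduplicated store: matches are detected on
    variable-size sections, data lives in one shared byte array."""

    def __init__(self):
        self.store = bytearray()  # backing data, shared by all streams
        self.index = {}           # section identifier -> (offset, length)

    def write_stream(self, data: bytes):
        """Return the reference stream for data as (offset, length) pairs."""
        refs = []
        for section in split_sections(data):
            ident = hashlib.sha256(section).digest()
            loc = self.index.get(ident)
            if loc is None:
                # No match: write the section and remember where it went.
                loc = (len(self.store), len(section))
                self.store += section
                self.index[ident] = loc
            # Match or not, the stream is described by a reference.
            refs.append(loc)
        return refs

    def read_stream(self, refs):
        """Rebuild a stream from its reference stream."""
        return b"".join(bytes(self.store[o:o + n]) for o, n in refs)
```

In this sketch, writing `b"alpha\nbeta\n"` and then `b"gamma\nbeta\n"` stores the section `beta\n` only once; the second reference stream reuses the location recorded for it during the first write.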
23 Claims
1. A method comprising:
storing a plurality of fixed-size data segments on a storage device;
calculating a plurality of stored identifiers, wherein a first stored identifier of the plurality of stored identifiers identifies a sub-portion of a first fixed-size data segment of the plurality of fixed-size data segments;
calculating a plurality of identifiers for respective sections of a data stream generated by a client, in response to detecting that the data stream is being written, or is selected to be written, to the storage device, wherein the data stream comprises two variable-length data segments, and the plurality of identifiers comprise a first identifier for a first section of the data stream;
detecting that the first identifier matches a first stored identifier; and
in response to the detecting, causing an additional reference to be generated instead of writing the first section of the data stream to the storage device as part of a deduplicated data stream, wherein the deduplicated data stream is associated with a reference stream, the additional reference is included as part of the reference stream, the additional reference identifies the sub-portion of the first fixed-size data segment as part of the data stream, and the first fixed-size data segment has a different length than the first section of the data stream, and the calculating, the detecting, and the causing are performed by a computing device implementing a deduplication module, wherein the reference stream identifies every fixed-size data segment of the plurality of fixed-size data segments that comprises at least one portion of a variable-length data segment of the two variable-length data segments even if the first fixed-size data segment comprises data that is not part of the variable-length data segment.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A non-transitory computer readable storage medium comprising program instructions executable by one or more processors to:
store a plurality of fixed-size data segments on a storage device;
calculate a plurality of stored identifiers, wherein a first stored identifier of the plurality of stored identifiers identifies a sub-portion of a first fixed-size data segment of the plurality of fixed-size data segments;
calculate a plurality of identifiers for respective sections of a data stream generated by a client, in response to detecting that the data stream is being written, or is selected to be written, to the storage device, wherein the data stream comprises two variable-length data segments, and the plurality of identifiers comprise a first identifier for a first section of the data stream;
detect that the first identifier matches a first stored identifier; and
in response to the detecting, causing an additional reference to be generated instead of writing the first section of the data stream to the storage device as part of a deduplicated data stream, wherein the deduplicated data stream is associated with a reference stream, the additional reference is included as part of the reference stream, the additional reference identifies the sub-portion of the first fixed-size data segment as part of the data stream, and the first fixed-size data segment has a different length than the first section of the data stream, and the calculating, the detecting, and the causing are performed by a computing device implementing a deduplication module, wherein the reference stream identifies every fixed-size data segment of the plurality of fixed-size data segments that comprises at least one portion of a variable-length data segment of the two variable-length data segments even if the first fixed-size data segment comprises data that is not part of the variable-length data segment.
Dependent claims: 11, 12, 13, 14, 15, 16
17. A system comprising:
one or more processors; and
a memory storing program instructions executable by the one or more processors to:
store a plurality of fixed-size data segments on a storage device;
calculate a plurality of stored identifiers, wherein a first stored identifier of the plurality of stored identifiers identifies a sub-portion of a first fixed-size data segment of the plurality of fixed-size data segments;
calculate a plurality of identifiers for respective sections of a data stream generated by a client, in response to detecting that the data stream is being written, or is selected to be written, to the storage device, wherein the data stream comprises two variable-length data segments, and the plurality of identifiers comprise a first identifier for a first section of the data stream;
detect that the first identifier matches a first stored identifier; and
in response to the detecting, causing an additional reference to be generated instead of writing the first section of the data stream to the storage device as part of a deduplicated data stream, wherein the deduplicated data stream is associated with a reference stream, the additional reference is included as part of the reference stream, the additional reference identifies the sub-portion of the first fixed-size data segment as part of the data stream, and the first fixed-size data segment has a different length than the first section of the data stream, and the calculating, the detecting, and the causing are performed by a computing device implementing a deduplication module, wherein the reference stream identifies every fixed-size data segment of the plurality of fixed-size data segments that comprises at least one portion of a variable-length data segment of the two variable-length data segments even if the first fixed-size data segment comprises data that is not part of the variable-length data segment.
Dependent claims: 18, 19, 20, 21, 22
23. A method comprising:
storing a plurality of fixed-size data segments on a storage device, wherein the plurality of fixed-size data segments comprises a first backup data stream associated with a first client;
calculating a plurality of stored identifiers, wherein a first stored identifier of the plurality of stored identifiers identifies a sub-portion of a first fixed-size data segment of the plurality of fixed-size data segments;
calculating a plurality of identifiers for respective sections of a backup data stream generated by a client in response to detecting that the second backup data stream is being written, or is selected to be written, to the storage device, wherein the data stream comprises two variable-length data segments, and wherein the second backup data stream is associated with a second client, and wherein the plurality of identifiers comprise a first identifier for a first section of the second backup data stream;
detecting that the first identifier matches the first stored identifier; and
in response to the detecting, causing an additional reference to be generated instead of writing the first section of a second backup data stream to the storage device as part of a deduplicated data stream, wherein the deduplicated data stream is associated with a reference stream and the additional reference is included as part of the reference stream, wherein the additional reference identifies the sub-portion of the first fixed-size data segment as part of the second backup data stream, wherein the first fixed-size data segment has a different length than the first section of the second backup data stream, and wherein the calculating, the detecting, and the causing are performed by a computing device implementing a deduplication module, wherein the reference stream identifies every fixed-size data segment of the plurality of fixed-size data segments that comprises at least one portion of a variable-length data segment of the two variable-length data segments even if the first fixed-size data segment comprises data that is not part of the variable-length data segment.
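The final limitation shared by the independent claims, that the reference stream identifies every fixed-size segment holding any portion of a matched variable-length section, comes down to simple index arithmetic. The sketch below is illustrative only: the function names `covering_segments` and `make_reference`, the dictionary layout of a reference, and the 4096-byte segment size used in the usage note are assumptions, not details taken from the claims.

```python
from typing import Dict, List

def covering_segments(offset: int, length: int, segment_size: int) -> List[int]:
    """Indices of every fixed-size segment that holds at least one byte
    of the section starting at offset and spanning length bytes."""
    if offset < 0 or length <= 0 or segment_size <= 0:
        raise ValueError("offset must be >= 0; length and segment_size must be > 0")
    first = offset // segment_size
    last = (offset + length - 1) // segment_size  # segment holding the last byte
    return list(range(first, last + 1))

def make_reference(offset: int, length: int, segment_size: int) -> Dict[str, object]:
    """Hypothetical reference record: the covering segments plus where the
    section begins inside the first one. A listed segment may also hold
    bytes that are not part of the section."""
    return {
        "segments": covering_segments(offset, length, segment_size),
        "start_in_first": offset % segment_size,
        "length": length,
    }
```

With an assumed segment size of 4096 bytes, a 2000-byte section stored at offset 3000 yields segments 0 and 1. Both are referenced even though each also contains bytes outside the section, and the section's length (2000) differs from the segment length (4096).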
Specification