SCALABLE DEDUPLICATION OF STORED DATA
First Claim
1. A method of scalable deduplication, the method comprising:
- identifying a plurality of data blocks in a set of data to be deduplicated;
generating a fingerprint of each of the plurality of data blocks, the fingerprints for use in detecting duplicate data blocks;
defining the set of data to be duplicated as containing a plurality of logical partitions;
assigning each of the plurality of data blocks to exactly one of the logical partitions, including ensuring that data blocks that are duplicates of each other are assigned to the same logical partition by using a portion of the fingerprint of each of the data blocks to determine the logical partition to which the data block should be assigned; and
deduplicating data blocks independently in each of the plurality of logical partitions.
1 Assignment
0 Petitions
Accused Products
Abstract
In a method and apparatus for scalable deduplication, a data set is partitioned into multiple logical partitions, where each partition can be deduplicated independently. Each data block of the data set is assigned to exactly one partition, so that any two or more data blocks that are duplicates of each are always be assigned to the same logical partition. A hash algorithm generates a fingerprint of each data block in the volume, and the fingerprints are subsequently used to detect possible duplicate data blocks as part of deduplication. In addition, the fingerprints are used to ensure that duplicate data blocks are sent to the same logical partition, prior to deduplication. A portion of the fingerprint of each data block is used as a partition identifier to determine the partition to which the data block should be assigned. Once blocks are assigned to partitions, deduplication can be done on partitions independently.
-
Citations
20 Claims
-
1. A method of scalable deduplication, the method comprising:
-
identifying a plurality of data blocks in a set of data to be deduplicated; generating a fingerprint of each of the plurality of data blocks, the fingerprints for use in detecting duplicate data blocks; defining the set of data to be duplicated as containing a plurality of logical partitions; assigning each of the plurality of data blocks to exactly one of the logical partitions, including ensuring that data blocks that are duplicates of each other are assigned to the same logical partition by using a portion of the fingerprint of each of the data blocks to determine the logical partition to which the data block should be assigned; and deduplicating data blocks independently in each of the plurality of logical partitions. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A method of scalable deduplication, the method comprising:
-
identifying a plurality of data blocks in a set of data to be deduplicated in a network storage system; using a hash algorithm to generate a fingerprint of each of the plurality of data blocks, each said fingerprint including a plurality of bits; storing the fingerprints of the data blocks in a log, including defining the log to include a first plurality of partitions, assigning each of the first plurality of partitions an identifier, and assigning each fingerprint to exactly one of the first plurality of partitions by assigning each fingerprint to the partition that has an identifier which matches a predefined subset of the plurality of bits of the fingerprint; defining the set of data to be duplicated as including a plurality of logical partitions, each of the plurality of logical partitions corresponding to a different one of the first plurality of partitions; associating each of the plurality of data blocks with exactly one of the plurality of logical partitions, including ensuring that data blocks that are duplicates of each other are assigned to the same logical partition, by assigning each of the data blocks to the one of the plurality of logical partitions which corresponds to the one of the first plurality of partitions that contains the fingerprint of the data block; and deduplicating data blocks independently in each of the plurality of logical partitions, including using the fingerprints of the plurality of data blocks to detect possible duplicate data blocks. - View Dependent Claims (8, 9, 10)
-
-
11. A network storage system comprising
a processor; -
a network interface through which to communicate with a remote host via a communication link; a persistent storage subsystem to store a set of data; and a memory storing code which, when executed by the processor, causes the network storage system to perform a process that comprises; identifying a plurality of data blocks in the set of data; generating a fingerprint of each of the plurality of data blocks; defining the set of data as containing a plurality of logical partitions; assigning each of the plurality of data blocks to exactly one of the logical partitions, including ensuring that data blocks that are duplicates of each other are assigned to the same logical partition by using a portion of the fingerprint of each of the data blocks to determine the logical partition to which the data block should be assigned; and deduplicating data blocks independently in each of the plurality of logical partitions, including using the fingerprints of the plurality of data blocks to detect possible duplicate data blocks. - View Dependent Claims (12, 13, 14, 15)
-
-
16. A system comprising:
-
a storage operating system to manage data access operations directed to data stored in a persistent storage facility by a remote host; and
,a deduplication subsystem, the deduplication subsystem including a gatherer to identify data blocks that have been modified, of the data; a fingerprint generator to generate a fingerprint of each of the data blocks; a fingerprint manager to store each of the fingerprints in a log;
to define the log to include a first plurality of partitions;
to assign each of the first plurality of partitions an identifier; and
to assign each fingerprint to exactly one of the first plurality of partitions by assigning each fingerprint to the partition that has an identifier which matches a predefined subset of the plurality of bits of the fingerprint; anda deduplication engine to define the data to include a plurality of logical partitions, each of the plurality of logical partitions corresponding to a different one of the first plurality of partitions;
to associate each of the plurality of data blocks with exactly one of the plurality of logical partitions, including ensuring that data blocks that are duplicates of each other are assigned to the same logical partition, by assigning each of the data blocks to the one of the plurality of logical partitions which corresponds to the one of the first plurality of partitions that contains the fingerprint of the data block; and
to deduplicate data blocks of the data independently in each of the plurality of logical partitions. - View Dependent Claims (17, 18, 19, 20)
-
Specification