SCALABLE DEDUPLICATION OF STORED DATA

US 20100042790A1
Filed: 08/12/2008
Published: 02/18/2010
Est. Priority Date: 08/12/2008
Status: Active Grant

First Claim

Patent Images

1. A method of scalable deduplication, the method comprising:

identifying a plurality of data blocks in a set of data to be deduplicated;

generating a fingerprint of each of the plurality of data blocks, the fingerprints for use in detecting duplicate data blocks;

defining the set of data to be duplicated as containing a plurality of logical partitions;

assigning each of the plurality of data blocks to exactly one of the logical partitions, including ensuring that data blocks that are duplicates of each other are assigned to the same logical partition by using a portion of the fingerprint of each of the data blocks to determine the logical partition to which the data block should be assigned; and

deduplicating data blocks independently in each of the plurality of logical partitions.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

In a method and apparatus for scalable deduplication, a data set is partitioned into multiple logical partitions, where each partition can be deduplicated independently. Each data block of the data set is assigned to exactly one partition, so that any two or more data blocks that are duplicates of each are always be assigned to the same logical partition. A hash algorithm generates a fingerprint of each data block in the volume, and the fingerprints are subsequently used to detect possible duplicate data blocks as part of deduplication. In addition, the fingerprints are used to ensure that duplicate data blocks are sent to the same logical partition, prior to deduplication. A portion of the fingerprint of each data block is used as a partition identifier to determine the partition to which the data block should be assigned. Once blocks are assigned to partitions, deduplication can be done on partitions independently.

Citations

20 Claims

1. A method of scalable deduplication, the method comprising:
- identifying a plurality of data blocks in a set of data to be deduplicated;
  
  generating a fingerprint of each of the plurality of data blocks, the fingerprints for use in detecting duplicate data blocks;
  
  defining the set of data to be duplicated as containing a plurality of logical partitions;
  
  assigning each of the plurality of data blocks to exactly one of the logical partitions, including ensuring that data blocks that are duplicates of each other are assigned to the same logical partition by using a portion of the fingerprint of each of the data blocks to determine the logical partition to which the data block should be assigned; and
  
  deduplicating data blocks independently in each of the plurality of logical partitions.
- View Dependent Claims (2, 3, 4, 5, 6)
- - 2. A method as recited in claim 1, wherein said deduplicating comprises using the fingerprints of the plurality of data blocks to detect possible duplicate data blocks.
  - 3. A method as recited in claim 1, wherein using a portion of the fingerprint of each of the data blocks to determine the logical partition to which the data block should be assigned comprises assigning each said data block to the logical partition that has a partition identifier which matches a predefined portion of the fingerprint of the data block.
  - 4. A method as recited in claim 3, wherein each said fingerprint includes a plurality of bits, and wherein the predefined portion of the fingerprint is a predetermined number of most significant bits of the fingerprint.
  - 5. A method as recited in claim 3, wherein each said fingerprint includes a plurality of bits, and wherein the predefined portion of the fingerprint is a predetermined number of least significant bits of the fingerprint.
  - 6. A method as recited in claim 3, wherein each said fingerprint includes a plurality of bits, and wherein the predefined portion of the fingerprint is a predetermined number of bits excluding the most significant bit and the least significant bit of the fingerprint.

7. A method of scalable deduplication, the method comprising:
- identifying a plurality of data blocks in a set of data to be deduplicated in a network storage system;
  
  using a hash algorithm to generate a fingerprint of each of the plurality of data blocks, each said fingerprint including a plurality of bits;
  
  storing the fingerprints of the data blocks in a log, including defining the log to include a first plurality of partitions, assigning each of the first plurality of partitions an identifier, and assigning each fingerprint to exactly one of the first plurality of partitions by assigning each fingerprint to the partition that has an identifier which matches a predefined subset of the plurality of bits of the fingerprint;
  
  defining the set of data to be duplicated as including a plurality of logical partitions, each of the plurality of logical partitions corresponding to a different one of the first plurality of partitions;
  
  associating each of the plurality of data blocks with exactly one of the plurality of logical partitions, including ensuring that data blocks that are duplicates of each other are assigned to the same logical partition, by assigning each of the data blocks to the one of the plurality of logical partitions which corresponds to the one of the first plurality of partitions that contains the fingerprint of the data block; and
  
  deduplicating data blocks independently in each of the plurality of logical partitions, including using the fingerprints of the plurality of data blocks to detect possible duplicate data blocks.
- View Dependent Claims (8, 9, 10)
- - 8. A method as recited in claim 7, wherein the predefined subset of the fingerprint is a predetermined number of most significant bits of the fingerprint.
  - 9. A method as recited in claim 7, wherein the predefined subset of the fingerprint is a predetermined number of least significant bits of the fingerprint.
  - 10. A method as recited in claim 7, wherein the predefined subset of the fingerprint is a predetermined number of bits excluding the most significant bit and the least significant bit of the fingerprint.

11. A network storage system comprisinga processor;
- a network interface through which to communicate with a remote host via a communication link;
  
  a persistent storage subsystem to store a set of data; and
  
  a memory storing code which, when executed by the processor, causes the network storage system to perform a process that comprises;
  
  identifying a plurality of data blocks in the set of data;
  
  generating a fingerprint of each of the plurality of data blocks;
  
  defining the set of data as containing a plurality of logical partitions;
  
  assigning each of the plurality of data blocks to exactly one of the logical partitions, including ensuring that data blocks that are duplicates of each other are assigned to the same logical partition by using a portion of the fingerprint of each of the data blocks to determine the logical partition to which the data block should be assigned; and
  
  deduplicating data blocks independently in each of the plurality of logical partitions, including using the fingerprints of the plurality of data blocks to detect possible duplicate data blocks.
- View Dependent Claims (12, 13, 14, 15)
- - 12. A network storage system as recited in claim 11, wherein using a portion of the fingerprint of each of the data blocks to determine the logical partition to which the data block should be assigned comprises assigning each said data block to the logical partition that has a partition identifier which matches a predefined portion of the fingerprint of the data block.
  - 13. A network storage system as recited in claim 12, wherein each said fingerprint includes a plurality of bits, and wherein the predefined portion of the fingerprint is a predetermined number of most significant bits of the fingerprint.
  - 14. A network storage system as recited in claim 11, wherein each said fingerprint includes a plurality of bits, and wherein the predefined portion of the fingerprint is a predetermined number of least significant bits of the fingerprint.
  - 15. A network storage system as recited in claim 11, wherein each said fingerprint includes a plurality of bits, and wherein the predefined portion of the fingerprint is a predetermined number of bits excluding the most significant bit and the least significant bit of the fingerprint.

16. A system comprising:
- a storage operating system to manage data access operations directed to data stored in a persistent storage facility by a remote host; and
  
  ,a deduplication subsystem, the deduplication subsystem includinga gatherer to identify data blocks that have been modified, of the data;
  
  a fingerprint generator to generate a fingerprint of each of the data blocks;
  
  a fingerprint manager to store each of the fingerprints in a log;
  
  to define the log to include a first plurality of partitions;
  
  to assign each of the first plurality of partitions an identifier; and
  
  to assign each fingerprint to exactly one of the first plurality of partitions by assigning each fingerprint to the partition that has an identifier which matches a predefined subset of the plurality of bits of the fingerprint; and
  
  a deduplication engine to define the data to include a plurality of logical partitions, each of the plurality of logical partitions corresponding to a different one of the first plurality of partitions;
  
  to associate each of the plurality of data blocks with exactly one of the plurality of logical partitions, including ensuring that data blocks that are duplicates of each other are assigned to the same logical partition, by assigning each of the data blocks to the one of the plurality of logical partitions which corresponds to the one of the first plurality of partitions that contains the fingerprint of the data block; and
  
  to deduplicate data blocks of the data independently in each of the plurality of logical partitions.
- View Dependent Claims (17, 18, 19, 20)
- - 17. A system as recited in claim 16, wherein the deduplication engine uses the fingerprints of the plurality of data blocks to detect duplicate data blocks.
  - 18. A system as recited in claim 16, wherein the predefined subset of the fingerprint is a predetermined number of most significant bits of the fingerprint.
  - 19. A system as recited in claim 16, wherein the predefined subset of the fingerprint is a predetermined number of least significant bits of the fingerprint.
  - 20. A system as recited in claim 16, wherein the predefined subset of the fingerprint is a predetermined number of bits excluding the most significant bit and the least significant bit of the fingerprint.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
NetApp, Inc.
Original Assignee
NetApp, Inc.
Inventors
Mondal, Shishir, Killamshetti, Praveen

Granted Patent

US 8,086,799 B2
Time in Patent Office

Days
Field of Search
US Class Current

711/161
CPC Class Codes

G06F 11/1453   using de-duplication of the...

G06F 3/0608   Saving storage space on sto...

G06F 3/061   Improving I/O performance

G06F 3/0641   De-duplication techniques

G06F 3/0689   Disk arrays, e.g. RAID, JBOD

SCALABLE DEDUPLICATION OF STORED DATA

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

Citations

20 Claims

Specification

Solutions

Use Cases

Quick Links

SCALABLE DEDUPLICATION OF STORED DATA

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

20 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links