Global de-duplication in shared architectures

US 8,190,835 B1
Filed: 12/31/2007
Issued: 05/29/2012
Est. Priority Date: 12/31/2007
Status: Active Grant

Abstract

Redundant data is globally de-duplicated across a shared architecture that includes a plurality of storage systems. The storage systems implement copy-on-write or WAFL to generate snapshots of original data. Each storage system includes a de-duplication client to identify and reduce redundant original and/or snapshot data on the storage system. Each de-duplication client can de-duplicate a digital sequence by breaking the sequence into blocks and identifying redundant blocks already stored in the shared architecture. Identifying redundant blocks may include hashing each block and comparing the hash to a local and/or master hash table containing hashes of existing data. Once identified, redundant data previously stored is deleted (e.g., post-process de-duplication), or redundant data is not stored to begin with (e.g., inline de-duplication). In both cases, pointers to shared data blocks can be used to reassemble the digital sequence where one or more blocks were deleted or not stored on the storage system.
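
The inline path described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: fixed-size blocks, SHA-256 as the hash function, and the names `DedupServer`, `StorageSystem`, `write_inline`, and `read` are all hypothetical. Each system checks its local hash table first, then asks the server's master table, and stores a block only when neither knows it.

```python
from __future__ import annotations

import hashlib
from typing import Optional

BLOCK_SIZE = 4  # tiny block size chosen only to keep the example readable (assumption)


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Break a digital sequence into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupServer:
    """Hypothetical stand-in for the server holding the master hash table."""

    def __init__(self) -> None:
        self.master: dict[str, str] = {}  # block hash -> system holding the block

    def lookup(self, digest: str) -> Optional[str]:
        return self.master.get(digest)

    def register(self, digest: str, system: str) -> None:
        self.master.setdefault(digest, system)


class StorageSystem:
    """Hypothetical storage system with a built-in de-duplication client."""

    def __init__(self, name: str, server: DedupServer) -> None:
        self.name = name
        self.server = server
        self.blocks: dict[str, bytes] = {}  # local hash table: hash -> stored block

    def write_inline(self, data: bytes) -> list[tuple[str, str]]:
        """Store only unseen blocks; return (owner system, hash) pointers."""
        recipe = []
        for block in split_blocks(data):
            digest = hashlib.sha256(block).hexdigest()
            if digest in self.blocks:
                owner = self.name                      # duplicate of a local block
            else:
                owner = self.server.lookup(digest)     # duplicate held elsewhere?
                if owner is None:
                    self.blocks[digest] = block        # genuinely new: store it once
                    self.server.register(digest, self.name)
                    owner = self.name
            recipe.append((owner, digest))
        return recipe


def read(recipe: list[tuple[str, str]], systems: dict[str, StorageSystem]) -> bytes:
    """Reassemble a sequence by following its pointers to the shared blocks."""
    return b"".join(systems[owner].blocks[digest] for owner, digest in recipe)
```

In this sketch, writing `b"AAAABBBB"` to one system and then `b"BBBBCCCC"` to another leaves the second system holding only the `CCCC` block plus a pointer to the first system's `BBBB` block.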

152 Citations

20 Claims

1. A method for globally de-duplicating data inline in a shared architecture, the method comprising:
- receiving a digital sequence for storage on a first storage system in a network that includes the first storage system and one or more additional storage systems, wherein the first storage system and each of the one or more additional storage systems include a de-duplication client, wherein the first storage system includes original data and at least a snapshot of the original data;
  
  determining that the digital sequence includes at least one block of data that is not stored in the first storage system by the de-duplication client of the first storage system;
  
  determining that the at least one block of data is a duplicate of a block of data already stored on one of the one or more additional storage systems, wherein the de-duplication client of the first storage system cooperates with a de-duplication server to determine that the at least one block of data is a duplicate of a block of data already stored on one of the one or more additional storage systems; and
  
  storing, on the first storage system, a pointer or reference that points to the block of data already stored on the one of the one or more additional storage systems, wherein the at least one block of data is not stored on the first storage system, wherein a single instance of the at least one block of data is used for the original data and the snapshot in the first storage system and in the one or more additional storage systems.
- Dependent claims (2, 3, 4, 5, 6, 7)
  - 2. The method of claim 1, wherein the storage systems implement copy-on-write to generate snapshots of original data, a WAFL file system, or a combination of copy-on-write and a WAFL file system.
  - 3. The method of claim 1, wherein the first storage system implements a WAFL file system to generate snapshots of original data stored on the first storage system by copying a root inode of the first storage system to create snapshot inodes, the root inode and snapshot inodes each pointing to one or more file inodes and one or more data blocks.
  - 4. The method of claim 3, wherein storing a pointer or reference that points to the block of data already stored on the one of the one or more additional storage systems comprises modifying one or more of the root inode, a snapshot inode, and a file inode to point to the block of data already stored on the one of the one or more additional storage systems.
  - 5. The method of claim 1, wherein determining that the at least one block of data is a duplicate of a block of data already stored on one of the one or more additional storage systems comprises:
    - breaking the digital sequence into a plurality of blocks of data that include the at least one block of data;
      
      performing a hash function on the at least one block of data to obtain a hash value of the at least one block of data;
      
      querying the de-duplication server with the hash value of the at least one block of data, wherein the de-duplication server compares the hash value of the at least one block of data to hash values of existing blocks of data stored on the storage systems; and
      
      receiving a response from the de-duplication server indicating that the at least one block of data is a duplicate of the block of data already stored on the one of the one or more additional storage systems and identifying a location of the block of data already stored on the one of the one or more additional storage systems.
  - 6. The method of claim 1, further comprising:
    - determining that the digital sequence includes one or more blocks of data that are duplicates of one or more blocks of data already stored on the first storage system; and
      
      storing one or more pointers or references that point to the one or more blocks of data already stored on the first storage system such that the duplicate one or more blocks of data need not be stored again on the first storage system.
  - 7. The method of claim 1, wherein each of the storage systems comprises a file server, a filer, or a storage array.
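
Claims 3 and 4 describe snapshots created by copying a root inode, and de-duplication performed by re-pointing an inode at a block held on another storage system. The toy model below illustrates why a single re-pointed block can then serve both the original data and the snapshot. It is not a real WAFL implementation: `RootInode`, `FileInode`, `take_snapshot`, and `repoint` are hypothetical names, and a block address is simplified to a `(system, block id)` pair.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FileInode:
    # Each pointer is (system name, block id): a toy stand-in for a block address.
    pointers: list[tuple[str, str]] = field(default_factory=list)


@dataclass
class RootInode:
    files: dict[str, FileInode] = field(default_factory=dict)


def take_snapshot(root: RootInode) -> RootInode:
    """Copy only the root inode; file inodes and data blocks remain shared."""
    return RootInode(files=dict(root.files))


def repoint(inode: FileInode, index: int, remote: tuple[str, str]) -> None:
    """Redirect one block pointer to a block stored on another system."""
    inode.pointers[index] = remote
```

Because the snapshot root and the active root reference the same `FileInode` object in this model, re-pointing one block pointer is visible through both, so no second copy of the block is needed for the snapshot.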

8. A method for globally de-duplicating data post-process in a shared architecture, the method comprising:
- storing a digital sequence on a first storage system in a network that includes the first storage system and one or more additional storage systems, wherein each of the first storage system and the one or more additional storage systems include a de-duplication client, wherein the first storage system includes original data and at least a snapshot of the original data;
  
  determining that the digital sequence includes at least one block of data that is not already stored in the first storage system by the de-duplication client of the first storage system;
  
  determining that the at least one block of data is a duplicate of a block of data stored on one of the one or more additional storage systems, wherein the de-duplication client of the first storage system cooperates with a de-duplication server to determine that the at least one block of data is a duplicate of a block of data already stored on one of the one or more additional storage systems;
  
  deleting the at least one block of data from the first storage system; and
  
  storing, on the first storage system, a pointer or reference that points to the block of data stored on the one of the one or more additional storage systems, wherein a single instance of the at least one block of data is used for the original data and the snapshot in the first storage system.
- Dependent claims (9, 10, 11, 12, 13, 14)
  - 9. The method of claim 8, wherein the first storage system implements a WAFL file system to generate snapshots of original data stored on the first storage system by copying a root inode of the first storage system to create snapshot inodes, the root inode and snapshot inodes each pointing to one or more file inodes and one or more data blocks.
  - 10. The method of claim 9, wherein storing a pointer or reference that points to the block of data stored on the one of the one or more additional storage systems comprises modifying one or more of the root inode, a snapshot inode, and a file inode to point to the block of data stored on the one of the one or more additional storage systems.
  - 11. The method of claim 8, wherein the storage systems implement copy-on-write to generate snapshots of original data stored on the storage systems, a WAFL file system, or a combination of copy-on-write and a WAFL file system.
  - 12. The method of claim 8, wherein determining that the at least one block of data is a duplicate of a block of data stored on one of the one or more additional storage systems comprises:
    - breaking the digital sequence into a plurality of blocks of data that include the at least one block of data;
      
      performing a hash function on the at least one block of data to obtain a hash value of the at least one block of data;
      
      querying the de-duplication server with the hash value of the at least one block of data, wherein the de-duplication server compares the hash value of the at least one block of data to hash values of existing blocks of data stored on the storage systems; and
      
      receiving a response from the de-duplication server indicating that the at least one block of data is a duplicate of the block of data stored on the one of the one or more additional storage systems and identifying a location of the block of data stored on the one of the one or more additional storage systems.
  - 13. The method of claim 8, wherein each of the storage systems comprises a file server, a filer, or a storage array.
  - 14. The method of claim 8, further comprising:
    - determining that the digital sequence includes one or more blocks of data that are duplicates of one or more blocks of data already stored on the first storage system;
      
      deleting the one or more blocks of data from the first storage system; and
      
      storing one or more pointers or references that point to the one or more blocks of data already stored on the first storage system such that the duplicate one or more blocks of data need not be stored again on the first storage system.
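
The post-process variant in claim 8 differs from the inline method in that the data is stored first and reduced afterwards. A minimal sketch of that later pass, assuming SHA-256 block hashes and plain dictionaries standing in for the local and master tables (`digest_of` and `post_process` are hypothetical names, not taken from the patent):

```python
from __future__ import annotations

import hashlib


def digest_of(block: bytes) -> str:
    """Hash a block; SHA-256 is an assumption, the claims only say 'a hash function'."""
    return hashlib.sha256(block).hexdigest()


def post_process(local: dict[str, bytes], master: dict[str, str], name: str) -> dict[str, str]:
    """Scan blocks already stored on system `name` and drop those held elsewhere.

    local:  block hash -> block bytes stored on this system
    master: block hash -> name of the system holding the single instance
    Returns the pointers created: block hash -> owning system.
    """
    pointers: dict[str, str] = {}
    for digest in list(local):            # copy the keys because we delete while scanning
        owner = master.setdefault(digest, name)
        if owner != name:
            del local[digest]             # delete the redundant local copy
            pointers[digest] = owner      # keep a pointer to the remote instance
    return pointers
```

Blocks that the master table has not seen are registered to the scanning system and left in place, so only blocks already held by another system are deleted.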

15. A system for reducing redundant data across a plurality of storage systems, the system comprising:
- a de-duplication server maintaining a master table or index of data stored on a plurality of storage systems; and
  
  a plurality of de-duplication clients each operating on a corresponding one of the plurality of storage systems to de-duplicate redundant data either stored on or being written to a corresponding storage system relative to data already stored in the plurality of storage systems, wherein the plurality of storage systems includes original data and at least one snapshot of the original data wherein each de-duplication client maintains a local table or index of data for the corresponding storage system, wherein each de-duplication client uses the local table and each de-duplication client coordinates with the de-duplication server to use the master table to de-duplicate the redundant data across the plurality of storage systems, wherein a pointer or reference is used to point to data on the other storage systems when data is determined to be redundant, wherein a single instance of the each block of the data that has been de-duplicated is used for the original data and the snapshots across the plurality of storage systems.
- Dependent claims (16, 17, 18, 19, 20)
  - 16. The system of claim 15, wherein one or more policies can be defined for each de-duplication client that limit the data that each de-duplication client processes, and wherein each policy is based on one or more of a volume, directory, or file associated with the data.
  - 17. The system of claim 15, wherein each de-duplication client maintains the local table or index of data stored on a corresponding storage system.
  - 18. The system of claim 17, wherein each de-duplication client uses its local table or index to de-duplicate redundant data locally on a corresponding storage system.
  - 19. The system of claim 17, wherein the master table or index and the local table or index both represent stored data using hash values or digital signatures of the stored data.
  - 20. The system of claim 15, wherein each de-duplication client de-duplicates redundant data inline as it is being written to a corresponding storage system or post process after it has been stored in a corresponding storage system.
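
Claim 16 limits what each de-duplication client processes through policies based on a volume, directory, or file. One way such a filter might look is sketched below. The `Policy` class and its fields are hypothetical, and the patent does not specify how policies are represented or matched.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Policy:
    """Hypothetical per-client policy: data is processed only if it matches."""

    volumes: frozenset[str] = frozenset()   # whole volumes to de-duplicate
    directories: tuple[str, ...] = ()       # directory trees to de-duplicate
    files: frozenset[str] = frozenset()     # individual files to de-duplicate

    def allows(self, volume: str, path: str) -> bool:
        if volume in self.volumes or path in self.files:
            return True
        parents = PurePosixPath(path).parents
        # A directory rule matches any file located somewhere beneath it.
        return any(PurePosixPath(d) in parents for d in self.directories)
```

A client would consult `allows` before hashing a block, skipping data outside its policy.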

Current Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.)
Original Assignee: EMC Corporation (Dell Technologies Inc.)
Inventors: Yueh, Jedidiah
Primary Examiner(s): Rutz, Jared
Assistant Examiner(s): Bertram, Ryan
Application Number: US11/968,048
Time in Patent Office: 1,611 Days
Field of Search: 711/162, 711/154, 711/159, 711/170
US Class Current: 711/162
CPC Class Codes:
G06F 11/1453   using de-duplication of the...
G06F 12/00   Accessing, addressing or al...
G06F 16/1748   De-duplication implemented ...
G06F 3/0608   Saving storage space on sto...
G06F 3/0641   De-duplication techniques
G06F 3/067   Distributed or networked st...
