Method and apparatus for block level data de-duplication

US 8,200,923 B1
Filed: 12/31/2008
Issued: 06/12/2012
Est. Priority Date: 12/31/2008
Status: Active Grant

Abstract

Techniques are described for performing de-duplication of data blocks in a computer storage environment. At least one chunking/hashing unit receives input data from a source and processes it to output data blocks and a content address for each block. In one aspect, the chunking/hashing unit outputs all blocks without checking whether any is a duplicate of a block previously stored on the storage environment. In another aspect, each data block is processed by one of a plurality of distributed object addressable storage (OAS) devices, each of which is selected to process data blocks having content addresses within a particular range. The OAS devices determine whether each received data block is a duplicate of another previously stored on the computer storage environment and, when it is not, store the data block.
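
The chunking/hashing step in the abstract can be illustrated with a minimal sketch. The abstract does not name a chunking strategy or a hash function, so the fixed block size and the use of SHA-256 here are assumptions, and `chunk_and_hash` is a hypothetical name. Consistent with the first aspect, the function emits every block and makes no duplicate check of its own.

```python
import hashlib
from typing import Iterator, Tuple


def chunk_and_hash(data: bytes, block_size: int = 4096) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (content_address, block) for every block, duplicates included.

    Assumptions: fixed-size chunking and SHA-256 as the content address;
    neither choice is specified by the source text.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        # The address depends only on the block's content, so identical
        # blocks always receive identical addresses.
        yield hashlib.sha256(block).digest(), block


if __name__ == "__main__":
    for address, block in chunk_and_hash(b"abcdabcdxy", block_size=4):
        print(address.hex()[:16], block)
```

Because the address is derived from content alone, the two `b"abcd"` blocks in the example receive the same address, which is what allows a downstream device to recognize the second as a duplicate.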

20 Claims

1. A computer storage environment comprising:
- at least one chunking/hashing unit that receives input data from at least one source, wherein the at least one chunking/hashing unit processes at least some of the input data to output a plurality of data blocks from the at least some of the input data and a content address for each of the plurality of data blocks, wherein a content address for a corresponding data block is generated based, at least in part, on the content of the corresponding data block; and
  
  a plurality of object addressable storage devices to store at least some of the plurality of data blocks output from the at least one chunking/hashing unit;
  
  wherein the computer storage environment comprises at least one processor programmed to, for each one of the plurality of data blocks output from the at least one chunking/hashing unit, make a determination as to which of the plurality of object addressable storage devices is to control storage of the one of the plurality of data blocks output from the at least one chunking/hashing unit; and
  
  wherein each of the plurality of object addressable storage devices comprises at least one processor programmed to, in response to receipt from the at least one chunking/hashing unit of a received one of the plurality of data blocks:
  
  for received data blocks having content addresses within a particular range, determine whether the received one of the plurality of data blocks is a duplicate of another data block previously stored on the computer storage environment by comparing a content address for the received one of the plurality of data blocks with a data structure including content addresses for data blocks previously stored on the computer storage environment, wherein the size of the particular range is selected to ensure that the data structure including content addresses within the particular range can fit within a memory of the object addressable storage;
  
  control storage of the received one of the plurality of data blocks on the computer storage environment when it is determined that the received one of the plurality of data blocks is not a duplicate of another data block previously stored on the computer storage environment; and
  
  control storage of information indicating that the received one of the plurality of data blocks is represented by data previously stored on the computer storage environment when it is determined that the received one of the plurality of data blocks is a duplicate of another data block previously stored on the computer storage environment.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
- - 2. The computer storage environment of claim 1, wherein the at least one processor on each one of the plurality of object addressable storage devices is programmed to, in response to receipt from the at least one chunking/hashing unit of a received one of the plurality of data blocks:
    - control storage of the received one of the plurality of data blocks on the one of the plurality of object addressable storage devices.
  - 3. The computer storage environment of claim 1, wherein the at least one chunking/hashing unit comprises at least one processor programmed to output every one of the plurality of data blocks to the plurality of object addressable storage devices without making a determination of whether any of the plurality of data blocks is a duplicate of another data block previously stored on the computer storage environment.
  - 4. The computer storage environment of claim 1, wherein the at least one processor in each of the plurality of object addressable storage devices is programmed to, when the content address for the received one of the plurality of data blocks matches a content address for a matching data block previously stored on the computer storage environment, compare the content of the received one of the plurality of data blocks to the content of the matching data block to determine whether the content of the received one of the plurality of data blocks matches the content of the matching data block.
  - 5. The computer storage environment of claim 1, wherein each of the plurality of object addressable storage devices is programmed to make the determination for received data blocks having content addresses within different respective ranges of content addresses.
  - 6. The computer storage environment of claim 1, wherein the at least one processor in each one of the plurality of object addressable storage devices is programmed to determine whether the received one of the plurality of data blocks is a duplicate of another data block previously stored on the one of the plurality of object addressable storage devices by comparing the content address for the received one of the plurality of data blocks with a data structure including content addresses for data blocks previously stored on the one of the plurality of object addressable storage devices.
  - 7. The computer storage environment of claim 1, wherein at least one of the plurality of object addressable storage devices is a disk drive.
  - 8. The computer storage environment of claim 1, wherein the at least one processor in the computer storage environment programmed to make the determination as to which of the plurality of object addressable storage devices is to control storage of the one of the plurality of data blocks output from the at least one chunking/hashing unit is programmed to make the determination based upon the content address of the one of the plurality of data blocks output from the at least one chunking/hashing unit.
  - 9. The computer storage environment of claim 8, wherein the at least one chunking/hashing unit comprises the at least one processor programmed to make the determination as to which of the plurality of object addressable storage devices is to control storage of the one of the plurality of data blocks output from the at least one chunking/hashing unit based upon the content address of the one of the plurality of data blocks output from the at least one chunking/hashing unit.
  - 10. The computer storage environment of claim 1, further comprising the at least one source of the input data, wherein the at least one source comprises at least one backup server configured to back up data stored on at least one primary storage system.
  - 11. The computer storage environment of claim 1, wherein the data structure is a hash table or a tree structure.
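
Claims 5, 8 and 9 describe choosing the responsible device from the content address, with each device covering a different range of addresses. One way this could look is sketched below. The equal-width, contiguous partitioning and the name `select_device` are assumptions for illustration; the claims do not prescribe how ranges are laid out.

```python
def select_device(content_address: bytes, num_devices: int) -> int:
    """Map a content address to the index of the device owning its range.

    Assumption: the address space is split into num_devices equal,
    contiguous ranges, and the first 8 bytes of the address are enough
    to locate the range.
    """
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    if len(content_address) < 8:
        raise ValueError("content address must be at least 8 bytes")
    prefix = int.from_bytes(content_address[:8], "big")  # value in [0, 2**64)
    # Scale the prefix onto [0, num_devices); integer math avoids float error.
    return (prefix * num_devices) >> 64
```

Since the mapping is a pure function of the address, any chunking/hashing unit can compute it locally, and identical blocks always reach the same device, so each device only needs to consult its own index.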

12. An object addressable storage system for use in a computer storage environment that includes at least one chunking/hashing unit that receives input data from at least one source and processes at least some of the input data to output a plurality of data blocks from the at least some of the input data and a content address for each of the plurality of data blocks, wherein a content address for a corresponding data block is generated based, at least in part, on the content of the corresponding data block, the storage system comprising:
- at least one storage medium; and
  
  at least one processor programmed to:
  
  provide an object addressable storage interface that receives at least some of the plurality of data blocks output from the at least one chunking/hashing unit; and
  
  in response to receipt from the at least one chunking/hashing unit of a received one of the plurality of data blocks:
  
  determine whether the received one of the plurality of data blocks is a duplicate of another data block previously stored on the object addressable storage system by comparing the content address for the received one of the plurality of data blocks with a data structure including content addresses for data blocks previously stored on the object addressable storage system; and
  
  store the received one of the plurality of data blocks on the at least one storage medium when it is determined that the received one of the plurality of data blocks is not a duplicate of another data block previously stored on the object addressable storage system,
  
  wherein the at least one processor is configured to determine whether the received one of the plurality of data blocks is a duplicate of another data block previously stored on the object addressable storage system for received data blocks having content addresses within a particular range, and
  
  wherein the size of the range is selected to ensure that the data structure including content addresses within the range can fit within a memory of the object addressable storage system.
- Dependent Claims (13, 14, 15, 16, 17)
- - 13. The object addressable storage system of claim 12, wherein the at least one processor is programmed to, when the content address for the received one of the plurality of data blocks matches a content address for a matching data block previously stored on the object addressable storage system, compare the content of the received one of the plurality of data blocks to the content of the matching data block to determine whether the content of the received one of the plurality of data blocks matches the content of the matching data block.
  - 14. The object addressable storage system of claim 12, wherein at least one of the plurality of object addressable storage devices is a disk drive.
  - 15. The object addressable storage system of claim 12, wherein the data structure is a hash table or a tree structure.
  - 16. The object addressable storage system of claim 12, wherein the at least one processor is programmed to control storage of the received one of the plurality of data blocks on the one of the plurality of object addressable storage devices.
  - 17. The object addressable storage system of claim 12, wherein the at least one source comprises at least one backup server configured to back up data stored on at least one primary storage system.
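
The device-side behavior in claims 12, 13 and 15 can be sketched as follows: an in-memory hash table of content addresses is consulted on receipt, matching addresses are confirmed by comparing content, and only non-duplicates are written. `OASDevice`, its attributes, and the choice to raise an error on a collision are hypothetical; a Python list stands in for the storage medium.

```python
import hashlib
from typing import Dict, List


class OASDevice:
    """Hypothetical model of one object addressable storage device."""

    def __init__(self) -> None:
        self.medium: List[bytes] = []        # stands in for the storage medium
        self.index: Dict[bytes, int] = {}    # content address -> position in medium
        self.references: List[bytes] = []    # addresses of blocks received again

    def receive(self, address: bytes, block: bytes) -> bool:
        """Return True if the block was stored, False if it was a duplicate."""
        position = self.index.get(address)
        if position is None:
            # Address not in the index: the block is new, so store it.
            self.index[address] = len(self.medium)
            self.medium.append(block)
            return True
        # Address matched: compare content to rule out a hash collision.
        if self.medium[position] != block:
            raise ValueError("same content address, different content")
        # Duplicate: record only that the block is represented by stored data.
        self.references.append(address)
        return False


if __name__ == "__main__":
    device = OASDevice()
    addr = hashlib.sha256(b"hello").digest()
    print(device.receive(addr, b"hello"), device.receive(addr, b"hello"))
```

The dictionary plays the role of the hash table named in claim 15; a tree structure would serve equally well, at the cost of logarithmic rather than expected constant-time lookups.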

18. A method comprising acts of:
- (A) processing at least some received input data to output a plurality of data blocks from the at least some of the input data and a content address for each of the plurality of data blocks, wherein a content address for a corresponding data block is generated based, at least in part, on the content of the corresponding data block;
  
  (B) processing each one of the plurality of data blocks at one of a plurality of object addressable storage devices, determined from among the plurality of object addressable storage devices based upon the content address of the one of the plurality of data blocks being within a particular range, to determine whether the one of the plurality of data blocks is a duplicate of another data block previously stored on the plurality of object addressable storage devices by comparing the content address for the received one of the plurality of data blocks with a data structure including content addresses for data blocks previously stored on the computer storage environment, wherein the size of the particular range is selected to ensure that the data structure including content addresses within the particular range can fit within a memory of the object addressable storage device; and
  
  (C) storing on at least one of the plurality of object addressable storage devices each one of the plurality of data blocks determined in the act (B) to not be a duplicate of another data block previously stored on the plurality of object addressable storage devices without storing a data block of the plurality of data blocks that is determined in the act (B) to be a duplicate of another data block previously stored on the plurality of object addressable storage devices.
- Dependent Claims (19, 20)
- - 19. The method of claim 18, wherein each of the plurality of object addressable storage devices is programmed to make the determination for data blocks having content addresses within different respective ranges of content addresses.
  - 20. The method of claim 18, wherein the act (B) comprises, when the content address for the one of the plurality of data blocks matches a content address for a matching data block previously stored on the plurality of object addressable storage devices, comparing the content of the one of the plurality of data blocks to the content of the matching data block to determine whether the content of the one of the plurality of data blocks matches the content of the matching data block.
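
Act (B) requires that each range be sized so that its index of content addresses fits in a device's memory. A rough sizing calculation might look like the sketch below. It assumes content addresses are spread uniformly across ranges and that every index entry has a fixed size; `ranges_needed` and all three parameters are hypothetical and do not come from the claims.

```python
def ranges_needed(expected_unique_blocks: int,
                  entry_bytes: int,
                  index_memory_bytes: int) -> int:
    """Estimate how many address ranges (one per device) are required.

    Assumptions: addresses are uniformly distributed, each index entry
    costs entry_bytes, and index_memory_bytes is the memory each device
    can dedicate to its index.
    """
    if min(expected_unique_blocks, entry_bytes, index_memory_bytes) <= 0:
        raise ValueError("all arguments must be positive")
    entries_per_device = index_memory_bytes // entry_bytes
    if entries_per_device == 0:
        raise ValueError("memory budget cannot hold a single index entry")
    # Ceiling division without floating point.
    return -(-expected_unique_blocks // entries_per_device)
```

For instance, with one billion expected unique blocks, a hypothetical 48 bytes per entry (a 32-byte address plus bookkeeping) and 8 GiB of index memory per device, each device can hold 178,956,970 entries, so `ranges_needed(10**9, 48, 8 * 2**30)` returns 6.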

Current Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.)
Original Assignee: EMC Corporation (Dell Technologies Inc.)
Inventors: Dunbar, J. Michael; Kallat, Avinash; Healey, Michael W.; Fishman, Michael Craig
Primary Examiner(s): Dudek, Jr., Edward
Assistant Examiner(s): Verderamo, III, Ralph A
Application Number: US12/347,447
Time in Patent Office: 1,259 Days
Field of Search: None
US Class Current: 711/162
CPC Class Codes:
G06F 11/1453   using de-duplication of the...
G06F 2201/83   the solution involving sign...
G06F 3/0608   Saving storage space on sto...
G06F 3/0641   De-duplication techniques
G06F 3/0683   Plurality of storage devices
