Method and apparatus for block level data de-duplication
First Claim
1. A computer storage environment comprising:
at least one chunking/hashing unit that receives input data from at least one source, wherein the at least one chunking/hashing unit processes at least some of the input data to output a plurality of data blocks from the at least some of the input data and a content address for each of the plurality of data blocks, wherein a content address for a corresponding data block is generated based, at least in part, on the content of the corresponding data block; and
a plurality of object addressable storage devices to store at least some of the plurality of data blocks output from the at least one chunking/hashing unit;
wherein the computer storage environment comprises at least one processor programmed to, for each one of the plurality of data blocks output from the at least one chunking/hashing unit, make a determination as to which of the plurality of object addressable storage devices is to control storage of the one of the plurality of data blocks output from the at least one chunking/hashing unit; and
wherein each of the plurality of object addressable storage devices comprises at least one processor programmed to, in response to receipt from the at least one chunking/hashing unit of a received one of the plurality of data blocks:
for received data blocks having content addresses within a particular range, determine whether the received one of the plurality of data blocks is a duplicate of another data block previously stored on the computer storage environment by comparing a content address for the received one of the plurality of data blocks with a data structure including content addresses for data blocks previously stored on the computer storage environment, wherein the size of the particular range is selected to ensure that the data structure including content addresses within the particular range can fit within a memory of the object addressable storage;
control storage of the received one of the plurality of data blocks on the computer storage environment when it is determined that the received one of the plurality of data blocks is not a duplicate of another data block previously stored on the computer storage environment; and
control storage of information indicating that the received one of the plurality of data blocks is represented by data previously stored on the computer storage environment when it is determined that the received one of the plurality of data blocks is a duplicate of another data block previously stored on the computer storage environment.
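The routing step in the claim (deciding which object addressable storage device controls storage of a block) can be illustrated with a minimal sketch. It assumes the content address is a fixed-length byte string and that the address space is split into equal, contiguous ranges, one per device. The name `device_for` and the equal-width split are hypothetical choices for illustration; the claim does not prescribe them.

```python
def device_for(content_address: bytes, num_devices: int) -> int:
    """Return the index of the device whose address range contains the address.

    Assumption: the address space [0, 2**bits) is divided into num_devices
    equal contiguous ranges, so device i owns range i.
    """
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    if not content_address:
        raise ValueError("content_address must not be empty")
    space = 1 << (8 * len(content_address))          # size of the address space
    value = int.from_bytes(content_address, "big")   # address as an integer
    return value * num_devices // space              # integer math, no rounding


# A block's content address always maps to the same device, so every copy
# of the same content is checked against the same device's index.
lowest = b"\x00" * 32
highest = b"\xff" * 32
print(device_for(lowest, 4), device_for(highest, 4))
```

Because the mapping depends only on the content address, any unit that computes the address can make the routing decision without consulting the devices.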
Abstract
Techniques for performing de-duplication for data blocks in a computer storage environment. At least one chunking/hashing unit receives input data from a source and processes it to output data blocks and content addresses for them. In one aspect, the chunking/hashing unit outputs all blocks without checking whether any is a duplicate of a block previously stored on the storage environment. In another aspect, each data block is processed by one of a plurality of distributed object addressable storage (OAS) devices, each of which is selected to process data blocks having content addresses within a particular range. The OAS devices determine whether each received data block is a duplicate of another previously stored on the computer storage environment and, when it is not, store the data block.
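As a rough illustration of the chunking/hashing unit described in the abstract, the sketch below splits input into blocks and derives a content address from each block's content. Fixed-size blocks, SHA-256, and the name `chunk_and_hash` are assumptions made for the example; the abstract does not name a chunking scheme or hash function.

```python
import hashlib
from typing import Iterator, Tuple


def chunk_and_hash(data: bytes, block_size: int = 4096) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (content_address, block) pairs for consecutive blocks of data.

    Assumptions: fixed-size blocks and SHA-256 digests as content addresses.
    Every block is emitted; no duplicate check happens at this stage.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        yield hashlib.sha256(block).digest(), block


# Two identical 4096-byte blocks followed by a short tail block.
pairs = list(chunk_and_hash(b"A" * 8192 + b"B" * 100, block_size=4096))
print(len(pairs), pairs[0][0] == pairs[1][0])
```

Identical blocks yield identical content addresses, which is what lets a downstream device detect duplicates by comparing addresses alone.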
20 Claims
12. An object addressable storage system for use in a computer storage environment that includes at least one chunking/hashing unit that receives input data from at least one source and processes at least some of the input data to output a plurality of data blocks from the at least some of the input data and a content address for each of the plurality of data blocks, wherein a content address for a corresponding data block is generated based, at least in part, on the content of the corresponding data block, the storage system comprising:
at least one storage medium; and
at least one processor programmed to:
provide an object addressable storage interface that receives at least some of the plurality of data blocks output from the at least one chunking/hashing unit; and
in response to receipt from the at least one chunking/hashing unit of a received one of the plurality of data blocks:
determine whether the received one of the plurality of data blocks is a duplicate of another data block previously stored on the object addressable storage system by comparing the content address for the received one of the plurality of data blocks with a data structure including content addresses for data blocks previously stored on the object addressable storage system; and
store the received one of the plurality of data blocks on the at least one storage medium when it is determined that the received one of the plurality of data blocks is not a duplicate of another data block previously stored on the object addressable storage system,
wherein the at least one processor is configured to determine whether the received one of the plurality of data blocks is a duplicate of another data block previously stored on the object addressable storage system for received data blocks having content addresses within a particular range, and
wherein the size of the range is selected to ensure that the data structure including content addresses within the range can fit within a memory of the object addressable storage system.
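The duplicate check performed by a single object addressable storage system can be sketched as follows. This is a toy model under stated assumptions: the class `OASDevice`, its `put` method, the in-memory dictionary standing in for the storage medium, and the per-address reference count are all hypothetical and are not taken from the claim.

```python
import hashlib
from typing import Dict


class OASDevice:
    """Toy device responsible for content addresses in [low, high)."""

    def __init__(self, low: int, high: int) -> None:
        self.low, self.high = low, high
        self.index: Dict[bytes, int] = {}     # content address -> reference count
        self.medium: Dict[bytes, bytes] = {}  # stand-in for the storage medium

    def owns(self, address: bytes) -> bool:
        return self.low <= int.from_bytes(address, "big") < self.high

    def put(self, address: bytes, block: bytes) -> bool:
        """Store the block if it is new; return True when it was stored."""
        if not self.owns(address):
            raise ValueError("content address outside this device's range")
        if address in self.index:
            self.index[address] += 1   # duplicate: record the reference only
            return False
        self.index[address] = 1
        self.medium[address] = block   # first copy: write to the medium
        return True


device = OASDevice(0, 1 << 256)  # a single device owning the whole SHA-256 space
address = hashlib.sha256(b"payload").digest()
print(device.put(address, b"payload"), device.put(address, b"payload"))
```

The lookup touches only the in-memory index, which is why the claim ties the size of the address range to the memory available on the device.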
18. A method comprising acts of:
(A) processing at least some received input data to output a plurality of data blocks from the at least some of the input data and a content address for each of the plurality of data blocks, wherein a content address for a corresponding data block is generated based, at least in part, on the content of the corresponding data block;
(B) processing each one of the plurality of data blocks at one of a plurality of object addressable storage devices, determined from among the plurality of object addressable storage devices based upon the content address of the one of the plurality of data blocks being within a particular range, to determine whether the one of the plurality of data blocks is a duplicate of another data block previously stored on the plurality of object addressable storage devices by comparing the content address for the received one of the plurality of data blocks with a data structure including content addresses for data blocks previously stored on the computer storage environment, wherein the size of the particular range is selected to ensure that the data structure including content addresses within the particular range can fit within a memory of the object addressable storage device; and
(C) storing on at least one of the plurality of object addressable storage devices each one of the plurality of data blocks determined in the act (B) to not be a duplicate of another data block previously stored on the plurality of object addressable storage devices without storing a data block of the plurality of data blocks that is determined in the act (B) to be a duplicate of another data block previously stored on the plurality of object addressable storage devices.
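The requirement that the range be sized so the index fits in device memory can be made concrete with a small calculation. The sketch assumes content addresses are spread uniformly over the address space, so equal ranges receive roughly equal numbers of entries. The function `min_ranges` and the figures used (entry size, memory budget, block count) are hypothetical illustrations, not values from the patent.

```python
def min_ranges(expected_unique_blocks: int, entry_bytes: int, memory_budget_bytes: int) -> int:
    """Smallest number of equal address ranges whose per-range index fits in memory.

    Assumption: unique blocks are distributed evenly across ranges.
    """
    if expected_unique_blocks < 0:
        raise ValueError("expected_unique_blocks must be non-negative")
    if entry_bytes <= 0:
        raise ValueError("entry_bytes must be positive")
    entries_per_range = memory_budget_bytes // entry_bytes
    if entries_per_range <= 0:
        raise ValueError("memory budget cannot hold even one index entry")
    # Integer ceiling division avoids floating-point error on large counts.
    needed = -(-expected_unique_blocks // entries_per_range)
    return max(1, needed)


# Illustrative figures: 10**9 unique blocks, 64-byte index entries,
# and 16 GiB of index memory per device.
print(min_ranges(10**9, 64, 16 * 2**30))  # 4
```

With those figures each device can hold 2**28 entries, so three ranges would overflow the budget and four would not.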
Specification