SYSTEM AND METHOD FOR ORGANIZING DATA TO FACILITATE DATA DEDUPLICATION
First Claim
1. A method comprising:
- dividing a set of data which is defined as a plurality of blocks into a plurality of chunks in a network storage system, wherein boundaries of the chunks are independent of boundaries of the blocks; and
storing metadata of the set of data, including pointers for locating the data, in a hierarchical structure in the network storage system, the hierarchical structure including a plurality of levels, each level including at least one node;
wherein a lowest level of the plurality of levels includes a plurality of nodes that each contain chunk metadata, and in each said node that contains chunk metadata, the chunk metadata identifies at least one of the chunks.
1 Assignment
0 Petitions
Accused Products
Abstract
A technique for organizing data to facilitate data deduplication includes dividing a block-based set of data into multiple “chunks”, where the chunk boundaries are independent of the block boundaries (due to the hashing algorithm). Metadata of the data set, such as block pointers for locating the data, are stored in a tree structure that includes multiple levels, each of which includes at least one node. The lowest level of the tree includes multiple nodes that each contain chunk metadata relating to the chunks of the data set. In each node of the lowest level of the buffer tree, the chunk metadata contained therein identifies at least one of the chunks. The chunks (user-level data) are stored in one or more system files that are separate from the buffer tree and not visible to the user.
393 Citations
24 Claims
-
1. A method comprising:
-
dividing a set of data which is defined as a plurality of blocks into a plurality of chunks in a network storage system, wherein boundaries of the chunks are independent of boundaries of the blocks; and storing metadata of the set of data, including pointers for locating the data, in a hierarchical structure in the network storage system, the hierarchical structure including a plurality of levels, each level including at least one node; wherein a lowest level of the plurality of levels includes a plurality of nodes that each contain chunk metadata, and in each said node that contains chunk metadata, the chunk metadata identifies at least one of the chunks. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. A method of storing data in a network storage system to facilitate data deduplication, the method comprising:
-
determining a plurality of anchor points for a set of data defined as a plurality of blocks in the network storage system, wherein the anchor points are independent of boundaries of the plurality of blocks; dividing the set of data into a plurality of chunks according to the plurality of anchor points; writing the plurality of chunks into a plurality of chunk files; storing metadata including block pointers of the set of data in a hierarchical structure in the network storage system, the hierarchical structure including a plurality of levels, each said level including at least one node, wherein a lowest level of the plurality of levels includes a plurality of nodes that each store chunk metadata, wherein in each said node that contains chunk metadata the chunk metadata identifies at least one chunks in the plurality chunk files, wherein the nodes in the lowest level of the hierarchical structure do not contain any portion of the set of data; and sharing a chunk, of the plurality of chunks, between two files to reduce duplication of data in said chunk, wherein each of the two files is represented by a hierarchical structure that includes a lowest-level node that includes chunk metadata identifying the shared chunk. - View Dependent Claims (10)
-
-
11. A method comprising:
-
receiving at a network storage server a first request for data stored in a file system of the network storage server, wherein the data is part of a set of data defined in terms of a plurality of blocks, the first request specifying a file block number of the data and a root node identifier of a root node containing metadata of the data; in response to the first request, retrieving the data from a stable storage of the network storage server into a buffer cache of the network storage server and sending the data to a requester; receiving a second request for said data at the network storage server, the second request specifying a file block number of the data and a root node identifier of a root node containing metadata of the data, wherein the file block number and the root node identifier specified by the second request are different from, respectively, the file block number and the root node identifier specified by the first request; and in response to the second request, determining that the data is already in the buffer cache, and providing the data from the buffer cache to a sender of the second request without having to reload the data into the buffer cache. - View Dependent Claims (12)
-
-
13. A storage controller comprising:
-
a communication interface through which to communicate with a storage client over a network; a storage interface through which to communicate with a stable storage facility; a processor coupled to the communication interface and the storage interface; and a storage medium containing code which, when executed by the processor, causes the storage controller to perform a process that includes dividing a set of data into a plurality of chunks according to a plurality of anchor points, the data set including a plurality of blocks, wherein the anchor points are independent of boundaries of the blocks, and storing metadata, including pointers for locating the data, in a hierarchical structure, the hierarchical structure including a plurality of levels, each level including at least one node, wherein a lowest level of the plurality of levels includes a plurality of nodes that each contain chunk metadata, wherein in each said node that contains chunk metadata, the chunk metadata identifies at least one of the chunks. - View Dependent Claims (14, 15, 16, 17, 18, 19, 20)
-
-
21. A network storage system comprising:
-
means for communicating with a plurality of storage clients over a network; means for determining a plurality of anchor points for a set of data defined as a plurality of blocks; means for dividing the set of data into a plurality of chunks according to the plurality of anchor points, wherein boundaries of the chunks are independent of boundaries of the plurality of blocks; means for writing the plurality of chunks into a plurality of chunk files; means for storing metadata including block pointers of the set of data in a hierarchical structure in the network storage system, the hierarchical structure including a plurality of levels, each said level including at least one node, wherein a lowest level of the plurality of levels includes a plurality of nodes that each store chunk metadata, wherein in each said node that contains chunk metadata the chunk metadata identifies at least one chunk in the plurality chunk files; and means for sharing a chunk, of the plurality of chunks, between two files to reduce duplication of data in said chunk, wherein each of the two files is represented by a hierarchical structure that includes a lowest-level node that includes chunk metadata identifying the shared chunk. - View Dependent Claims (22)
-
-
23. A machine-readable storage medium storing instructions which, when executed by a machine, cause the machine to perform a method of storing data in a network storage system to facilitate data deduplication, the method comprising:
-
determining a plurality of anchor points for a set of data defined as a plurality of blocks in the network storage system, wherein the anchor points are independent of boundaries of the plurality of blocks; dividing the set of data into a plurality of chunks according to the plurality of anchor points; writing the plurality of chunks into a plurality of chunk files; storing metadata including block pointers of the set of data in a hierarchical structure in the network storage system, the hierarchical structure including a plurality of levels, each said level including at least one node, wherein a lowest level of the plurality of levels includes a plurality of nodes that each store chunk metadata, wherein in each said node that contains chunk metadata the chunk metadata identifies at least one chunk in the plurality chunk files; and sharing a chunk, of the plurality of chunks, between two files to reduce duplication of data in said chunk, wherein each of the two files is represented by a hierarchical structure that includes a lowest-level node that includes chunk metadata identifying the shared chunk. - View Dependent Claims (24)
-
Specification