System and method for enabling de-duplication in a storage system architecture
Abstract
A system and method enable de-duplication in a storage system architecture comprising one or more volumes distributed across a plurality of nodes interconnected as a cluster. De-duplication is enabled by combining file offset indexing with data content redirection. File offset indexing is illustratively embodied as a Locate by offset function, and data content redirection as a novel Locate by content function. Given, inter alia, a data container (file) offset, the Locate by offset function returns a data container (file) index that identifies the storage server responsible for a particular region of the file. The Locate by content function is then invoked to determine the storage server that actually stores the requested data on disk. Notably, the Locate by content function places data on a volume of a storage server based on the content of that data rather than its offset within a file. Because all blocks having identical data content are therefore served by the same storage server, that server can de-duplicate them, conserving storage space on disk and increasing the cache efficiency of memory.
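The two lookups described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the round-robin striping rule, the 4096-byte stripe width, the use of SHA-256, and the names `locate_by_offset` and `locate_by_content` are all assumptions chosen for illustration.

```python
import hashlib

STRIPE_WIDTH = 4096  # assumed stripe size in bytes (not specified in the source)


def locate_by_offset(offset: int, num_servers: int,
                     stripe_width: int = STRIPE_WIDTH) -> int:
    """Return the index of the server responsible for the file region
    containing `offset`, assuming simple round-robin striping."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return (offset // stripe_width) % num_servers


def locate_by_content(block: bytes, num_servers: int) -> int:
    """Return the index of the server that stores `block`, chosen only
    from the block's content (SHA-256 is an assumed hash function)."""
    digest = hashlib.sha256(block).digest()
    return int.from_bytes(digest[:8], "big") % num_servers
```

Because `locate_by_content` depends only on the bytes of the block, two identical blocks written at different offsets, or in different files, map to the same server index, which is the property that makes de-duplication possible.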
247 Citations
26 Claims
1. A method for enabling de-duplication in a storage system architecture, the method comprising:
distributing a plurality of volumes across a plurality of storage servers where the storage servers are interconnected as a cluster;
receiving a write data request to store data at an offset of a file on a first storage server of the plurality of storage servers;
identifying the first storage server that is responsible for the offset of the file;
forwarding the write data request to the first storage server responsible for the offset of the file; and
invoking a data content redirection, by the identified first storage server, to determine, by a hash value on the first storage server, a volume of the first storage server to store the data, the hash value configured to ensure that blocks of data having a same data content are served by a same storage server of the plurality of storage servers to thereby enable data de-duplication.
Dependent claims: 2, 3, 4, 5, 6
7. A system configured to enable de-duplication in a storage system architecture, the system comprising:
a plurality of volumes distributed across a plurality of disk elements, wherein the plurality of disk elements are connected together to form a cluster via a cluster of network elements;
a network element configured to receive a request to access a data of a data container served by the cluster; and
a first disk element configured to service one or more volumes of the plurality of volumes of the cluster in response to receiving the request from the network element,
wherein the network element is further configured to receive a write data request to store data at an offset of a file, execute a locate by offset function to determine the disk element responsible for the offset of the file, and forward the write data request to the disk element responsible for the offset of the file based on the data content of the file; and
wherein the first disk element is further configured to execute a locate by content function to determine a storage location of the data, such that the locate by content function determines, by a hash value on the disk element, which disk element data content is currently stored, the hash value configured to ensure that blocks of data having a same data content are served by a same disk element of the plurality of disk elements to thereby enable data de-duplication.
Dependent claims: 8, 9, 10, 11, 12, 13
14. An apparatus having a plurality of volumes distributed across a plurality of storage servers where the storage servers are interconnected as a cluster, the apparatus configured to enable de-duplication in a storage system architecture, the apparatus comprising:
means for receiving a write data request to store data at an offset of a file on a first storage server of the plurality of storage servers;
means for identifying the first storage server that is responsible for the offset of the file;
means for forwarding the write data request to the first storage server responsible for the offset of the file; and
means for invoking a data content redirection to determine, by a hash value on the first storage server, a volume of the first storage server to store the data, the hash value configured to ensure that blocks of data having a same data content are served by a same storage server of the plurality of storage servers to thereby enable data de-duplication.
Dependent claims: 15, 16, 17, 18
19. A computer readable storage medium containing executable program instructions executed by a processor, comprising:
program instructions that distribute a plurality of volumes across a plurality of storage servers where the storage servers are interconnected as a cluster;
program instructions that receive a write data request to store data at an offset of a file on a storage server of the plurality of storage servers;
program instructions that identify the first storage server that is responsible for the offset of the file;
program instructions that forward the write data request to the identified storage server responsible for the offset of the file; and
program instructions that determine, by a hash value on the first storage server, a volume of the first storage server to store the data, the hash value configured to ensure that blocks of data having a same data content are served by a same storage server of the plurality of storage servers to thereby enable data de-duplication.
Dependent claims: 20
21. A method, comprising:
connecting a plurality of nodes together to form a cluster, wherein each node is configured with a plurality of network elements and a plurality of storage elements;
storing a plurality of volumes across the plurality of nodes, wherein each volume is a logical arrangement of a plurality of storage devices connected to a storage element;
striping a plurality of files across the plurality of volumes, wherein at least one portion of each file is stored on each volume of the plurality of volumes;
receiving a data access request for a region of data;
locating a first storage element responsible for the region of data by a file offset indexing;
locating, by the first storage element, a storage element that physically stores the region of data by a data content redirection, the data content redirection allowing the first storage element to maintain responsibility for the region of the data container regardless of where the data is actually stored; and
utilizing the file offset indexing and the data content to enable data de-duplication by ensuring that blocks of data having a same data content are served by a same storage element of the plurality of storage elements.
Dependent claims: 22, 23
24. A method, comprising:
connecting a plurality of nodes together to form a cluster, wherein each node is configured with a plurality of network elements and a plurality of storage elements;
storing a plurality of volumes across the plurality of nodes, wherein each volume is a logical arrangement of a plurality of storage devices connected to a storage element;
striping a plurality of files across the plurality of volumes, wherein at least one portion of each file is stored on each volume of the plurality of volumes;
receiving a write data request to store data at an offset of a file;
determining a storage element responsible for the offset of the file;
forwarding the write data request to the storage element responsible for the offset of the file; and
determining, by a hash value on the storage element, a volume of the storage element to store the data, the hash value configured to ensure that blocks of data having a same data content are served by a same storage element of the plurality of storage elements to thereby enable data de-duplication.
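The final step of the write path in claim 24, in which a hash value selects the volume of the responsible storage element, might look like the following sketch. The `StorageElement` class, its per-volume dictionaries, and the SHA-256 hash are hypothetical names and choices made for illustration; the source does not specify them, and overwrite and reference-count handling are omitted.

```python
import hashlib


class StorageElement:
    """Hypothetical storage element holding several volumes.

    Each volume is modelled as a dict mapping a content digest to the
    block bytes, so an identical block is stored only once per volume.
    """

    def __init__(self, num_volumes: int):
        self.volumes = [dict() for _ in range(num_volumes)]
        # (file_id, offset) -> (volume index, digest); keeps the element
        # responsible for the offset aware of where the data was placed.
        self.offset_map = {}

    def write(self, file_id: str, offset: int, data: bytes) -> int:
        digest = hashlib.sha256(data).digest()  # assumed hash function
        vol = int.from_bytes(digest[:8], "big") % len(self.volumes)
        # setdefault stores the block only if this content is new.
        self.volumes[vol].setdefault(digest, data)
        self.offset_map[(file_id, offset)] = (vol, digest)
        return vol

    def stored_blocks(self) -> int:
        """Number of physical blocks held across all volumes."""
        return sum(len(v) for v in self.volumes)
```

Writing the same 4 KB block to two different files at different offsets selects the same volume both times and leaves `stored_blocks()` at 1, while `offset_map` still records both logical locations.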
25. A method, comprising:
connecting a plurality of nodes together to form a cluster, wherein each node is configured with one or more network elements and one or more storage elements;
storing a plurality of volumes across the plurality of nodes, wherein each volume is a logical arrangement of a plurality of storage devices connected to a storage element;
striping a plurality of files across the plurality of volumes, wherein at least one portion of each file is stored on each volume of the plurality of volumes;
receiving a read data request to retrieve data at an offset of a file;
determining a location of the data by both the offset of the file and a hash value, the hash value configured to ensure that identical blocks of data having a same data content are served by a same storage element to thereby enable data de-duplication; and
in response to determining the location of the data, retrieving the data to service the read data request.
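The read path of claim 25, which locates data by both the file offset and a hash value, can be sketched as a two-step lookup across storage elements. This is an illustrative model only: the `Cluster` class, the round-robin offset rule, the 4096-byte block size, and SHA-256 are assumptions, and a small write method is included solely so the example is self-contained.

```python
import hashlib

BLOCK = 4096  # assumed block size in bytes


class Cluster:
    """Hypothetical cluster of n storage elements."""

    def __init__(self, n: int):
        self.n = n
        # Per element: (file_id, offset) -> digest, kept by the element
        # responsible for that offset.
        self.index = [dict() for _ in range(n)]
        # Per element: digest -> block, kept by the element that
        # physically stores the content.
        self.store = [dict() for _ in range(n)]

    def _by_offset(self, offset: int) -> int:
        return (offset // BLOCK) % self.n

    def _by_content(self, digest: bytes) -> int:
        return int.from_bytes(digest[:8], "big") % self.n

    def write(self, file_id: str, offset: int, data: bytes) -> None:
        digest = hashlib.sha256(data).digest()
        self.index[self._by_offset(offset)][(file_id, offset)] = digest
        self.store[self._by_content(digest)].setdefault(digest, data)

    def read(self, file_id: str, offset: int) -> bytes:
        owner = self._by_offset(offset)             # step 1: offset indexing
        digest = self.index[owner][(file_id, offset)]
        holder = self._by_content(digest)           # step 2: content redirection
        return self.store[holder][digest]
```

In this model the element found in step 1 remains responsible for the region even when step 2 redirects the read to a different element, and identical blocks from different files resolve to a single stored copy.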
26. A system, comprising:
a plurality of nodes connected together to form a cluster, wherein each node is configured with a plurality of network elements and a plurality of storage elements;
a plurality of volumes stored across the plurality of nodes, wherein each volume is a logical arrangement of a plurality of storage devices connected to a storage element;
a plurality of files striped across the plurality of volumes, wherein at least one portion of each file is stored on each volume of the plurality of volumes; and
a first node of the plurality of nodes is configured to receive a data access request for a region of data for a first file, locate a disk element responsible for the region of data by a file offset indexing, locate a second disk element that stores the region of data by a data content, and determine, by a hash value on the storage volume, a volume of the plurality of volumes to store the data, the hash value configured to ensure that blocks of data having a same data content are served by the same volume of the plurality of volumes, thereby enabling data de-duplication.
Specification