Use of Similarity Hash to Route Data for Improved Deduplication in a Storage Server Cluster
Abstract
A technique for routing data for improved deduplication in a storage server cluster includes computing, for each node in the cluster, a value collectively representative of the data stored on the node, such as a “geometric center” of the node. New or modified data is routed to the node which has stored data identical or most similar to the new or modified data, as determined based on those values. Each node stores a plurality of chunks of data, where each chunk includes multiple deduplication segments. A content hash is computed for each deduplication segment in each node, and a similarity hash is computed for each chunk from the content hashes of all segments in the chunk. A geometric center of a node is computed from the similarity hashes of the chunks stored in the node.
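The routing idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the similarity hash or the "geometric center" is actually computed, so everything concrete here is an assumption made for illustration: SHA-256 truncated to 64 bits as the content hash, a SimHash-style majority vote as the similarity hash, a per-bit mean as the geometric center, and an L1 distance as the similarity measure. The function names and node names are hypothetical.

```python
import hashlib

HASH_BITS = 64  # assumed width of content and similarity hashes


def content_hash(segment: bytes) -> int:
    """Content hash of one deduplication segment (assumption: truncated SHA-256)."""
    return int.from_bytes(hashlib.sha256(segment).digest()[:8], "big")


def similarity_hash(segments: list[bytes]) -> int:
    """Similarity hash of a chunk, derived from the content hashes of its segments.

    Assumption: SimHash-style majority vote per bit position.
    """
    counts = [0] * HASH_BITS
    for seg in segments:
        h = content_hash(seg)
        for bit in range(HASH_BITS):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    result = 0
    for bit, count in enumerate(counts):
        if count > 0:
            result |= 1 << bit
    return result


def geometric_center(chunk_hashes: list[int]) -> list[float]:
    """Value representing a node: per-bit mean of its chunks' similarity hashes.

    Assumes the node stores at least one chunk.
    """
    n = len(chunk_hashes)
    return [sum((h >> bit) & 1 for h in chunk_hashes) / n for bit in range(HASH_BITS)]


def distance(sim_hash: int, center: list[float]) -> float:
    """L1 distance between a chunk's similarity hash and a node's center."""
    return sum(abs(((sim_hash >> bit) & 1) - center[bit]) for bit in range(HASH_BITS))


def route(sim_hash: int, centers: dict[str, list[float]]) -> str:
    """Pick the node whose center is closest to the new chunk's similarity hash."""
    return min(centers, key=lambda node: distance(sim_hash, centers[node]))


# Hypothetical two-node cluster, one stored chunk per node.
node_chunks = {
    "node-a": [[b"alpha", b"beta", b"gamma"]],
    "node-b": [[b"delta", b"epsilon", b"zeta"]],
}
centers = {
    node: geometric_center([similarity_hash(chunk) for chunk in chunks])
    for node, chunks in node_chunks.items()
}

# A new chunk identical to the one already stored on node-a.
new_hash = similarity_hash([b"alpha", b"beta", b"gamma"])
print(route(new_hash, centers))
```

Comparing against one center per node keeps the routing cost proportional to the number of nodes rather than the number of stored chunks, which appears to be the motivation for summarizing each node with a single value.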
234 Citations
28 Claims
1. A method comprising:
- receiving, at a storage server cluster, a data segment to be written, the storage server cluster including a plurality of nodes that each store data; and
- selecting, at the storage server cluster, one of the nodes to store the received data segment, based on a measure of similarity between a data chunk that contains the received data segment and the data stored in each of the nodes.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A method comprising:
- for each of a plurality of nodes in a storage server cluster, computing a value representative of the contents of a plurality of segments of data stored on the node;
- receiving, at the storage server cluster, a data segment to be written; and
- selecting one of the nodes to receive the data segment according to a deduplication criterion, based on the computed values.
Dependent claims: 12, 13, 14, 15, 16, 17
18. A method comprising:
- in each of a plurality of nodes in a storage server cluster,
  - defining a plurality of chunks from data stored in the node;
  - defining a plurality of contiguous deduplication segments from each of the chunks;
  - computing a content hash for each of the deduplication segments;
  - computing a similarity hash for each of the chunks based on the content hashes of the deduplication segments within the chunk; and
  - computing a geometric center for the node based on the similarity hashes of the chunks in the node;
- receiving, at the storage server cluster, a data segment to be written;
- computing a similarity hash value of a chunk that contains the received data segment;
- computing a distance value for each of the plurality of nodes, based on the similarity hash of the chunk that contains the received data segment and the similarity hashes of the chunks stored on the node; and
- selecting one of the nodes of the storage server cluster to receive the data segment, based on the distance value computed for each of the nodes.
Dependent claims: 19, 20, 21
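The steps recited in claim 18 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the claim does not fix segment or chunk sizes, the hash functions, or the distance formula. The sketch assumes fixed-size segments, truncated SHA-256 content hashes, a SimHash-style similarity hash, and a per-node distance equal to the minimum Hamming distance between the new chunk's similarity hash and the similarity hashes stored on the node. All names and constants are hypothetical.

```python
import hashlib

SEGMENT_SIZE = 4          # bytes per deduplication segment (tiny, for illustration)
SEGMENTS_PER_CHUNK = 4    # contiguous segments per chunk (assumption)
HASH_BITS = 64            # assumed hash width


def split_chunks(data: bytes) -> list[list[bytes]]:
    """Define chunks from data; each chunk is a list of contiguous segments."""
    segments = [data[i:i + SEGMENT_SIZE] for i in range(0, len(data), SEGMENT_SIZE)]
    return [
        segments[i:i + SEGMENTS_PER_CHUNK]
        for i in range(0, len(segments), SEGMENTS_PER_CHUNK)
    ]


def content_hash(segment: bytes) -> int:
    """Content hash of one segment (assumption: SHA-256 truncated to 64 bits)."""
    return int.from_bytes(hashlib.sha256(segment).digest()[:8], "big")


def similarity_hash(chunk: list[bytes]) -> int:
    """Similarity hash of a chunk from its segments' content hashes.

    Assumption: a bit is set when more than half of the content hashes set it.
    """
    hashes = [content_hash(seg) for seg in chunk]
    result = 0
    for bit in range(HASH_BITS):
        ones = sum((h >> bit) & 1 for h in hashes)
        if 2 * ones > len(hashes):
            result |= 1 << bit
    return result


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def node_distance(new_hash: int, stored_hashes: list[int]) -> int:
    """Distance value for one node (assumption: minimum Hamming distance)."""
    return min(hamming(new_hash, h) for h in stored_hashes)


def select_node(new_chunk: list[bytes], nodes: dict[str, list[int]]) -> str:
    """Select the node with the smallest distance value for the new chunk."""
    new_hash = similarity_hash(new_chunk)
    return min(nodes, key=lambda name: node_distance(new_hash, nodes[name]))


# Hypothetical cluster: each node maps to the similarity hashes of its chunks.
nodes = {
    "n1": [similarity_hash(c) for c in split_chunks(b"aaaabbbbccccdddd")],
    "n2": [similarity_hash(c) for c in split_chunks(b"wwwwxxxxyyyyzzzz")],
}
new_chunk = split_chunks(b"aaaabbbbccccdddd")[0]
print(select_node(new_chunk, nodes))
```

Comparing against every stored similarity hash is the most literal reading of the distance step; a deployment that also maintains a geometric center per node could compare against that single value instead.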
22. A storage server cluster comprising:
- a plurality of storage server nodes; and
- a processor configured to control at least one of the nodes to execute a process which includes
  - receiving a data segment to be written;
  - using a similarity hash to identify one of the plurality of nodes that is most likely to have already stored a data chunk including the received data segment or data similar to the received data segment; and
  - sending the received data segment to the identified node for storage.
Dependent claims: 23, 24
25. A storage system comprising:
- a plurality of server nodes, each said node including
  - a network interface through which to communicate over a network with a storage client;
  - a storage interface through which to communicate with a non-volatile mass storage facility; and
  - a processor coupled to the network interface and the storage interface, the processor configured to:
    - compute a similarity hash for each of a plurality of data chunks stored in the node, wherein each said data chunk contains a plurality of deduplication segments, and wherein each similarity hash is a function of content hashes of the deduplication segments within the corresponding data chunk;
    - compute a geometric center of the node as a function of the similarity hashes of the plurality of data chunks stored in the node;
    - compute a new similarity hash based on a data segment to be written, received by the node;
    - compute a similarity measure for each of the plurality of nodes, each said similarity measure being based on the new similarity hash and a geometric center computed for each of the plurality of nodes; and
    - select one of the nodes of the storage server cluster to store the received data segment, based on the similarity measure computed for each of the nodes.
Dependent claims: 26, 27, 28
Specification