Use of Similarity Hash to Route Data for Improved Deduplication in a Storage Server Cluster

US 20110099351A1
Filed: 10/26/2009
Published: 04/28/2011
Est. Priority Date: 10/26/2009
Status: Active Grant

Abstract

A technique for routing data for improved deduplication in a storage server cluster includes computing, for each node in the cluster, a value collectively representative of the data stored on the node, such as a “geometric center” of the node. New or modified data is routed to the node which has stored data identical or most similar to the new or modified data, as determined based on those values. Each node stores a plurality of chunks of data, where each chunk includes multiple deduplication segments. A content hash is computed for each deduplication segment in each node, and a similarity hash is computed for each chunk from the content hashes of all segments in the chunk. A geometric center of a node is computed from the similarity hashes of the chunks stored in the node.
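
The abstract's pipeline (content hashes per segment, one similarity hash per chunk, one geometric center per node) can be illustrated with a minimal Python sketch. The abstract does not say how the similarity hash or the geometric center is computed, so the choices below are assumptions made only for illustration: a SimHash-style per-bit majority vote over truncated SHA-256 content hashes, a per-bit mean as the center, and an L1 distance. All function names and `HASH_BITS` are hypothetical.

```python
import hashlib

HASH_BITS = 64  # assumed hash width, not taken from the patent text


def content_hash(segment: bytes) -> int:
    """Content hash of one deduplication segment (SHA-256 truncated to HASH_BITS)."""
    return int.from_bytes(hashlib.sha256(segment).digest()[: HASH_BITS // 8], "big")


def similarity_hash(segments: list[bytes]) -> int:
    """Assumed SimHash-style vote: bit i is set when most segment hashes set bit i."""
    votes = [0] * HASH_BITS
    for seg in segments:
        h = content_hash(seg)
        for i in range(HASH_BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def geometric_center(chunk_hashes: list[int]) -> list[float]:
    """Assumed center: per-bit mean of the similarity hashes of a node's chunks."""
    n = len(chunk_hashes)
    return [sum((h >> i) & 1 for h in chunk_hashes) / n for i in range(HASH_BITS)]


def distance(sim_hash: int, center: list[float]) -> float:
    """L1 distance between the bit vector of a similarity hash and a node center."""
    return sum(abs(((sim_hash >> i) & 1) - c) for i, c in enumerate(center))


def nearest_node(sim_hash: int, centers: dict[str, list[float]]) -> str:
    """Route to the node whose center is closest to the chunk's similarity hash."""
    return min(centers, key=lambda name: distance(sim_hash, centers[name]))
```

Under these assumptions, a chunk whose similarity hash matches the only chunk on a node has distance zero to that node's center, so it is routed there.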

234 Citations

28 Claims

1. A method comprising:
- receiving, at a storage server cluster, a data segment to be written, the storage server cluster including a plurality of nodes that each store data; and
  
  selecting, at the storage server cluster, one of the nodes to store the received data segment, based on a measure of similarity between a data chunk that contains the received data segment and the data stored in each of the nodes.
  - 2. A method as recited in claim 1, wherein said selecting comprises:
    - using a similarity hash in the storage server cluster to identify said one of the plurality of nodes as being most likely to have already stored a data chunk including the received data segment or data similar to the received data segment.
  - 3. A method as recited in claim 1, further comprising:
    - computing a first value from the data chunk that contains the received data segment;
      
      for each of the plurality of nodes, computing a second value from a plurality of data chunks stored in said node;
      
      for each of the plurality of nodes, determining a distance value representing the measure of similarity between the data chunk that contains the received data segment and the data stored in the node, based on the first value and based on the second value of the node.
  - 4. A method as recited in claim 3, wherein the first value is a similarity hash of the data chunk that contains the received data segment, and the second value for each node is a function of similarity hashes of a plurality of data chunks stored on the node, each said data chunk containing a plurality of data segments.
  - 5. A method as recited in claim 4, wherein the second value for each node represents a geometric center of the similarity hashes of the data chunks stored on the node.
  - 6. A method as recited in claim 4, further comprising:
    - computing a coefficient of each of the nodes, the coefficient for each node being a function of a current load on the node; and
      
      for each of the nodes, computing a measure of attraction of the received data segment to the node as a function of the distance value and the coefficient;
      
      wherein said selecting one of the nodes to store the received data segment comprises selecting said one of the nodes based on the measure of attraction of the received data segment to each of the nodes.
  - 7. A method as recited in claim 6, further comprising updating the coefficient and the geometric center of a node of the storage server cluster in response to changes to data stored in the node.
  - 8. A method as recited in claim 6, wherein the measure of attraction of the received data segment to a given node is proportional to the coefficient of the node divided by the distance value.
  - 9. A method as recited in claim 4, wherein each of the data chunks is a deduplication chunk, and wherein for each of the nodes, computing the second value comprises:
    - grouping pluralities of contiguous deduplication segments stored on the node into a plurality of deduplication chunks;
      
      computing a similarity hash for each deduplication chunk based on content hashes of the deduplication segments within the deduplication chunk; and
      
      computing the second value for the node based on the similarity hashes of the deduplication chunks in the node.
  - 10. A method as recited in claim 9, further comprising:
    - maintaining a data location database in the storage server cluster, the data location database including one entry for each deduplication chunk in the storage server cluster, each said entry indicating a node of the storage server cluster in which the corresponding chunk is stored; and
      
      using the data location database in response to a read request to locate a node of the storage server cluster in which requested data is stored.
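
Claims 3 through 8 combine a per-node distance value with a load-dependent coefficient into a measure of attraction that is proportional to the coefficient divided by the distance. The sketch below is one possible reading, not the patented implementation. It assumes a Hamming distance between hashes, a node center rounded to a single integer hash, a coefficient equal to the fraction of free capacity, and a small `eps` term to avoid division by zero on an exact match. None of these specifics come from the claims.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes (assumed distance value)."""
    return bin(a ^ b).count("1")


def coefficient(used_bytes: int, capacity_bytes: int) -> float:
    """Hypothetical load coefficient: the fraction of capacity still free."""
    return 1.0 - used_bytes / capacity_bytes


def attraction(coeff: float, dist: float, eps: float = 1e-9) -> float:
    """Attraction proportional to coefficient / distance; eps guards dist == 0."""
    return coeff / (dist + eps)


def select_node(chunk_hash: int, nodes: dict[str, tuple[int, int, int]]) -> str:
    """Pick the node with the highest attraction.

    nodes maps a node name to (center_hash, used_bytes, capacity_bytes).
    """
    def score(name: str) -> float:
        center, used, capacity = nodes[name]
        return attraction(coefficient(used, capacity), hamming(chunk_hash, center))

    return max(nodes, key=score)
```

With this scoring, an exact match on a heavily loaded node still wins, while a chunk that is only loosely similar to every node drifts toward the less loaded one.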

11. A method comprising:
- for each of a plurality of nodes in a storage server cluster, computing a value representative of the contents of a plurality of segments of data stored on the node;
  
  receiving, at the storage server cluster, a data segment to be written; and
  
  selecting one of the nodes to receive the data segment according to a deduplication criterion, based on the computed values.
  - 12. A method as recited in claim 11, wherein said selecting comprises:
    - selecting one of the nodes to receive the data segment, based on a measure of similarity between a value computed for a data chunk that contains the data segment and the computed values for the plurality of nodes.
  - 13. A method as recited in claim 11, wherein each said value is computed to represent a geometric center of data stored on the corresponding node.
  - 14. A method as recited in claim 13, wherein selecting one of the nodes to receive the data segment comprises selecting one of the nodes to receive the data segment based on a measure of similarity between a data chunk that contains the data segment and the computed values for the nodes.
  - 15. A method as recited in claim 14, wherein for each of the nodes, computing the value comprises:
    - computing a similarity hash for each of a plurality of data chunks stored in the node, each data chunk containing a plurality of deduplication segments, each similarity hash being a function of content hashes of the deduplication segments within the corresponding data chunk; and
      
      computing the value as a geometric center of all of the similarity hashes computed for the data chunks in the node.
  - 16. A method as recited in claim 11, further comprising:
    - computing a coefficient for each of the nodes, each said coefficient being a function of a current load on an associated one of the nodes; and
      
      automatically moving groups of similar data from current nodes to less loaded nodes based on the coefficients of the nodes.
  - 17. A method as recited in claim 11, further comprising:
    - maintaining a data location database in the storage server cluster, the data location database including a separate entry for each of a plurality of deduplication chunks of data in the storage server cluster, each said entry indicating a node of the storage server cluster in which the corresponding deduplication chunk is stored; and
      
      using the data location database in response to a read request to locate a node of the storage server cluster in which data requested by the read request is stored.
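
The data location database of claim 17 can be pictured as a map with one entry per deduplication chunk, consulted on reads. The following is a minimal in-memory sketch. The key layout (volume name plus chunk index), the fixed `CHUNK_SIZE`, and the class and method names are all hypothetical, since the claim only requires one entry per chunk that indicates the storing node.

```python
CHUNK_SIZE = 32 * 1024  # assumed fixed chunk size in bytes


class DataLocationDB:
    """One entry per deduplication chunk, mapping the chunk to its node."""

    def __init__(self) -> None:
        self._node_of: dict[tuple[str, int], str] = {}

    def record(self, volume: str, chunk_index: int, node: str) -> None:
        """Register (or update) the node that stores the given chunk."""
        self._node_of[(volume, chunk_index)] = node

    def locate(self, volume: str, offset: int) -> str:
        """Return the node holding the chunk that covers a byte offset."""
        return self._node_of[(volume, offset // CHUNK_SIZE)]
```

A read request for a byte offset is resolved by computing the chunk index and looking up the node, for example `db.locate("vol0", 100)` after `db.record("vol0", 0, "node-a")`.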

18. A method comprising:
- in each of a plurality of nodes in a storage server cluster, defining a plurality of chunks from data stored in the node;
  
  defining a plurality of contiguous deduplication segments from each of the chunks;
  
  computing a content hash for each of the deduplication segments;
  
  computing a similarity hash for each of the chunks based on the content hashes of the deduplication segments within the chunk; and
  
  computing a geometric center for the node based on the similarity hashes of the chunks in the node;
  
  receiving, at the storage server cluster, a data segment to be written;
  
  computing a similarity hash value of a chunk that contains the received data segment;
  
  computing a distance value for each of the plurality of nodes, based on the similarity hash of the chunk that contains the received data segment and the similarity hashes of the chunks stored on the node; and
  
  selecting one of the nodes of the storage server cluster to receive the data segment, based on the distance value computed for each of the nodes.
  - 19. A method as recited in claim 18, wherein each distance value represents a measure of similarity between the chunk that contains the received data segment and the data stored in the corresponding node.
  - 20. A method as recited in claim 18, further comprising:
    - computing a coefficient of each of the nodes, the coefficient for each node being a function of a current load on the node;
      
      for each of the nodes, computing a measure of attraction of the received data segment to the node as a function of the distance value and the coefficient;
      
      wherein said selecting one of the nodes to receive the data segment is based on the measure of attraction of the received data segment to each of the nodes.
  - 21. A method as recited in claim 20, wherein the measure of attraction of the received data segment to a given node is proportional to the coefficient of the node divided by the distance value.
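
The first steps of claim 18 (defining chunks, dividing each chunk into contiguous deduplication segments, and computing a content hash per segment) can be sketched as below. The claim does not fix segment or chunk sizes or a hash function, so `SEGMENT_SIZE`, `SEGMENTS_PER_CHUNK`, the fixed-size splitting, and the use of SHA-256 are assumptions for illustration only.

```python
import hashlib

SEGMENT_SIZE = 4096     # assumed deduplication segment size in bytes
SEGMENTS_PER_CHUNK = 8  # assumed number of contiguous segments per chunk


def split_chunks(data: bytes) -> list[list[bytes]]:
    """Split data into chunks, each a list of contiguous fixed-size segments."""
    segments = [data[i:i + SEGMENT_SIZE] for i in range(0, len(data), SEGMENT_SIZE)]
    return [
        segments[i:i + SEGMENTS_PER_CHUNK]
        for i in range(0, len(segments), SEGMENTS_PER_CHUNK)
    ]


def chunk_content_hashes(chunk: list[bytes]) -> list[str]:
    """Content hash (SHA-256 hex digest) of every segment in one chunk."""
    return [hashlib.sha256(segment).hexdigest() for segment in chunk]
```

The per-segment hashes returned by `chunk_content_hashes` are the inputs from which a chunk-level similarity hash would then be derived.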

22. A storage server cluster comprising:
- a plurality of storage server nodes; and
  
  a processor configured to control at least one of the nodes to execute a process which includes receiving a data segment to be written;
  
  using a similarity hash to identify one of the plurality of nodes that is most likely to have already stored a data chunk including the received data segment or data similar to the received data segment; and
  
  sending the received data segment to the identified node for storage.
  - 23. A storage server cluster as recited in claim 22, wherein the process further includes:
    - computing a coefficient for a node as a function of a current load on the node; and
      
      automatically moving a group of similar data from the node to a less loaded node based on the coefficients of the nodes.
  - 24. A storage server cluster as recited in claim 22, wherein the process further includes:
    - maintaining a data location database, the data location database including one entry for each of a plurality of deduplication chunks of data in the storage server cluster, each said entry indicating a node of the storage server cluster in which the corresponding deduplication chunk is stored;
      
      the data location database for use in response to a read request to locate a node of the storage server cluster in which requested data is stored.
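
Claim 23 describes moving a group of similar data from a node to a less loaded node based on the nodes' coefficients. The claims do not specify the decision rule, so the sketch below is only one hypothetical policy. It assumes a higher coefficient means a less loaded node, a `min_gap` threshold below which no move is planned, and groups identified by a hypothetical `group_id`.

```python
def plan_rebalance(coefficients: dict[str, float], min_gap: float = 0.25):
    """Return (source, target) when the load gap justifies a move, else None.

    Assumes a higher coefficient means a less loaded node.
    """
    source = min(coefficients, key=coefficients.get)  # most loaded node
    target = max(coefficients, key=coefficients.get)  # least loaded node
    if coefficients[target] - coefficients[source] < min_gap:
        return None
    return source, target


def move_group(groups: dict[str, dict[str, list[int]]],
               source: str, target: str, group_id: str) -> None:
    """Move one group of similar chunks (a list of chunk ids) between nodes."""
    groups[target][group_id] = groups[source].pop(group_id)
```

Moving whole groups, rather than individual chunks, keeps similar data together on the target node, which is the property the routing scheme relies on.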

25. A storage system comprising:
- a plurality of server nodes, each said node including a network interface through which to communicate over a network with a storage client;
  
  a storage interface through which to communicate with a non-volatile mass storage facility; and
  
  a processor coupled to the network interface and the storage interface, the processor configured to:
  
  compute a similarity hash for each of a plurality of data chunks stored in the node, wherein each said data chunk contains a plurality of deduplication segments, and wherein each similarity hash is a function of content hashes of the deduplication segments within the corresponding data chunk;
  
  compute a geometric center of the node as a function of the similarity hashes of the plurality of data chunks stored in the node;
  
  compute a new similarity hash based on a data segment to be written, received by the node;
  
  compute a similarity measure for each of the plurality of nodes, each said similarity measure being based on the new similarity hash and a geometric center computed for each of the plurality of nodes; and
  
  select one of the nodes of the storage server cluster to store the received data segment, based on the similarity measure computed for each of the nodes.
  - 26. A storage system as recited in claim 25, wherein selection of one of the nodes to store the received data segment comprises:
    - identifying said one of the plurality of nodes as being most likely to have already stored a data chunk including the received data segment or data similar to the received data segment.
  - 27. A storage system as recited in claim 25, wherein selection of one of the nodes to store the received data segment comprises:
    - receive a coefficient computed for each of the nodes, each coefficient being a function of a current load on the corresponding node; and
      
      automatically move a group of similar data to another node based on the coefficients of the nodes.
  - 28. A storage system as recited in claim 25, wherein the processor is further configured to:
    - access a data location database, the data location database including one entry for each of a plurality of data chunks stored in the storage server cluster, each said entry indicating which node of the storage system stores the data chunk; and
      
      use the data location database in response to a read request to locate a node of the storage system in which data requested by the read request is stored.

Current Assignee
NetApp, Inc.
Original Assignee
NetApp, Inc.
Inventors
Condict, Michael N.

Granted Patent

US 8,321,648 B2
US Class Current

711/216
CPC Class Codes

G06F 16/1752   based on file chunks

G06F 2206/1012   Load balancing

G06F 3/0608   Saving storage space on sto...

G06F 3/0641   De-duplication techniques

G06F 3/067   Distributed or networked st...

H04L 67/1095   Replication or mirroring of...
