Cluster storage using subsegmenting

US 8,166,012 B2
Filed: 04/09/2008
Issued: 04/24/2012
Est. Priority Date: 04/11/2007
Status: Active Grant

Abstract

Cluster storage is disclosed. A data stream or a data block is received. The data stream or the data block is broken into segments. For each segment, a cluster node is selected, and a portion of the segment smaller than the segment is identified that is a duplicate of a portion of a segment already managed by the cluster node.
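The flow in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the fixed 4-byte portion size, the SHA-256 tags, the `ClusterNode` class, and the `pick_node` routing rule (hash of the first portion of the segment) are all assumptions chosen to keep the example small.

```python
import hashlib

PORTION_SIZE = 4  # hypothetical fixed portion size, tiny for demonstration


class ClusterNode:
    """Hypothetical node that indexes portions by tag and stores segment recipes."""

    def __init__(self):
        self.portions = {}  # tag -> portion bytes (stored once per node)
        self.recipes = {}   # segment id -> ordered list of tags (references)

    def put_segment(self, seg_id, segment):
        """Store a segment; return how many new bytes were actually written."""
        new_bytes = 0
        tags = []
        for i in range(0, len(segment), PORTION_SIZE):
            portion = segment[i:i + PORTION_SIZE]
            tag = hashlib.sha256(portion).hexdigest()
            if tag not in self.portions:
                # Not a duplicate on this node: store the portion itself.
                self.portions[tag] = portion
                new_bytes += len(portion)
            # Duplicate or not, the segment keeps only a reference (the tag).
            tags.append(tag)
        self.recipes[seg_id] = tags
        return new_bytes

    def get_segment(self, seg_id):
        """Reconstruct a segment by following its stored references."""
        return b"".join(self.portions[t] for t in self.recipes[seg_id])


def pick_node(segment, nodes):
    """Assumed routing rule: hash the first portion of the segment content."""
    digest = hashlib.sha256(segment[:PORTION_SIZE]).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


nodes = [ClusterNode(), ClusterNode()]
first = pick_node(b"AAAABBBB", nodes)
second = pick_node(b"AAAACCCC", nodes)
written_first = first.put_segment("s1", b"AAAABBBB")
written_second = second.put_segment("s2", b"AAAACCCC")
```

Because both example segments begin with the same portion, the assumed routing rule sends them to the same node, and the second segment stores only its unseen portion plus a reference to the shared one.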

42 Claims

1. A method for storing data on cluster storage comprising:
- receiving a data stream or a data block;
- breaking the data stream or the data block into segments; and
- for each segment associated with the data stream or the data block:
  - assigning the segment to a cluster node, wherein the cluster node is associated with a cluster storage system comprising at least two cluster nodes and wherein each cluster node is associated with a corresponding storage, wherein the cluster node indexes and stores one or more segments managed by the cluster storage system;
  - breaking the segment into a plurality of portions of the segment, wherein each portion of the segment is smaller than the segment; and
  - identifying one of the plurality of portions of the segment that is a duplicate of a portion of another segment already managed by the assigned cluster node for determining storage of a deduplicated representation of the segment in the cluster node, wherein the identification is based at least in part on using a determined tag associated with the portion of the segment, wherein storing the segment includes at least storing a reference to the portion of the other segment already managed by the cluster node instead of the portion of the segment identified as the duplicate, wherein at least the stored reference is used to reconstruct the segment.
- Dependent claims (2 through 40):
  - 2. A method as in claim 1, further comprising storing the segment using segment data that is not a duplicate of previously stored data.
  - 3. A method as in claim 1, wherein managing a segment includes enabling finding duplicates for portions of the segment within other segments being managed.
  - 4. A method as in claim 1, wherein managing a segment includes storing the deduplicated representation of the segment.
  - 5. A method as in claim 1, wherein the cluster node manages a portion of the segments stored in the cluster.
  - 6. A method as in claim 1, wherein a subsegment is stored on more than one node so that subsegments can be read sequentially.
  - 7. A method as in claim 1, wherein a subsegment reference is caused to be stored on more than one node so that subsegment references can be read sequentially.
  - 8. A method as in claim 1, wherein breaking the data stream or the data block into segments is based at least in part on content of the data stream or the data block.
  - 9. A method as in claim 1, wherein breaking the data stream or the data block into segments is based at least in part on file boundaries within the data stream or the data block.
  - 10. A method as in claim 1, wherein breaking the data stream or the data block into segments is based at least in part on an anchoring function, wherein the anchoring function includes determining if the computed hash meets one or more predetermined criteria.
  - 11. A method as in claim 1, wherein breaking the data stream or the data block into segments is based at least in part on an anchoring function, wherein the anchoring function includes using a value of a function calculated for a plurality of windows within a segmentation window.
  - 12. A method as in claim 1, wherein breaking the data stream or the data block into segments is based at least in part on an anchoring function, wherein the anchoring function includes establishing a boundary in an algorithmic manner in or around a sliding window of bytes.
  - 13. A method as in claim 1, wherein breaking the data stream or the data block into segments comprises identifying the plurality of subsegments, wherein the plurality of subsegments are contiguous or overlapping, and grouping the plurality of subsegments into segments.
  - 14. A method as in claim 13, wherein identifying the plurality of subsegments comprises identifying the plurality of subsegments within a window and selecting a boundary based at least in part on a hash value of a predetermined number of bytes of each of the plurality of subsegments.
  - 15. A method as in claim 14, wherein the boundary is selected based on a maximum hash value of each of the plurality of subsegments.
  - 16. A method as in claim 14, wherein the boundary is selected based on a minimum hash value of each of the plurality of subsegments.
  - 17. A method as in claim 13, wherein identifying the plurality of subsegments is based at least in part on the data stream or the data block content.
  - 18. A method as in claim 13, wherein identifying the plurality of subsegments includes calculating a function which meets a predetermined condition to select a boundary.
  - 19. A method as in claim 13, wherein identifying the plurality of subsegments includes selecting a boundary using anchors.
  - 20. A method as in claim 13, wherein identifying the plurality of subsegments includes selecting a boundary based at least in part on a minimum value or a maximum value of a function within a window.
  - 21. A method as in claim 1, wherein selecting the cluster node is based at least in part on a hash of at least a portion of a content of the segment.
  - 22. A method as in claim 1, wherein selecting the cluster node is based at least in part on a sketch of the segment.
  - 23. A method as in claim 1, wherein selecting the cluster node is based at least in part on a content tag associated with the segment.
  - 24. A method as in claim 1, wherein selecting the cluster node is based at least in part on at least a portion of a content of the segment.
  - 25. A method as in claim 1, wherein selecting the cluster node is based at least in part on the cluster node's remaining storage capacity.
  - 26. A method as in claim 1, further comprising storing one or more tags associated with the segment.
  - 27. A method as in claim 26, wherein the one or more tags include one or more fingerprints.
  - 28. A method as in claim 1, wherein a plurality of tags associated with a plurality of segments are stored together on a cluster node.
  - 29. A method as in claim 1, wherein a plurality of tags associated with a plurality of subsegments are stored together on a cluster node.
  - 30. A method as in claim 1, wherein identifying a portion of the segment smaller than the segment that is a duplicate of a portion of a segment already managed by the cluster node includes identifying one or more previously stored similar segments and determining if an already stored portion of the one or more previously stored similar segments is a duplicate of the portion of the segment.
  - 31. A method as in claim 1, wherein the cluster node includes a summary data structure that is used in the process of assigning a segment to a cluster node.
  - 32. A method as in claim 1, wherein selecting the cluster node is based at least in part on one or more segments that are already stored on the node.
  - 33. A method as in claim 1, wherein selecting the cluster node is based at least in part on one or more similar segments already managed by the node.
  - 34. A method as in claim 1, wherein selecting the cluster node is based at least in part on one or more identical subsegments already managed by the node.
  - 35. A method as in claim 1, wherein the data stream associated with the segment comprises one or more of the following:
    - a file, a plurality of files that are related to each other, a directory of files, or a plurality of segments that are related to each other.
  - 36. A method as in claim 1, wherein the data stream is associated with the segment by providing an indication identifying the data stream to the selected cluster node storing the segment.
  - 37. A method as in claim 1, wherein segments associated with the data stream that are caused to be stored on the selected cluster node are caused to be stored by causing one or more subsegments to be stored wherein the segments associated with the data stream that are caused to be stored on the selected cluster node such that the segments can be retrieved together efficiently.
  - 38. A method as in claim 1, wherein the plurality of subsegments comprising a segment are caused to be stored for efficient retrieval.
  - 39. A method as in claim 1, wherein the plurality of subsegments are caused to be stored by the selected cluster node.
  - 40. A method as in claim 1, wherein the plurality of subsegments are caused to be stored by a replica system.
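
Claims 10 through 12 describe breaking data at boundaries chosen by an anchoring function evaluated over a sliding window of bytes. A minimal Python sketch of that idea follows. The function name `split_at_anchors`, the use of CRC32 as the window hash, the "hash divisible by `divisor`" criterion, and all parameter values are assumptions for illustration; the claims do not prescribe a particular hash or threshold.

```python
import zlib


def split_at_anchors(data, window=8, divisor=16, min_size=16, max_size=128):
    """Split bytes at content-defined boundaries (illustrative sketch only).

    A boundary is declared after a byte position when the CRC32 of the
    preceding `window` bytes is divisible by `divisor`, subject to
    minimum and maximum piece sizes.
    """
    if min_size < window:
        raise ValueError("min_size must be at least window")
    pieces = []
    start = 0
    for end in range(1, len(data) + 1):
        size = end - start
        if size < min_size:
            continue  # piece is still too short to close
        # Recomputed per position for clarity; a real system would likely
        # use a rolling hash so each step costs constant time.
        window_hash = zlib.crc32(data[end - window:end])
        if window_hash % divisor == 0 or size >= max_size:
            pieces.append(data[start:end])
            start = end
    if start < len(data):
        pieces.append(data[start:])  # trailing remainder
    return pieces


sample = bytes(range(256)) * 4
pieces = split_at_anchors(sample)
```

Because boundaries depend on content rather than on fixed offsets, identical runs of bytes tend to produce identical pieces even when they appear at different positions in different streams. The same routine could be applied at two granularities, once with larger sizes for segments and once with smaller sizes for the portions within a segment.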

41. A system for storing data on cluster storage comprising:
- a processor; and
- a memory coupled with the processor, wherein the memory is configured to provide the processor with instructions which when executed cause the processor to:
  - receive a data stream or a data block;
  - break the data stream or the data block into segments; and
  - for each segment associated with the data stream or the data block:
    - assign the segment to a cluster node, wherein the cluster node is associated with a cluster storage system comprising at least two cluster nodes and wherein each cluster node is associated with a corresponding storage, wherein the cluster node indexes and stores one or more segments managed by the cluster storage system;
    - break the segment into a plurality of portions of the segment, wherein each portion of the segment is smaller than the segment; and
    - identify one of the plurality of portions of the segment that is a duplicate of a portion of another segment already managed by the assigned cluster node for determining storage of a deduplicated representation of the segment in the cluster node, wherein the identification is based at least in part on using a determined tag associated with the portion of the segment, wherein storing the segment includes at least storing a reference to the portion of the other segment already managed by the cluster node instead of the portion of the segment identified as the duplicate, wherein at least the stored reference is used to reconstruct the segment.

42. A computer program product for storing data on cluster storage, the computer program product being embodied in a computer readable storage medium and comprising computer instructions for:
- receiving a data stream or a data block;
- breaking the data stream or the data block into segments; and
- for each segment associated with the data stream or the data block:
  - assigning the segment to a cluster node, wherein the cluster node is associated with a cluster storage system comprising at least two cluster nodes and wherein each cluster node is associated with a corresponding storage, wherein the cluster node indexes and stores one or more segments managed by the cluster storage system file;
  - breaking the segment into a plurality of portions of the segment, wherein each portion of the segment is smaller than the segment; and
  - identifying one of the plurality of portions of the segment that is a duplicate of a portion of another segment already managed by the assigned cluster node for determining storage of a deduplicated representation of the segment in the cluster node, wherein the identification is based at least in part on using a determined tag associated with the portion of the segment, wherein storing the segment includes at least storing a reference to the portion of the other segment already managed by the cluster node instead of the portion of the segment identified as the duplicate, wherein at least the stored reference is used to reconstruct the segment.

Current Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.)
Original Assignee: EMC Corporation (Dell Technologies Inc.)
Inventors: Reddy, Sazzala Venkata; Maheshwari, Umesh; Lee, Edward K.; Patterson, R. Hugo
Primary Examiner(s): BROWN, SHEREE N
Application Number: US12/082,247
Publication Number: US 20080270729A1
Time in Patent Office: 1,476 Days
Field of Search: 707/705
US Class Current: 707/705
CPC Class Codes:
G06F 11/2094 Redundant storage or storag...
G06F 16/285 Clustering or classification
