Out-of-core similarity matching

US 8,914,338 B1
Filed: 12/22/2011
Issued: 12/16/2014
Est. Priority Date: 12/22/2011
Status: Active Grant

Abstract

A method for storing data in a data storage system by partitioning the data into a plurality of data chunks and generating representative data for each of the plurality of chunks by applying a predetermined algorithm to each chunk of the plurality of chunks. Subsequently, the representative data is compared and sorted. Representative data for base data chunks and representative data for other data chunks that can be stored relative to the base data chunks are identified by evaluating the sorted set of representative data. Finally, each of the other data chunks identified as those that can be stored relative to a base data chunk are stored in the data storage system as the difference between the data chunk and a base data chunk.
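
The flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the fixed-size chunking in `partition`, the max-of-window-hashes `feature`, and the rule that neighbours must share an equal feature value are all hypothetical choices made for the example, since the abstract only speaks of "a predetermined algorithm" and of evaluating the sorted set.

```python
import hashlib


def partition(data: bytes, size: int) -> list[bytes]:
    """Split data into fixed-size chunks (assumption: the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def feature(chunk: bytes, window: int = 8) -> int:
    """Hypothetical representative data: the largest 64-bit hash over all sliding windows."""
    last_start = max(len(chunk) - window, 0)
    return max(
        int.from_bytes(
            hashlib.blake2b(chunk[i:i + window], digest_size=8).digest(), "big"
        )
        for i in range(last_start + 1)
    )


def candidate_pairs(chunks: list[bytes]) -> list[tuple[int, int]]:
    """Sort chunk indices by feature and pair neighbours that share a feature.

    Returns (target_index, base_index) tuples; the target is the chunk that
    could be stored as a difference against the base.
    """
    feats = [feature(c) for c in chunks]
    order = sorted(range(len(chunks)), key=lambda i: (feats[i], i))
    pairs = []
    for prev, cur in zip(order, order[1:]):
        if feats[prev] == feats[cur]:  # assumption: equal feature means "similar"
            pairs.append((cur, prev))
    return pairs
```

Here the sort runs entirely in memory; the bin-file variant recited in the claims is what makes the same idea work when the representative data does not fit in RAM.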

21 Claims

1. A computer-implemented method for data deduplication, the method comprising:
- in response to a request for compressing data in a data storage system, partitioning the data into a plurality of data chunks, including a target data chunk and a base data chunk;
  
  generating representative data for the target data chunk and the base data chunk by applying a predetermined algorithm to the target data chunk and the base data chunk;
  
  sorting the representative data for the target data chunk and the base data chunk based on similarity of bit patterns of the target data chunk and the base data chunk to form a sorted representative data list, wherein sorting the representative data further includes dividing representative data of the plurality of data chunks into a plurality of bin files where directly adjacent representative data of the plurality of data chunks are placed in a same bin file, each of the plurality of bin files sized to fit within main memory of a data storage system, reading into main memory each of the plurality of bin files, and comparing and sorting each of the plurality of bin files according to a first feature defined in representative data of the plurality of data chunks;
  
  generating a delta data chunk as the difference between the target data chunk and the base data chunk where the representative data of the target chunk is directly adjacent to the representative data of the base data chunk in the sorted representative data list; and
  
  storing the delta data chunk and the base data chunk in the data storage system, wherein the delta data chunk and the base data chunk represent the target data chunk.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
  - 2. The method of claim 1 wherein the base data chunk has multiple features of the representative data similar to multiple features of the representative data of the target data chunk.
  - 3. The method of claim 1 wherein the base data chunk is chosen based on age comparison to the target data chunk.
  - 4. The method of claim 1 wherein the base data chunk is chosen based on physical locality in the data storage system to the target chunk.
  - 5. The method of claim 1, wherein the delta data chunk is stored separately from the target data chunk and the base data chunk.
  - 6. The computer-implemented method of claim 5, wherein the target data chunk is removed from the data storage system after the delta data chunk is generated.
  - 7. The method of claim 1, further comprising:
    - transmitting the delta data chunk and the base data chunk to an auxiliary data storage system.
  - 8. The method of claim 1, further comprising:
    - estimating the compression achievable of delta encoding the target data chunk relative to the base data chunk.
  - 9. The method of claim 1 wherein sorting the representative data occurs for each feature of the representative data such that the representative data in the representative data list is first sorted based on the first feature of the plurality of data chunks and subsequently sorted based on a second feature of the plurality of data chunks, and wherein during each sorting iteration delta data chunks are generated for one or more pairs of data chunks that have directly adjacent representative data in the representative data list.
  - 10. The computer-implemented method of claim 1, wherein generating representative data comprises:
    - inputting a data chunk into a collision-resistant hash function;
      
      receiving from the hash function a hash value for the data chunk;
      
      assigning the hash value to the data chunk.
  - 11. The computer-implemented method of claim 10, wherein generating representative data further comprises:
    - extracting one or more features from each of the plurality of data chunks; and
      
      assigning the features to each of the plurality of data chunks so that representative data includes the features and the hash value.
  - 12. The computer-implemented method of claim 1, wherein the predetermined algorithm extracts one or more features from each of the plurality of chunks.
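
The bin-file sorting step of claim 1 can be illustrated with a short Python sketch. It is only one possible reading, under stated assumptions: representative data is reduced to a hypothetical fixed-width record of `(feature, chunk_id)`, features are unsigned 64-bit values, and bins are formed by range-partitioning the feature space so that records with nearby features land in the same file. The names `write_bins`, `sorted_records`, and `sort_out_of_core` are invented for the example, and choosing `num_bins` so that each bin fits in main memory is left to the caller.

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<QQ")  # hypothetical record: (feature, chunk_id), both uint64
FEATURE_SPACE = 1 << 64


def write_bins(records, num_bins, directory):
    """Range-partition records by feature so close features share a bin file."""
    paths = [os.path.join(directory, f"bin_{b:04d}.dat") for b in range(num_bins)]
    files = [open(p, "wb") for p in paths]
    try:
        for feat, chunk_id in records:
            b = feat * num_bins // FEATURE_SPACE  # bin index in [0, num_bins)
            files[b].write(RECORD.pack(feat, chunk_id))
    finally:
        for f in files:
            f.close()
    return paths


def sorted_records(paths):
    """Read one bin at a time into memory, sort it, and yield its records.

    Because bins cover increasing feature ranges, visiting them in order
    yields a globally sorted list without holding all records at once.
    """
    for path in paths:
        with open(path, "rb") as f:
            raw = f.read()
        bin_records = [
            RECORD.unpack_from(raw, off) for off in range(0, len(raw), RECORD.size)
        ]
        yield from sorted(bin_records)


def sort_out_of_core(records, num_bins):
    """Convenience wrapper that collects the result into a list for inspection."""
    with tempfile.TemporaryDirectory() as directory:
        return list(sorted_records(write_bins(records, num_bins, directory)))
```

For example, `sort_out_of_core([(7, 4), (5, 0), (5, 2)], 4)` returns the records ordered by feature and then chunk id. A real system would consume the generator directly rather than materializing the list, and would size bins from the observed feature distribution.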

13. A non-transitory computer-readable storage medium having instructions stored therein, which when executed by a computer, cause the computer to perform a method for data deduplication, the method comprising:
- in response to a request for compressing data in a data storage system, partitioning the data into a plurality of data chunks, including a target data chunk and base data chunk;
  
  generating representative data for the target data chunk and the base data chunk by applying a predetermined algorithm to the target data chunk and the base data chunk;
  
  sorting the representative data for the target data chunk and the base data chunk based on similarity of bit patterns of the target data chunk and the base data chunk to form a sorted representative data list, wherein sorting the representative data further includes dividing representative data of the plurality of data chunks into a plurality of bin files where directly adjacent representative data of the plurality of data chunks are placed in a same bin file, each of the plurality of bin files sized to fit within main memory of a data storage system, reading into main memory each of the plurality of bin files, and comparing and sorting each of the plurality of bin files according to a feature defined in representative data of the plurality of data chunks;
  
  generating a delta data chunk in the data storage system as the difference between the target data chunk and the base data chunk where the representative data of the target chunk is directly adjacent to the representative data of the base data chunk in the sorted representative data list; and
  
  storing the delta data chunk and the base data chunk in the data storage system, wherein the delta data chunk and the base data chunk represent the target data chunk.
- Dependent claims (14, 15, 16, 17, 18, 19)
  - 14. The non-transitory computer-readable storage medium of claim 13 wherein the base data chunk has multiple features of the representative data similar to multiple features of the representative data of the target data chunk, wherein sorting the representative data occurs for each feature of the representative data such that the representative data in the representative data list is first sorted based on a first feature of the plurality of data chunks and subsequently sorted based on a second feature of the plurality of data chunks, and wherein during each sorting iteration delta data chunks are generated for one or more pairs of data chunks that have directly adjacent representative data in the representative data list.
  - 15. The non-transitory computer-readable storage medium of claim 13 wherein the base data chunk has a similar age as the target chunk.
  - 16. The non-transitory computer-readable storage medium of claim 13, wherein the delta data chunk is stored separately from the target data chunk and the base data chunk.
  - 17. The non-transitory computer-readable storage medium of claim 13, wherein the method further comprises:
    - transmitting the delta data chunk and the base data chunk to an auxiliary data storage system.
  - 18. The non-transitory computer-readable storage medium of claim 13, wherein the method further comprises:
    - estimating the compression achievable by delta encoding the target data chunk relative to the base data chunk.
  - 19. The non-transitory computer-readable storage medium of claim 13, wherein the method further comprises:
    - removing the target data chunk from the data storage system after the delta data chunk is generated.
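
The per-feature sorting iterations recited in claim 14 might look roughly like the following Python sketch. Several details are assumptions made for illustration and are not stated in the claims: each chunk's representative data is a tuple of integer features, two neighbours are paired only when the feature being sorted on is equal, a chunk already stored as a delta is skipped in later passes, and a chunk already serving as a base is never turned into a target (to avoid chains of deltas). The sort is done in memory here as a stand-in for the bin-file sort.

```python
def pair_by_features(records: dict[int, tuple[int, ...]]) -> list[tuple[int, int]]:
    """Pair chunks over successive feature passes.

    records maps chunk_id -> tuple of features (hypothetical layout).
    Returns (target_id, base_id) pairs in the order they were found.
    """
    if not records:
        return []
    num_features = len(next(iter(records.values())))
    targets: set[int] = set()  # chunks already encoded as deltas
    bases: set[int] = set()    # chunks already used as a base
    pairs: list[tuple[int, int]] = []

    for k in range(num_features):
        # Sort remaining chunks on feature k; existing bases sort first
        # among equal features so they keep their role.
        order = sorted(
            (c for c in records if c not in targets),
            key=lambda c: (records[c][k], c not in bases, c),
        )
        base = None
        for cid in order:
            same = base is not None and records[cid][k] == records[base][k]
            if same and cid not in bases:
                pairs.append((cid, base))
                targets.add(cid)
                bases.add(base)
            else:
                base = cid  # start a new run of equal features
    return pairs
```

With `{0: (10, 1), 1: (10, 2), 2: (30, 7), 3: (40, 7)}`, the first pass pairs chunk 1 against chunk 0 on the first feature, and the second pass pairs chunk 3 against chunk 2 on the second feature.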

20. A data storage system, comprising:
- a memory unit to store a chunk storage engine, a compression engine, a comparison and sorting module, a similarity matching module and a delta encoding module;
  
  a processor coupled to the memory unit, the processor configured to execute the chunk storage engine, the compression engine, the comparison and sorting module, the similarity matching module, and the delta encoding module,
  
  the chunk storage engine to partition data into a plurality of data chunks, including a target data chunk and a base data chunk,
  
  the compression engine to generate representative data for the target data chunk and the base data chunk by applying a predetermined algorithm to the target data chunk and the base data chunk,
  
  the comparison and sorting module to sort the representative data for the target data chunk and the base data chunk based on similarity of bit patterns of the target data chunk and the base data chunk to form a sorted representative data list, wherein sorting the representative data by the comparison and sorting module further includes dividing representative data of the plurality of data chunks into a plurality of bin files where proximate representative data of the plurality of data chunks are placed in a same bin file, each of the plurality of bin files sized to fit within main memory of a data storage system, reading into main memory each of the plurality of bin files, and comparing and sorting each of the plurality of bin files according to a feature defined in representative data of the plurality of data chunks,
  
  the similarity matching module to evaluate where representative data of the target chunk are directly adjacent to representative data of the base data chunk in the sorted representative data list,
  
  and the delta encoding module to generate a delta data chunk as the difference between the target data chunk and the base data chunk, wherein the delta data chunk and the base data chunk represent the target data chunk.
- Dependent claims (21)
  - 21. The data storage system of claim 20, wherein the target data chunk is removed from the data storage system after the delta data chunk is generated.
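
To make concrete why the target chunk can be removed once the delta exists, here is a small Python sketch of delta encoding and decoding, with a rough estimate of the achievable compression. It is an illustration only: the claims do not name a delta algorithm, so the standard-library `difflib.SequenceMatcher` is swapped in, and the copy/insert instruction format and the `copy_cost` of 8 bytes per copy instruction are assumptions made for the example.

```python
from difflib import SequenceMatcher


def delta_encode(base: bytes, target: bytes) -> list[tuple]:
    """Express target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # reuse bytes from the base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # bytes absent from the base
        # "delete" emits nothing: those base bytes are simply not copied
    return ops


def delta_decode(base: bytes, ops: list[tuple]) -> bytes:
    """Rebuild the target from the base chunk and the delta instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += base[start:start + length]
        else:
            out += op[1]
    return bytes(out)


def estimated_savings(base: bytes, target: bytes, copy_cost: int = 8) -> float:
    """Fraction of the target's size saved by storing the delta instead.

    copy_cost is a hypothetical per-instruction overhead in bytes.
    """
    if not target:
        return 0.0
    ops = delta_encode(base, target)
    size = sum(copy_cost if op[0] == "copy" else len(op[1]) for op in ops)
    return 1 - size / len(target)
```

Since `delta_decode(base, delta_encode(base, target))` reproduces `target` exactly, the base chunk and the delta together represent the target, which is what allows the original target chunk to be dropped from storage.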

Current Assignee
EMC IP Holding Company LLC (Dell Technologies Inc.)
Original Assignee
EMC Corporation (Dell Technologies Inc.)
Inventors
Wallace, Grant; Shilane, Philip N.; Douglis, Frederick
Primary Examiner(s)
Morrison, Jay
Assistant Examiner(s)
Gortayo, Dangelino N

Application Number

US13/335,416
Time in Patent Office

1,090 Days
Field of Search

707/1, 707/200, 707/752, 707/609, 707/661, 707/668, 707/674, 707/681, 707/705, 707/755, 707/790, 707/802, 707/803, 707/812
US Class Current

707/693
CPC Class Codes

G06F 16/11   File system administration,...

G06F 16/174   Redundancy elimination perf...

G06F 16/1744   using compression, e.g. spa...

G06F 16/1748   De-duplication implemented ...

G06F 16/1752   based on file chunks

G06F 16/22   Indexing; Data structures t...

G06F 16/24556   Aggregation; Duplicate elim...

G06F 3/0608   Saving storage space on sto...

G06F 3/0641   De-duplication techniques

G06F 3/0683   Plurality of storage devices
