DATA DEDUPLICATION FOR STREAMING SEQUENTIAL DATA STORAGE APPLICATIONS

US 20110185149A1
Filed: 01/27/2010
Published: 07/28/2011
Est. Priority Date: 01/27/2010
Status: Active Grant

Abstract

Data deduplication compression in a streaming storage application is provided. The disclosed deduplication process provides a deduplication archive that can be stored to, and extracted from, a streaming storage medium. One implementation involves compressing fully sequential data stored in a data repository to a sequential streaming storage by: splitting the fully sequential data into data blocks; hashing the content of each data block and comparing each hash to an in-memory lookup table for a match, the in-memory lookup table storing all hashes that have been encountered during the compression of the fully sequential data; for each data block without a hash match, adding the data block as a new data block for compression of the fully sequential data; and encoding duplicate data blocks into data segments using the in-memory lookup table.
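
The compression steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the fixed block size, the choice of SHA-256, the record tuples, and the name `dedup_compress` are all assumptions made for illustration. The lookup table maps each hash seen so far to the offset of its first occurrence in the original stream, so a duplicate block can be encoded as a (position, length) reference instead of being stored again.

```python
import hashlib

def dedup_compress(data: bytes, block_size: int = 4):
    """Split `data` into fixed-size blocks and deduplicate them in one pass.

    Returns a list of records (hypothetical format):
      ("new", block_bytes)      for a block whose hash was not seen before
      ("ref", offset, length)   for a duplicate of an earlier block
    """
    seen = {}      # in-memory lookup table: hash -> offset of first occurrence
    records = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        if digest in seen:
            # Duplicate: reference the original block instead of storing it.
            records.append(("ref", seen[digest], len(block)))
        else:
            # First occurrence: remember the hash and keep the block itself.
            seen[digest] = offset
            records.append(("new", block))
    return records

records = dedup_compress(b"abcdabcdefgh")
```

With a 4-byte block size, the second `abcd` becomes a reference to offset 0, while `efgh` is stored as a new block. A real system would likely use much larger blocks; the tiny size here only keeps the example readable.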


20 Claims

1. A method for data deduplication compression in a streaming storage application, comprising compressing fully sequential data stored in a data repository to a sequential streaming storage, by:
- splitting fully sequential data into data blocks;
- hashing content of each data block and comparing each hash to an in-memory lookup table for a match, the in-memory lookup table storing all hashes that have been encountered during the compression of the fully sequential data;
- for each data block without a hash match, adding the data block as a new data block for compression of fully sequential data; and
- encoding duplicate data blocks using the in-memory lookup table into data segments.
- - 2. The method of claim 1 further comprising:
    - compressing partially sequential data and data from a random access storage stored in a data repository to the sequential streaming storage, wherein a reconstruction metadata and the in-memory lookup table for data from a randomly accessible storage is stored in a random access storage, the reconstruction metadata enabling listing all files contained in a data deduplication archive without streaming through the sequential streaming storage.
  - 3. The method of claim 2 further comprising:
    - decompressing fully and partially sequential data stored on the sequential streaming storage to the data repository, wherein the reconstruction metadata references previous data blocks and new data blocks, the previous data blocks read and stored in a decompressed output, the new data blocks contained in a current data segment.
  - 4. The method of claim 3 further comprising:
    - decompressing data from the random access storage stored on the sequential streaming storage to the data repository by scanning and decompressing the compressed data from the random access storage, and analyzing a priori information to determine when earlier data is going to be referenced, wherein the earlier data is not included in a partial decompression set.
  - 5. The method of claim 4, further comprising:
    - appending additional data to the data deduplication archive, wherein the in-memory lookup table is restored by reading the in-memory lookup table to identify data blocks in the additional data contained in the data deduplication archive, such that the in-memory lookup table is overwritten with the additional data; and
      
      storing a new in-memory lookup table to enable subsequent appending of additional data.
  - 6. The method of claim 5 wherein encoding the duplicate data blocks further comprises referencing the position and length of the original data block in the sequential data stream using the information from the lookup table.
  - 7. The method of claim 6 wherein each data segment encodes the length of the segment, followed by the reconstruction metadata, and followed by a unique data block.
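
The segment layout described in claims 6 and 7 (segment length, then reconstruction metadata, then a unique data block) and the single-pass decompression of claim 3 can be sketched as follows. The byte layout is an assumption made for illustration: a 4-byte big-endian segment length, a 4-byte reference count, one (8-byte offset, 4-byte length) pair per reference, and then the unique block. The names `encode_segment` and `decode_stream` are hypothetical. References point into the already-decompressed output, so the archive itself is only ever read front to back.

```python
import struct

def encode_segment(refs, unique_block: bytes) -> bytes:
    """Encode one segment: length, reconstruction metadata, unique block.

    `refs` is a list of (offset, length) pairs pointing at earlier data
    in the original (decompressed) stream.
    """
    meta = struct.pack(">I", len(refs))
    for offset, length in refs:
        meta += struct.pack(">QI", offset, length)
    body = meta + unique_block
    return struct.pack(">I", len(body)) + body

def decode_stream(stream: bytes) -> bytes:
    """Rebuild the original data by reading segments strictly in order."""
    out = bytearray()
    pos = 0
    while pos < len(stream):
        (seg_len,) = struct.unpack_from(">I", stream, pos)
        pos += 4
        end = pos + seg_len
        (n_refs,) = struct.unpack_from(">I", stream, pos)
        pos += 4
        for _ in range(n_refs):
            offset, length = struct.unpack_from(">QI", stream, pos)
            pos += 12
            # Previous block: copy it from the output produced so far.
            out += out[offset:offset + length]
        # New block: carried in the current segment itself.
        out += stream[pos:end]
        pos = end
    return bytes(out)

archive = encode_segment([], b"abcd") + encode_segment([(0, 4)], b"efgh")
restored = decode_stream(archive)
```

Placing the length first lets a reader skip or buffer a whole segment without parsing it, which suits a medium such as tape where seeking backward is expensive.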

8. A computer program product for data deduplication compression in a streaming storage application, the computer program product comprising:
- a computer readable storage medium having computer readable program code embodied therewith, wherein the computer readable program when executed on the computer causes the computer to provide a deduplication archive that enables storage of the archive to, and extraction from, a streaming storage medium by:
  - compressing fully sequential data stored in a data repository to a sequential streaming storage, by:
    - splitting fully sequential data into data blocks;
    - hashing content of each data block and comparing each hash to an in-memory lookup table for a match, the in-memory lookup table storing all hashes that have been encountered during the compression of the fully sequential data;
    - for each data block without a hash match, adding the data block as a new data block for compression of fully sequential data; and
    - encoding duplicate data blocks using the in-memory lookup table into data segments.
- - 9. The computer program product of claim 8 further comprising computer readable program code for performing:
    - compressing partially sequential data and data from a random access storage stored in a data repository to the sequential streaming storage, wherein a reconstruction metadata and the in-memory lookup table for data from the random access storage is stored in the random access storage, the reconstruction metadata enabling listing all files contained in a data deduplication archive without streaming through the sequential streaming storage.
  - 10. The computer program product of claim 9 further comprising computer readable program code for performing:
    - decompressing fully and partially sequential data stored on the sequential streaming storage to the data repository, wherein the reconstruction metadata references previous data blocks and new data blocks, the previous data blocks read and stored in a decompressed output, the new data blocks contained in a current data segment.
  - 11. The computer program product of claim 10 further comprising computer readable program code for performing:
    - decompressing data from the random access storage stored on the sequential streaming storage to the data repository by scanning and decompressing the compressed data from the random access storage, and analyzing a priori information to determine when earlier data is going to be referenced, wherein the earlier data is not included in a partial decompression set.
  - 12. The computer program product of claim 11 further comprising computer readable program code for performing:
    - appending additional data to the data deduplication archive, wherein the in-memory lookup table is restored by reading the in-memory lookup table to identify data blocks in the additional data contained in the data deduplication archive, such that the in-memory lookup table is overwritten with the additional data; and
      
      storing a new in-memory lookup table to enable subsequent appending of additional data.
  - 13. The computer program product of claim 12 further comprising computer readable program code for performing:
    - encoding the duplicate data block by referencing the position and length of the original data block in the sequential data stream using the information from the lookup table.
  - 14. The computer program product of claim 13 wherein each data segment encodes the length of the segment, followed by the reconstruction metadata, and followed by a unique data block.
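
The appending behavior in claims 5 and 12 (restore the lookup table, deduplicate the additional data against it, then store the updated table for the next append) might look like the sketch below. The JSON serialization, the hex-digest keys, and the names `append_blocks`, `save_table`, and `load_table` are assumptions made for illustration; the claims do not specify how the table is persisted.

```python
import hashlib
import json

def append_blocks(data: bytes, table: dict, base_offset: int, block_size: int = 4):
    """Deduplicate `data` against a previously built lookup table.

    `table` maps hex digests to offsets in the original stream and is
    updated in place. `base_offset` is the stream length before appending.
    """
    records = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest in table:
            records.append(("ref", table[digest], len(block)))
        else:
            table[digest] = base_offset + i
            records.append(("new", block))
    return records

def save_table(table: dict) -> str:
    """Serialize the lookup table so a later session can restore it."""
    return json.dumps(table)

def load_table(text: str) -> dict:
    """Restore a lookup table written by save_table."""
    return json.loads(text)

# First session: archive 8 bytes and persist the table.
table = {}
append_blocks(b"abcdefgh", table, base_offset=0)
saved = save_table(table)

# Later session: restore the table and append more data.
restored = load_table(saved)
appended = append_blocks(b"abcdwxyz", restored, base_offset=8)
saved = save_table(restored)  # store the new table for the next append
```

In the second session the block `abcd` is recognized from the restored table and encoded as a reference to offset 0, even though the original data was never re-read.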

15. A data deduplication compression system for a streaming storage application, comprising a deduplication module configured for compressing fully sequential data stored in a data repository to a sequential streaming storage, the deduplication module comprising a deduplication compression module configured for:
- splitting fully sequential data into data blocks;
  
  hashing content of each data block and comparing each hash to an in-memory lookup table for a match, the in-memory lookup table storing all hashes that have been encountered during the compression of the fully sequential data;
  
  for each data block without a hash match, adding the data block as a new data block for compression of fully sequential data; and
  
  encoding duplicate data blocks using the in-memory lookup table into data segments.
- - 16. The system of claim 15 wherein the deduplication module further comprises a short-range compression module configured for:
    - compressing partially sequential data and data from a random access storage stored in a data repository to the sequential streaming storage, wherein a reconstruction metadata and the in-memory lookup table for data from the random access storage is stored in the random access storage, the reconstruction metadata enabling listing all files contained in a data deduplication archive without streaming through the sequential streaming storage.
  - 17. The system of claim 16 wherein the deduplication module further comprises a short-range decompression module configured for:
    - decompressing fully and partially sequential data stored on the sequential streaming storage to the data repository, wherein the reconstruction metadata references previous data blocks and new data blocks, the previous data blocks read and stored in a decompressed output, the new data blocks contained in a current data segment.
  - 18. The system of claim 17 wherein the deduplication module further comprises a deduplication decompression module configured for:
    - decompressing data from the random access storage stored on the sequential streaming storage to the data repository by scanning and decompressing the compressed data from the random access storage, and analyzing a priori information to determine when earlier data is going to be referenced, wherein the earlier data is not included in a partial decompression set.
  - 19. The system of claim 18, wherein the deduplication module is further configured for:
    - appending additional data to the data deduplication archive, wherein the in-memory lookup table is restored by reading the in-memory lookup table to identify data blocks in the additional data contained in the data deduplication archive, such that the in-memory lookup table is overwritten with the additional data; and
      
      storing a new in-memory lookup table to enable subsequent appending of additional data.
  - 20. The system of claim 19 wherein the data deduplication module is further configured for encoding the duplicate data block by referencing the position and length of the original data block in the sequential data stream using the information from the lookup table.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Smith, Mark A.; Gruhl, Daniel F.; Pieper, Jan H.
Granted Patent: US 8,407,193 B2
US Class Current: 711/206
CPC Class Codes:
G06F 16/174   Redundancy elimination perf...
G06F 3/0608   Saving storage space on sto...
G06F 3/0641   De-duplication techniques
G06F 3/0682   Tape device
