Method and apparatus for data compression

US 8,380,688 B2
Filed: 11/06/2009
Issued: 02/19/2013
Est. Priority Date: 11/06/2009
Status: Expired due to Fees

Abstract

A method, system, and article for compressing an input stream of uncompressed data. The input stream is divided into one or more data segments. A hash is applied to a first data segment, and an offset and length are associated with this first segment. This hash, together with the offset and length data for the first segment, is stored in a hash table. Thereafter, a subsequent segment within the input stream is evaluated and compared with all other hash entries in the hash table, and a reference is written to a prior hash for an identified duplicate segment. The reference includes a new offset location for the subsequent segment. Similarly, a new hash is applied to an identified non-duplicate segment, with the new hash and its corresponding offset stored in the hash table. A compressed output stream of data is created from the hash table retained on storage media.
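
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `dedupe_stream`, the record layout, the choice of SHA-256 as the hash, and zlib as the final compressor are all assumptions made for illustration, and fixed-size segments are used only to keep the example short.

```python
import hashlib
import zlib


def dedupe_stream(data: bytes, segment_size: int = 4):
    """Split `data` into segments and keep only the first copy of each.

    Returns (hash_table, records, compressed), where:
      hash_table maps digest -> (offset, length) of the first occurrence,
      records lists one ("new" | "ref", offset, digest) tuple per segment,
      compressed is the zlib-compressed stream of unique segments.
    All names and the record layout are hypothetical.
    """
    hash_table = {}
    records = []
    unique = bytearray()
    for offset in range(0, len(data), segment_size):
        segment = data[offset:offset + segment_size]
        digest = hashlib.sha256(segment).hexdigest()
        if digest in hash_table:
            # Duplicate: record a reference to the prior hash, keeping
            # the new offset at which the segment reappears.
            records.append(("ref", offset, digest))
        else:
            # Unique: store hash, offset, and length, then stream the data.
            hash_table[digest] = (offset, len(segment))
            records.append(("new", offset, digest))
            unique += segment
    return hash_table, records, zlib.compress(bytes(unique))
```

Calling `dedupe_stream(b"ABCD" * 3 + b"1234")` stores two hash-table entries and emits two `"ref"` records for the repeated `ABCD` segments; only the bytes `ABCD1234` reach the compressor.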

20 Claims

1. A data compression method comprising:
- processing a first input stream of uncompressed data for a first file, including dividing the input stream into a plurality of segments, the size of the segments defined by a dividing algorithm determining segment boundaries based on data content;
  
  for each segment, applying a hash to a segment and associating an offset and length with the hashed segment for identifying the location and size of the segment;
  
  identifying whether the segment is unique by comparing the hash of the segment with all other hashes previously stored in a hash table;
  
  storing the hash and corresponding offset and length for the segment into the hash table responsive to determining that the segment is unique;
  
  streaming data for the unique segment into an output stream; and
  
  compressing data in the output stream, wherein an uncompressed segment of a second input stream is appended to a first compressed input stream based upon data in the hash table, including employing the hash table from the first file for creating a unique hash and associated offset for all non-duplicate segments.
- Dependent Claims (2, 3, 4, 5, 6, 7)
  - 2. The method of claim 1, further comprising identifying a length of each segment and storing the identified length with the offset for the segment.
  - 3. The method of claim 1, wherein the step of dividing the input stream into a plurality of segments employs a segmenting algorithm selected from the group consisting of: fixed block size, absent content knowledge, and content aware.
  - 4. The method of claim 1, further comprising storing all hash information from the hash table into an archive.
  - 5. The method of claim 4, further comprising appending compression of a second input stream to a compressed and archived first input stream.
  - 6. The method of claim 1, further comprising re-generating hashes from the compressed output stream, including re-processing data stored in a hash table archive to provide original hash information for adding a new file into an existing archive.
  - 7. The method of claim 1, further comprising a user setting the segment size.
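
The difference between fixed-block segmentation and boundaries chosen from the data content (claims 1 and 3) can be sketched as follows. This is an illustrative assumption, not the dividing algorithm of the patent, which the claims do not spell out: `fixed_segments` and `content_segments` are hypothetical names, and the boundary rule (cut where the sum of the last `window` bytes has its low bits clear) is a deliberately simple stand-in for a rolling hash.

```python
def fixed_segments(data: bytes, size: int):
    """Return (offset, length) pairs for fixed-size blocks."""
    return [(off, min(size, len(data) - off))
            for off in range(0, len(data), size)]


def content_segments(data: bytes, window: int = 4,
                     mask: int = 0x0F, max_len: int = 64):
    """Return (offset, length) pairs with boundaries chosen from content.

    A segment ends after byte i when it is at least `window` bytes long
    and the sum of the last `window` bytes satisfies (sum & mask) == 0,
    or when it reaches `max_len` bytes.
    """
    segments = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        length = i - start + 1
        if (length >= window and (rolling & mask) == 0) or length >= max_len:
            segments.append((start, length))
            start = i + 1
    if start < len(data):
        segments.append((start, len(data) - start))  # trailing partial segment
    return segments
```

Because the cut points depend on byte values rather than absolute position, identical runs of data tend to produce identical segments even when they sit at different offsets in two files, which is what makes hash comparison effective for finding duplicates.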

8. A system for data compression, comprising:
- a processor in communication with storage media;
  
  a first input stream of data for a first file local to the storage media configured to be processed for compression;
  
  a compression manager in communication with the first input stream, the manager to divide the first input stream into a plurality of segments, the size of the segments defined by a dividing algorithm determining segment boundaries based on data content;
  
  the compression manager to apply a hash to each segment, and an offset and length identifier associated with the hashed segment to identify the location and size of the hash;
  
  a director in communication with the compression manager, the director to identify whether the segment is unique by comparing the hash of the segment within the first file with all other hashes previously stored in a hash table;
  
  the compression manager to store the hash and corresponding offset and length for the segment into the hash table responsive to determining that the segment is unique;
  
  the compression manager to stream data for the unique segment into an output stream; and
  
  the compression manager to compress data in the output stream, wherein an uncompressed segment of a second input stream is appended to a first compressed input stream based upon data in the hash table, including employing the hash table from the first file to create a unique hash and associated offset for a non-duplicate segment.
- Dependent Claims (9, 10, 11, 12, 13, 14)
  - 9. The system of claim 8, further comprising the compression manager to identify a length of each segment and to store the identified length with the offset for the segment.
  - 10. The system of claim 8, wherein the division of the input stream into a plurality of segments by the manager employs a segmenting algorithm selected from the group consisting of: fixed block size, absent content knowledge, and content aware.
  - 11. The system of claim 8, further comprising the director to store all hash information from the hash table into an archive.
  - 12. The system of claim 11, further comprising the compression manager to write the archive to a storage device.
  - 13. The system of claim 8, further comprising a de-compression manager in communication with the compression manager, the de-compression manager to re-generate a hash from the compressed output stream and to re-process data stored in a hash table archive to provide original hash information for adding a new file into an existing archive.
  - 14. The system of claim 8, further comprising a user to set the segment size.
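
The final element of claim 8, in which segments of a second input stream are appended to an already compressed first stream using the hash table built for the first file, might look roughly like the sketch below. The `Archive` class, its attributes, the per-file zlib blobs, and the fixed segment size are all hypothetical choices for illustration and are not taken from the patent.

```python
import hashlib
import zlib


class Archive:
    """Toy deduplicating archive shared across files (illustrative only)."""

    def __init__(self, segment_size: int = 4):
        self.segment_size = segment_size
        self.hash_table = {}  # digest -> (offset, length) in the unique store
        self.blobs = []       # one compressed run of unique data per file
        self.store_len = 0    # uncompressed size of the unique store so far
        self.files = {}       # file name -> ordered list of segment digests

    def add_file(self, name: str, data: bytes) -> int:
        """Append `data`, storing only segments not already in the table.

        Returns the number of new (non-duplicate) bytes appended.
        """
        fresh = bytearray()
        recipe = []
        for off in range(0, len(data), self.segment_size):
            segment = data[off:off + self.segment_size]
            digest = hashlib.sha256(segment).hexdigest()
            if digest not in self.hash_table:
                # Non-duplicate: new hash with its offset in the store.
                self.hash_table[digest] = (self.store_len + len(fresh),
                                           len(segment))
                fresh += segment
            recipe.append(digest)
        self.blobs.append(zlib.compress(bytes(fresh)))
        self.store_len += len(fresh)
        self.files[name] = recipe
        return len(fresh)

    def read_file(self, name: str) -> bytes:
        """Rebuild a file from the unique store using its digest list."""
        store = b"".join(zlib.decompress(blob) for blob in self.blobs)
        out = bytearray()
        for digest in self.files[name]:
            off, length = self.hash_table[digest]
            out += store[off:off + length]
        return bytes(out)
```

With a 4-byte segment size, adding `b"AAAABBBBCCCC"` and then `b"BBBBDDDDAAAA"` appends only the four bytes `DDDD` for the second file, since its other segments are already present in the hash table from the first.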

15. An article for compressing data, comprising:
- a first input stream of uncompressed data for a first file;
  
  a computer readable carrier including computer program instructions configured to compress data of the first file, the instructions comprising:
  
  instructions to divide the input stream into a plurality of segments, the size of the segments defined by a dividing algorithm determining segment boundaries based on data content;
  
  for each segment, instructions to apply a hash to the segment and to associate an offset with the hashed segment to identify the location of the hash;
  
  instructions to identify whether the segment is unique by comparing the hash of the segment within the same file with all other hashes in the hash table; and
  
  instructions to store the hash and corresponding offset and length for the segment into the hash table responsive to determining that the segment is unique;
  
  instructions to stream data for each unique segment into an output stream; and
  
  instructions to compress data, wherein an uncompressed segment of a second input stream is appended to a first compressed input stream based upon data in the hash table, including employing the hash table from the first file for creating a unique hash and associated offset for a non-duplicate segment.
- Dependent Claims (16, 17, 18, 19, 20)
  - 16. The article of claim 15, further comprising instructions to identify a length of each segment and to store the identified length with the offset for the segment.
  - 17. The article of claim 15, wherein instructions to divide the input stream into a plurality of segments employs a segmenting algorithm selected from the group consisting of: fixed block size, absent content knowledge, and content aware.
  - 18. The article of claim 15, further comprising instructions to store all hash information from the hash table into an archive.
  - 19. The article of claim 18, further comprising instructions to write the archive to a storage device.
  - 20. The article of claim 15, further comprising instructions to re-generate hashes from the compressed output stream, including instructions to re-process data stored in a hash table archive to provide original hash information for adding a new file into an existing archive.
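
Claims 6, 13, and 20 refer to re-generating hashes from the compressed output so that a new file can be added to an existing archive. One possible reading, sketched here under assumptions, is that the archive keeps only each unique segment's offset and length and the hashes are recomputed on demand. The function name `rebuild_hash_table`, the index layout, and the use of SHA-256 and zlib are illustrative and not drawn from the patent.

```python
import hashlib
import zlib


def rebuild_hash_table(compressed: bytes, index):
    """Recompute digest -> (offset, length) from archived segment data.

    `compressed` is the zlib-compressed stream of unique segments and
    `index` is an iterable of (offset, length) pairs into the
    decompressed stream. Both layouts are hypothetical.
    """
    store = zlib.decompress(compressed)
    table = {}
    for offset, length in index:
        segment = store[offset:offset + length]
        table[hashlib.sha256(segment).hexdigest()] = (offset, length)
    return table
```

The rebuilt table can then be consulted when segmenting a new file, so that only segments whose hashes are absent need to be appended to the archive.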

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Gruhl, Daniel F.; Pieper, Jan H.; Smith, Mark A.
Primary Examiner(s): COLAN, GIOVANNA B
Application Number: US12/613,597
Publication Number: US 20110113016A1
Time in Patent Office: 1,201 Days
Field of Search: 711/112, 711/108, 711/216, 726/22, 382/137, 385/295, 709/206, 707/698, 707/693
US Class Current: 707/698
CPC Class Codes: H03M 7/30 Compression speech analysis...
