Method and apparatus for data compression
First Claim
1. A data compression method comprising:
- processing a first input stream of uncompressed data for a first file, including dividing the input stream into a plurality of segments, the size of the segments defined by a dividing algorithm determining segment boundaries based on data content;
for each segment, applying a hash to a segment and associating an offset and length with the hashed segment for identifying the location and size of the segment;
identifying whether the segment is unique by comparing the hash of the segment with all other hashes previously stored in a hash table;
storing the hash and corresponding offset and length for the segment into the hash table responsive to determining that the segment is unique;
streaming data for the unique segment into an output stream; and
compressing data in the output stream, wherein an uncompressed segment of a second input stream is appended to a first compressed input stream based upon data in the hash table, including employing the hash table from the first file for creating a unique hash and associated offset for all non-duplicate segments.
1 Assignment
0 Petitions
Accused Products
Abstract
A method, system, and article for compressing an input stream of uncompressed data. The input stream is divided into one or more data segments. A hash is applied to a first data segment, and an offset and length are associated with this first segment. This hash, together with the offset and length data for the first segment, is stored in a hash table. Thereafter, a subsequent segment within the input stream is evaluated and compared with all other hash entries in the hash table, and a reference is written to a prior hash for an identified duplicate segment. The reference includes a new offset location for the subsequent segment. Similarly, a new hash is applied to an identified non-duplicate segment, with the new hash and its corresponding offset stored in the hash table. A compressed output stream of data is created from the hash table retained on storage media.
-
Citations
20 Claims
-
1. A data compression method comprising:
-
processing a first input stream of uncompressed data for a first file, including dividing the input stream into a plurality of segments, the size of the segments defined by a dividing algorithm determining segment boundaries based on data content; for each segment, applying a hash to a segment and associating an offset and length with the hashed segment for identifying the location and size of the segment; identifying whether the segment is unique by comparing the hash of the segment with all other hashes previously stored in a hash table; storing the hash and corresponding offset and length for the segment into the hash table responsive to determining that the segment is unique; streaming data for the unique segment into an output stream; and compressing data in the output stream, wherein an uncompressed segment of a second input stream is appended to a first compressed input stream based upon data in the hash table, including employing the hash table from the first file for creating a unique hash and associated offset for all non-duplicate segments. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A system for data compression, comprising:
-
a processor in communication with storage media;
a first input stream of data for a first file local to the storage media configured to be processed for compression;a compression manager in communication with the first input stream, the manager to divide the first input stream into a plurality of segments, the size of the segments defined by a dividing algorithm determining segment boundaries based on data content; the compression manager to apply a hash to each segment, and an offset and length identifier associated with the hashed segment to identify the location and size of the hash; a director in communication with the compression manager, the director to identify whether the segment is unique by comparing the hash of the segment within the first file with all other hashes previously stored in a hash table; the compression manager to store the hash and corresponding offset and length for the segment into the hash table responsive to determining that the segment is unique; the compression manager to stream data for the unique segment into an output stream; and the compression manager to compress data in the output stream, wherein an uncompressed segment of a second input stream is appended to a first compressed input stream based upon data in the hash table table, including employing the hash table from the first file to create a unique hash and associated offset for a non-duplicate segment. - View Dependent Claims (9, 10, 11, 12, 13, 14)
-
-
15. An article for compressing data, comprising:
-
a first input stream of uncompressed data for a first file; a computer readable carrier including computer program instructions configured to compress data of the first file, the instructions comprising;
instructions to divide the input stream into a plurality of segments, the size of the segments defined by a dividing algorithm determining segment boundaries based on data content;for each segment, instructions to apply a hash to the segment and to associate an offset with the hashed segment to identify the location of the hash; instructions to identify whether the segment is unique by comparing the hash of the segment within the same file with all other hashes in the hash table; and
instructions to store the hash and corresponding offset and length for the segment into the hash table responsive to determining that the segment is unique;instructions to stream data for each unique segment into an output stream; and instructions to compress data, wherein an uncompressed segment of a second input stream is appended to a first compressed input stream based upon data in the hash table table, including employing the hash table from the first file for creating a unique hash and associated offset for a non-duplicate segment. - View Dependent Claims (16, 17, 18, 19, 20)
-
Specification