Compression of stream data using a hierarchically-indexed database
First Claim
1. A method for compressing data using a hierarchically-indexed database comprising:
- accessing a contiguous block of data of an input data stream;
computing a weak checksum of the contiguous block of data;
computing a hash of the weak checksum;
accessing a hash table of a compression dictionary against the hash of the weak checksum to identify a hash table entry associated with a bucket of one or more data block records, wherein the one or more data block records each correspond to a data block and comprise a strong checksum and a data location of the literal data of the corresponding data block;
if a hash table entry is identified, computing a strong checksum of the block of data and finding a data block record of the identified bucket of data block records having a strong checksum matching the strong checksum of the block of data; and
if a data block record is identified, outputting a reference value for the contiguous block of data.
12 Assignments
0 Petitions
Accused Products
Abstract
The present invention, in particular embodiments, is directed to methods, apparatuses and systems that provide an efficient compression technique for data streams transmitted to storage devices or over networks to remote hosts. Local storage as well as network transmission of streams is made more efficient by awareness and utilization of repeated sequences of data blocks. Such data blocks can be placed in a dictionary on persistent storage and shared across all streams. The dictionary is hierarchically indexed (two or more levels of indexing) to combine high efficiency search with efficient access to the stored data blocks. Additionally, data blocks, in particular implementations, are stored sequentially in order to improve overall performance.
64 Citations
17 Claims
-
1. A method for compressing data using a hierarchically-indexed database comprising:
-
accessing a contiguous block of data of an input data stream; computing a weak checksum of the contiguous block of data; computing a hash of the weak checksum; accessing a hash table of a compression dictionary against the hash of the weak checksum to identify a hash table entry associated with a bucket of one or more data block records, wherein the one or more data block records each correspond to a data block and comprise a strong checksum and a data location of the literal data of the corresponding data block; if a hash table entry is identified, computing a strong checksum of the block of data and finding a data block record of the identified bucket of data block records having a strong checksum matching the strong checksum of the block of data; and if a data block record is identified, outputting a reference value for the contiguous block of data. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A method for decompressing data using a hierarchically-indexed database comprising:
-
accessing a portion of an input data stream; determining if the portion is a literal data block or a reference value, wherein the reference value comprises a weak checksum and a strong checksum; if the portion is a reference value then; accessing a hash table of a compression dictionary against the hash of the weak checksum to identify a hash table entry associated with a bucket of one or more data block records, wherein the one or more data block records each correspond to a data block and comprise a strong checksum and a data location of the literal data of the corresponding data block; if a hash table entry is identified, accessing the identified bucket of data block records against the strong checksum of the reference value to find a data block record having a matching strong checksum; and if a data block record is found, outputting the literal data of the data block corresponding to the found data block record; if the portion is a literal data block, then outputting the literal data block. - View Dependent Claims (9, 10, 11, 12)
-
-
13. An apparatus for compressing data using a hierarchically-indexed database comprising:
-
one or more processors; one or more network interfaces; a memory; a software application, physically stored in the memory comprising instructions operable to cause the one or more processors and the apparatus to; access a contiguous block of data of an input data stream; compute a weak checksum of the contiguous block of data; compute a hash of the weak checksum; access a hash table of a compression dictionary against the hash of the weak checksum to identify a hash table entry associated with a bucket of one or more data block records, wherein the one or more data block records each correspond to a data block and comprise a strong checksum and a data location of the literal data of the corresponding data block; if a hash table entry is identified, compute a strong checksum of the block of data and finding a data block record having a strong checksum matching the strong checksum of the block of data; and if a data block record is identified, output a reference value for the contiguous block of data. - View Dependent Claims (14, 15)
-
-
16. An apparatus, comprising:
-
a first memory operative to store collected data blocks and data block records; a second memory operative to store written data blocks and corresponding data block records; an index comprising one or more weak checksum entries, wherein each weak checksum entry comprises a weak checksum value and a pointer to a second index entry; and
wherein each weak checksum entry maps to a data block record and corresponding data block in the second memory;a hash table comprising weak checksum hash entries, wherein each weak checksum hash entry corresponds to a bucket of entries in the index and includes a pointer to a weak checksum entry in the index; and a compression module operative to; access a contiguous block of data of an input data stream; compute a weak checksum of the contiguous block of data; access the hash table against a hash of the computed weak checksum to locate a matching weak checksum hash entry; if a match in the hash table is identified, search the index for a matching weak checksum entry using the pointers in the index; if no match is found, output the contiguous block of data; compute weak and strong checksums for the contiguous block of data; create a data block record comprising the weak and strong checksums; aggregate the contiguous block of data and the data block record in the first memory; write one or more blocks of data and corresponding data block records aggregated in the first memory to the second memory upon the occurrence of a condition; and write weak checksums corresponding to the aggregated data blocks to the index as a contiguous sequence of entries. - View Dependent Claims (17)
-
Specification