Compression of Stream Data Using a Hierarchically-Indexed Database
First Claim
1. A method for compressing data using a hierarchically-indexed database comprising:
- accessing a contiguous block of data in an input data stream;
computing a weak checksum of the contiguous block of data;
accessing a first memory storing a first level index of a compression dictionary against the weak checksum to identify an entry associated with a bucket of one or more data block records;
if an entry in the first level index is identified, computing a strong checksum of the block of data and accessing a second level index stored in a second memory to identify a data block record of the identified bucket of data block records having a strong checksum matching the strong checksum of the block of data, wherein the second level index comprises one or more entries, wherein each entry of the second level index maps to a data block record, each data block record comprising a strong checksum and a pointer to a literal data block stored in the second memory; and
if a data block record is identified, outputting a reference value for the contiguous block of data.
12 Assignments
0 Petitions
Accused Products
Abstract
The present invention, in particular embodiments, is directed to methods, apparatuses and systems that provide an efficient compression technique for data streams transmitted to storage devices or over networks to remote hosts. Local storage as well as network transmission of streams is made more efficient by awareness and utilization of repeated sequences of data blocks. Such data blocks can be placed in a dictionary on persistent storage and shared across all streams. The dictionary is hierarchically indexed (two or more levels of indexing) to combine high efficiency search with efficient access to the stored data blocks. Additionally, data blocks, in particular implementations, are stored sequentially in order to improve overall performance.
38 Citations
33 Claims
-
1. A method for compressing data using a hierarchically-indexed database comprising:
-
accessing a contiguous block of data in an input data stream; computing a weak checksum of the contiguous block of data; accessing a first memory storing a first level index of a compression dictionary against the weak checksum to identify an entry associated with a bucket of one or more data block records; if an entry in the first level index is identified, computing a strong checksum of the block of data and accessing a second level index stored in a second memory to identify a data block record of the identified bucket of data block records having a strong checksum matching the strong checksum of the block of data, wherein the second level index comprises one or more entries, wherein each entry of the second level index maps to a data block record, each data block record comprising a strong checksum and a pointer to a literal data block stored in the second memory; and if a data block record is identified, outputting a reference value for the contiguous block of data. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
-
-
10. A method for compressing data using a hierarchically-indexed database comprising:
-
accessing a contiguous block of data in an input data stream; computing a weak checksum of the contiguous block of data; accessing a first memory storing a first level index of a compression dictionary against the weak checksum to identify an entry associated with a bucket of one or more data block records; if an entry in the first level index is identified, computing a strong checksum of the block of data and accessing a second level index of the compression dictionary to locate a data block record of the identified bucket of data block records having a strong checksum matching the strong checksum of the block of data, wherein the second level index comprises one or more entries, wherein each entry of the second level index maps to a data block record, each data block record comprising a strong checksum and a pointer to a literal data block stored in a second memory; collecting a sequence of contiguous data blocks in the input data stream for which no data block records have been identified; and inserting the contiguous data blocks of the sequence into the compression dictionary, wherein the inserting the contiguous data blocks in the compression dictionary comprises writing literal data of the contiguous data blocks of the sequence to the second memory in proximity. - View Dependent Claims (11, 12, 13, 14)
-
-
15. A method comprising:
-
accessing an input data stream; matching contiguous data blocks in the input stream to one or more data block entries of a compression dictionary maintained in a memory; collecting a sequence of contiguous data blocks in the input data stream to which no entries in the compression dictionary have been matched, wherein the sequence of contiguous data blocks matches an order of the contiguous data blocks in the input data stream; and inserting the contiguous data blocks of the sequence into the compression dictionary maintained in the memory, wherein the inserting the contiguous data blocks in the compression dictionary comprises writing literal data of the contiguous data blocks in proximity according to the collected sequence. - View Dependent Claims (16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32)
-
-
33. An apparatus comprising:
-
one or more processors; one or more network interfaces; a memory; a software application, physically stored in the memory, comprising computer readable instructions operable, when executed, to cause the one or more processors and the apparatus to; access an input data stream; match contiguous data blocks in the input stream to one or more data block entries of a compression dictionary maintained in a memory; collect a sequence of contiguous data blocks in the input data stream to which no entries in the compression dictionary have been matched, wherein the sequence of contiguous data blocks matches an order of the contiguous data blocks in the input data stream; and insert the contiguous data blocks of the sequence into the compression dictionary maintained in the memory, wherein the inserting the contiguous data blocks in the compression dictionary comprises writing literal data of the contiguous data blocks in proximity according to the collected sequence.
-
Specification