Block-wise adaptive statistical data compressor
First Claim
1. A method of compressing data blocks having a plurality of characters, wherein the plurality of characters form an alphabet of N characters, comprising the steps of:
- assigning the N characters of the alphabet into M super-character groups based upon the expected frequency of occurrence of each of the N characters in the data block, wherein M is less than N;
accumulating statistics in the M super-character groups regarding the frequency of occurrence of each character in the data block;
generating a plurality of super-character codewords that model the frequencies of occurrence for each character, wherein each super-character codeword includes a variable length prefix value that identifies the super-character group and a fixed length index value that identifies the particular character in the group; and
replacing the characters with the super-character codewords to form a compressed data block.
2 Assignments
0 Petitions
Accused Products
Abstract
A block-wise adaptive statistical data compressor is disclosed that operates by replacing characters in a data block with super-character codewords comprising a variable length prefix and a fixed length index. The codewords are determined by treating a plurality of groups of characters as super-character groups and then adapting the codewords, for each data block, based upon the actual frequency of occurrence of the characters in each group. The super-character prefix value identifies the group to which a particular character belongs, and the index value identifies the individual character of the group. By grouping and indexing the characters into these super-character groups, the present invention models a particular data block using a fraction of the information generally required by a fixed statistical compressor. Also disclosed are multi-stage lossless block data compressors that include the block-wise adaptive statistical compressor and also include a clustering stage and a reordering stage. The clustering stage clusters like characters into similar locations within the data block, and the reordering stage reorders the data to generate an expected skew in the frequency distribution of characters in the data block so that the block can be more efficiently compressed by the block-wise adaptive statistical compressor.
-
Citations
29 Claims
-
1. A method of compressing data blocks having a plurality of characters, wherein the plurality of characters form an alphabet of N characters, comprising the steps of:
-
assigning the N characters of the alphabet into M super-character groups based upon the expected frequency of occurrence of each of the N characters in the data block, wherein M is less than N; accumulating statistics in the M super-character groups regarding the frequency of occurrence of each character in the data block; generating a plurality of super-character codewords that model the frequencies of occurrence for each character, wherein each super-character codeword includes a variable length prefix value that identifies the super-character group and a fixed length index value that identifies the particular character in the group; and replacing the characters with the super-character codewords to form a compressed data block. - View Dependent Claims (2, 3, 4)
-
-
5. A block-wise adaptive statistical compressor for compressing a data block having a plurality of characters, the plurality of characters forming an alphabet of N characters comprising:
-
means for assigning the N characters to M super-character groups based upon the expected frequency of occurrence for each character, wherein M is less than N; means for accumulating statistics in the super-character groups regarding the frequency of occurrence of each character in the data block; means for generating a plurality of super-character codewords that model the frequencies of occurrence for each character; and means for replacing the characters with the super-character codewords to form a compressed data block. - View Dependent Claims (6, 7, 8, 9, 10, 11, 12)
-
-
13. A multi-stage data compressor for compressing a data file, comprising:
-
means for partitioning the data file into blocks of characters; a clustering stage for transforming each data block into a clustered block; a reordering stage for reordering each clustered block into a reordered block; and a block-wise adaptive statistical data compressor for compressing the reordered blocks of data, comprising; means for assigning the characters to a plurality of super-character groups based upon the expected frequency of occurrence for each character, wherein at least one of the super-character groups is assigned a plurality of characters; means for accumulating statistics in the super-character groups regarding the frequency of occurrence of each character in the data block; means for generating a plurality of super-character codewords that model the frequencies of occurrence for each character; and means for replacing the characters with the super-character codewords to form a compressed data block. - View Dependent Claims (14, 15, 16, 17, 18, 19, 20, 21, 22, 24, 25)
-
-
23. The multi-stage data compressor of 13, wherein the means for assigning characters to particular super-character groups assigns a low number of frequently occurring characters to particular super-character groups and a high number of infrequently occurring characters to other super-character groups.
-
26. A method of compressing a data file comprising the steps of:
-
partitioning the data file into a plurality of data blocks, wherein each data block includes a plurality of characters; clustering the characters in the data block so that similar characters are grouped together within the block; reordering the data block by replacing the characters with N numerical values having a skewed frequency distribution; and adaptively compressing each data block by accumulating the numerical values into M super-character groups, wherein M is less than N, and the N numerical values are assigned to the M super-character groups based upon their expected frequency of occurrence in the data block, and generating super-character codewords that replace the numerical values in order to compress the data block. - View Dependent Claims (27)
-
-
28. A method of compressing a data file, comprising:
-
partitioning the data file into data blocks containing N bytes; for each data block in the data file; clustering the N bytes in the data block using a clustering algorithm; reordering the N bytes of data in the clustered data block; adaptively compressing the reordered data block using a statistically coder; determining whether the adaptively compressed data block is smaller than the original data block; and if the adaptively compressed data block is smaller than the original data block, then outputting a header byte indicating that the data block is compressed along with the compressed data block, else outputting a header byte indicating that the data block is not compressed along with the original data block.
-
-
29. A method of compressing a data block comprising a plurality of characters, wherein the plurality of characters are associated with an alphabet of N characters, comprising the steps of:
-
forming a super-character counting array having M elements, wherein each element of the super-character counting array is a super-character group associated with one or more characters in the alphabet, and wherein M is less than N; accumulating statistics in the super-character counting array regarding the frequency of occurrence of the characters in the data block by incrementing the elements in the array based on the occurrence of a particular character in the data block that is associated with a particular super-character group; selecting a normalization value for one of the elements in the array and normalizing the other elements in the array based on the normalization value; generating super-character codewords for each character associated with the super-character groups by selecting a variable length prefix value and a fixed length index value for each character, wherein the length of the variable length prefix value when combined with the length of the fixed length index value is less than the length of an uncompressed character; and compressing the data block by replacing the characters with the super-character codewords.
-
Specification