Method for compressed data with reduced dictionary sizes by coding value prefixes
First Claim
1. A computer-based method to compress a dataset of values from one or more domains, said computer-based method implemented in computer readable program code stored in computer memory, said computer-based method comprising the steps of:
- a. receiving input data;
b. identifying a dictionary size budget;
c. scanning a sample of input data values to identify occurrence frequencies of each value prefix in said sample;
d. based on said identified dictionary size budget in (b) and said identified occurrence frequencies in (c), choosing a subset of value prefixes for which codewords are to be assigned, wherein a choice of assigning prefix codes to common prefixes associated with a subset of said set of data values is optimized to provide maximum compression for a given constraint on the number of codewords;
e. building a dictionary and assigning prefix codes to said chosen subset of value prefixes in (d);
f. rescanning said input data wherein, for each given value v, finding its longest prefix u in said dictionary, chosen in step (e) and outputting a concatenation of codeword(u) and w, wherein w is the suffix of value v, said suffix comprising remaining uncoded bits of said given value, andwherein said optimization is performed via a dynamic programming algorithm, wherein, in said dynamic programming algorithm, best prefixes are chosen based on whether a given node v is a leaf node or whether the given node v is not a leaf node.
2 Assignments
0 Petitions
Accused Products
Abstract
The speed of dictionary based decompression is limited by the cost of accessing random values in the dictionary. If the size of the dictionary can be limited so it fits into cache, decompression is made to be CPU bound rather than memory bound. To achieve this, a value prefix coding scheme is presented, wherein value prefixes are stored in the dictionary to get good compression from small dictionaries. Also presented is an algorithm that determines the optimal entries for a value prefix dictionary. Once the dictionary fits in cache, decompression speed is often limited by the cost of mispredicted branches during Huffman code processing. A novel way is presented to quantize Huffman code lengths to allow code processing to be performed with few instructions, no branches, and very little extra memory. Also presented is an algorithm for code length quantization that produces the optimal assignment of Huffman codes and show that the adverse effect of quantization on the compression ratio is quite small.
268 Citations
15 Claims
-
1. A computer-based method to compress a dataset of values from one or more domains, said computer-based method implemented in computer readable program code stored in computer memory, said computer-based method comprising the steps of:
-
a. receiving input data; b. identifying a dictionary size budget; c. scanning a sample of input data values to identify occurrence frequencies of each value prefix in said sample; d. based on said identified dictionary size budget in (b) and said identified occurrence frequencies in (c), choosing a subset of value prefixes for which codewords are to be assigned, wherein a choice of assigning prefix codes to common prefixes associated with a subset of said set of data values is optimized to provide maximum compression for a given constraint on the number of codewords; e. building a dictionary and assigning prefix codes to said chosen subset of value prefixes in (d); f. rescanning said input data wherein, for each given value v, finding its longest prefix u in said dictionary, chosen in step (e) and outputting a concatenation of codeword(u) and w, wherein w is the suffix of value v, said suffix comprising remaining uncoded bits of said given value, and wherein said optimization is performed via a dynamic programming algorithm, wherein, in said dynamic programming algorithm, best prefixes are chosen based on whether a given node v is a leaf node or whether the given node v is not a leaf node. - View Dependent Claims (3, 4, 5, 6, 7)
-
-
2. An article of manufacture comprising computer usable medium implementing a computer-based method to compress a dataset of values from one or more domains, said computer-based method implemented in computer readable program code stored in computer memory, said medium comprising:
-
a. computer readable program code aiding in receiving input data; b. computer readable program code aiding in identifying a dictionary size budget; c. computer readable program code aiding in scanning a sample of input data values to identify occurrence frequencies of each value prefix in said sample; d. based on said identified dictionary size budget in (b) and said identified occurrence frequencies in (c), computer readable program code aiding in choosing a subset of value prefixes for which codewords are to be assigned, wherein a choice of assigning prefix codes to common prefixes associated with a subset of said set of data values is optimized to provide maximum compression for a given constraint on the number of codewords; e. computer readable program code aiding in building a dictionary and assigning prefix codes to said chosen subset of value prefixes in (d); f. computer readable program code aiding in rescanning said input data wherein, for each given value v, finding its longest prefix u in said dictionary, chosen in step (e) and outputting a concatenation of codeword(u) and w, wherein w is the suffix of value v, said suffix comprising remaining uncoded bits of said given value, and wherein said optimization is performed via a dynamic programming algorithm, wherein, in said dynamic programming algorithm, best prefixes are chosen based on whether a given node v is a leaf node or whether the given node v is not a leaf node. - View Dependent Claims (8, 9, 10, 11, 12)
-
-
13. A computer-based method to compress a set of data values from one or more domains, said computer-based method implemented in computer readable program code stored in computer memory, said computer-based method comprising the steps of:
-
a. building a dictionary and assigning prefix codes to common prefixes associated with a subset of said set of data values, and b. coding a given value in said set of data values as a combination of a codeword for said given value'"'"'s longest prefix in said dictionary, followed by a suffix associated with said given value, said suffix comprising remaining uncoded bits of said given value, wherein said optimization is performed via a greedy algorithm, and the method further comprises the step of constructing a prefix tree, or trie, said trie representing statistical information about said input data, and, a leaf in said trie represents a complete input value, while an intermediate node represents the common prefix for the set of input values which are represented in a branch of the trie under that node, with statistical information being stored in the form of node labels, said greedy algorithm; (1) computing a benefit for each prefix that measures the value of assigning a codeword to that prefix; (2) choosing a node in said trie with the best benefit; and iteratively repeating steps (1)-(2) until a dictionary budget has been exhausted. - View Dependent Claims (14, 15)
-
Specification