Method for compressed data with reduced dictionary sizes by coding value prefixes

US 7,609,179 B2
Filed: 01/08/2008
Issued: 10/27/2009
Est. Priority Date: 01/08/2008
Status: Active Grant


2 Assignments

0 Petitions
Abstract

The speed of dictionary-based decompression is limited by the cost of accessing random values in the dictionary. If the dictionary is kept small enough to fit into cache, decompression becomes CPU bound rather than memory bound. To achieve this, a value prefix coding scheme is presented in which value prefixes are stored in the dictionary, so that small dictionaries still compress well. Also presented is an algorithm that determines the optimal entries for a value prefix dictionary. Once the dictionary fits in cache, decompression speed is often limited by the cost of mispredicted branches during Huffman code processing. A novel way is presented to quantize Huffman code lengths so that code processing can be performed with few instructions, no branches, and very little extra memory. Also presented is an algorithm for code length quantization that produces the optimal assignment of Huffman codes, and the adverse effect of quantization on the compression ratio is shown to be quite small.
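The value prefix coding idea in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names `encode` and `decode`, the representation of values as fixed-width bit strings, the sample codebook, and the assumption that the dictionary always contains the empty prefix as a fallback entry are all illustrative choices.

```python
def encode(values, codebook):
    """Code each value as codeword(longest dictionary prefix) + remaining suffix bits.

    Assumption: codebook maps bit-string prefixes to prefix-free codewords and
    contains the empty prefix "" so that every value has some match.
    """
    out = []
    for v in values:
        # Longest dictionary prefix u of v.
        u = max((p for p in codebook if v.startswith(p)), key=len)
        # Emit codeword(u) followed by the uncoded suffix w.
        out.append(codebook[u] + v[len(u):])
    return "".join(out)


def decode(bits, codebook, width):
    """Invert encode(); width is the fixed bit length of every value (assumed known)."""
    by_code = {code: prefix for prefix, code in codebook.items()}
    values, pos = [], 0
    while pos < len(bits):
        # Grow the window until it matches a codeword (codewords are prefix-free).
        end = pos + 1
        while bits[pos:end] not in by_code:
            end += 1
            if end > len(bits):
                raise ValueError("truncated or invalid codeword")
        prefix = by_code[bits[pos:end]]
        suffix_len = width - len(prefix)
        values.append(prefix + bits[end:end + suffix_len])
        pos = end + suffix_len
    return values
```

With the hypothetical codebook `{"": "11", "0000": "0", "0001": "10"}` and 8-bit values `00001010`, `00010001`, `11110000`, the encoder emits 21 bits instead of 24 while the dictionary holds only three entries.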

268 Citations

15 Claims

1. A computer-based method to compress a dataset of values from one or more domains, said computer-based method implemented in computer readable program code stored in computer memory, said computer-based method comprising the steps of:
- a. receiving input data;
  
  b. identifying a dictionary size budget;
  
  c. scanning a sample of input data values to identify occurrence frequencies of each value prefix in said sample;
  
  d. based on said identified dictionary size budget in (b) and said identified occurrence frequencies in (c), choosing a subset of value prefixes for which codewords are to be assigned, wherein a choice of assigning prefix codes to common prefixes associated with a subset of said set of data values is optimized to provide maximum compression for a given constraint on the number of codewords;
  
  e. building a dictionary and assigning prefix codes to said chosen subset of value prefixes in (d);
  
  f. rescanning said input data wherein, for each given value v, finding its longest prefix u in said dictionary, chosen in step (e) and outputting a concatenation of codeword(u) and w, wherein w is the suffix of value v, said suffix comprising remaining uncoded bits of said given value, and wherein said optimization is performed via a dynamic programming algorithm, wherein, in said dynamic programming algorithm, best prefixes are chosen based on whether a given node v is a leaf node or whether the given node v is not a leaf node.
- Dependent claims (3, 4, 5, 6, 7)
  - 3. The computer-based method of claim 1, wherein best prefixes are chosen as per the following benefit function:

    for the case when a given node v is a leaf node,

    $$\operatorname{benefit}(v,K,L)=\max\Bigl\{\max\bigl\{\operatorname{benefit}(\operatorname{left}(v),i,L+1)+\operatorname{benefit}(\operatorname{right}(v),K-i,L+1)\mid 0\le i\le K\bigr\},\ \max\bigl\{\operatorname{benefit}(\operatorname{left}(v),i-1,1)+\operatorname{benefit}(\operatorname{right}(v),K-i,1)\mid 1\le i\le K\bigr\}\Bigr\};$$

    or for the case when a given node v is not a leaf node, $\operatorname{benefit}(v,K,L)=0$ if $K=1$, or $\operatorname{benefit}(v,K,L)=\operatorname{benefit}(v)$ if $K\ge 1$;

    wherein benefit of a node v is given by:

    $$\operatorname{benefit}(v)=\operatorname{frequency}(v)\times\operatorname{neck}(v)-\operatorname{frequency}(v)\log\bigl(1/\operatorname{frequency}(v)\bigr);$$

    wherein K represents a given dictionary budget, neck(v) represents the neck of a node v, L represents a neck length associated with a node, and left(v) and right(v) represent left and right nodes with respect to node v.
  - 4. The computer-based method of claim 1, wherein the method additionally comprises the step of applying Huffman coding and assigning shorter codewords to more frequent prefix codes.
  - 5. The computer-based method of claim 1, wherein a choice of assigning prefix codes to common prefixes associated with a subset of said set of data values is optimized to provide maximum compression for a given constraint on the number of codewords.
  - 6. The computer-based method of claim 1 wherein a number of codewords are chosen such that said dictionary is chosen to fit within a cache associated with a computer implementing said computer-based method.
  - 7. The computer-based method of claim 1 wherein said values in dataset are clustered via a clustering method, codewords being assigned to cluster identifiers instead of value prefixes.
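The dynamic programming selection referred to in claims 1 and 3 can be sketched on a binary trie of bit strings. The sketch below is a simplified variant written for illustration, not the claimed recurrence: it assumes fixed-width bit-string values, scores a chosen node only by frequency × neck (the codeword-length term of benefit(v) is left out), and treats values with no chosen prefix as stored uncoded. The name `choose_prefixes` and its return shape are hypothetical.

```python
from collections import Counter
from functools import lru_cache


def choose_prefixes(values, budget):
    """Pick at most `budget` prefixes maximising total bits covered by prefixes.

    Returns (score, chosen_prefixes). Assumption: score sums frequency * neck
    over chosen nodes, where neck is the distance to the nearest chosen ancestor
    (or to the root if there is none).
    """
    width = len(values[0])
    freq = Counter()
    for v in values:
        for n in range(width + 1):
            freq[v[:n]] += 1  # occurrence frequency of every value prefix

    @lru_cache(maxsize=None)
    def best(prefix, k, neck):
        if freq[prefix] == 0:  # no value passes through this trie node
            return (0, ())

        def split(k_children, child_neck):
            # Distribute the remaining budget over the left and right subtrees.
            if len(prefix) == width:  # leaf: no children
                return (0, ())
            options = []
            for i in range(k_children + 1):
                ls, lp = best(prefix + "0", i, child_neck)
                rs, rp = best(prefix + "1", k_children - i, child_neck)
                options.append((ls + rs, lp + rp))
            return max(options)

        # Case 1: skip this node, so the neck grows by one for the children.
        result = split(k, neck + 1)
        # Case 2: choose this node, so the children restart with neck 1.
        if k >= 1:
            score, picked = split(k - 1, 1)
            result = max(result, (freq[prefix] * neck + score, (prefix,) + picked))
        return result

    return best("", budget, 0)
```

For the sample values `000` (four times), `001` (four times), `110`, and `111`, a budget of 1 selects the prefix `00` with score 16, and a budget of 2 selects `000` and `001` with score 24.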

2. An article of manufacture comprising computer usable medium implementing a computer-based method to compress a dataset of values from one or more domains, said computer-based method implemented in computer readable program code stored in computer memory, said medium comprising:
- a. computer readable program code aiding in receiving input data;
  
  b. computer readable program code aiding in identifying a dictionary size budget;
  
  c. computer readable program code aiding in scanning a sample of input data values to identify occurrence frequencies of each value prefix in said sample;
  
  d. based on said identified dictionary size budget in (b) and said identified occurrence frequencies in (c), computer readable program code aiding in choosing a subset of value prefixes for which codewords are to be assigned, wherein a choice of assigning prefix codes to common prefixes associated with a subset of said set of data values is optimized to provide maximum compression for a given constraint on the number of codewords;
  
  e. computer readable program code aiding in building a dictionary and assigning prefix codes to said chosen subset of value prefixes in (d);
  
  f. computer readable program code aiding in rescanning said input data wherein, for each given value v, finding its longest prefix u in said dictionary, chosen in step (e) and outputting a concatenation of codeword(u) and w, wherein w is the suffix of value v, said suffix comprising remaining uncoded bits of said given value, and wherein said optimization is performed via a dynamic programming algorithm, wherein, in said dynamic programming algorithm, best prefixes are chosen based on whether a given node v is a leaf node or whether the given node v is not a leaf node.
- Dependent claims (8, 9, 10, 11, 12)
  - 8. The article of manufacture of claim 2, wherein best prefixes are chosen as per the following benefit function:

    for the case when a given node v is a leaf node,

    $$\operatorname{benefit}(v,K,L)=\max\Bigl\{\max\bigl\{\operatorname{benefit}(\operatorname{left}(v),i,L+1)+\operatorname{benefit}(\operatorname{right}(v),K-i,L+1)\mid 0\le i\le K\bigr\},\ \max\bigl\{\operatorname{benefit}(\operatorname{left}(v),i-1,1)+\operatorname{benefit}(\operatorname{right}(v),K-i,1)\mid 1\le i\le K\bigr\}\Bigr\};$$

    or for the case when a given node v is not a leaf node, $\operatorname{benefit}(v,K,L)=0$ if $K=1$, or $\operatorname{benefit}(v,K,L)=\operatorname{benefit}(v)$ if $K\ge 1$;

    wherein benefit of a node v is given by:

    $$\operatorname{benefit}(v)=\operatorname{frequency}(v)\times\operatorname{neck}(v)-\operatorname{frequency}(v)\log\bigl(1/\operatorname{frequency}(v)\bigr);$$

    wherein K represents a given dictionary budget, neck(v) represents the neck of a node v, L represents a neck length associated with a node, and left(v) and right(v) represent left and right nodes with respect to node v.
  - 9. The article of manufacture of claim 2, wherein the method additionally comprises the step of applying Huffman coding and assigning shorter codewords to more frequent prefix codes.
  - 10. The article of manufacture of claim 2, wherein a choice of assigning prefix codes to common prefixes associated with a subset of said set of data values is optimized to provide maximum compression for a given constraint on the number of codewords.
  - 11. The article of manufacture of claim 2, wherein a number of codewords are chosen such that said dictionary is chosen to fit within a cache associated with a computer implementing said computer-based method.
  - 12. The article of manufacture of claim 2, wherein said values in dataset are clustered via a clustering method, codewords being assigned to cluster identifiers instead of value prefixes.
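The Huffman step of claim 9 (shorter codewords for more frequent prefixes) can be sketched with the textbook Huffman construction. This is a generic illustration rather than the patent's own procedure; the function name `huffman_codes` and the sample frequencies are hypothetical, and the code-length quantization mentioned in the abstract is not modelled.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Build a Huffman code for dictionary prefixes from their frequencies.

    freqs maps each chosen prefix to its occurrence count; the result maps
    each prefix to a bit-string codeword.
    """
    if len(freqs) == 1:
        # A single symbol still needs a one-bit codeword.
        return {sym: "0" for sym in freqs}
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, extending their codewords.
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]
```

For example, prefix frequencies `{"00": 8, "11": 2, "01": 1, "10": 1}` yield a 1-bit codeword for `00`, a 2-bit codeword for `11`, and 3-bit codewords for `01` and `10`.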

13. A computer-based method to compress a set of data values from one or more domains, said computer-based method implemented in computer readable program code stored in computer memory, said computer-based method comprising the steps of:
- a. building a dictionary and assigning prefix codes to common prefixes associated with a subset of said set of data values, and

  b. coding a given value in said set of data values as a combination of a codeword for said given value's longest prefix in said dictionary, followed by a suffix associated with said given value, said suffix comprising remaining uncoded bits of said given value,

  wherein said optimization is performed via a greedy algorithm, and the method further comprises the step of constructing a prefix tree, or trie, said trie representing statistical information about said input data, and, a leaf in said trie represents a complete input value, while an intermediate node represents the common prefix for the set of input values which are represented in a branch of the trie under that node, with statistical information being stored in the form of node labels, said greedy algorithm;
  
  (1) computing a benefit for each prefix that measures the value of assigning a codeword to that prefix;
  
  (2) choosing a node in said trie with the best benefit; and
  
  iteratively repeating steps (1)-(2) until a dictionary budget has been exhausted.
- Dependent claims (14, 15)
  - 14. The computer-based method of claim 13 wherein a number of codewords are chosen such that said dictionary is chosen to fit within a cache associated with a computer implementing said computer-based method.
  - 15. The computer-based method of claim 14 wherein said values in dataset are clustered via a clustering method, codewords being assigned to cluster identifiers instead of value prefixes.
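The greedy loop of claim 13 (compute a benefit per prefix, pick the best node, repeat until the budget is exhausted) can be sketched as follows. Several details are assumptions made for illustration: values are fixed-width bit strings, a set of all value prefixes stands in for the explicit trie, and the benefit of a prefix is taken to be the number of additional bits it would cover given the prefixes already chosen. The name `greedy_prefixes` is hypothetical.

```python
from collections import Counter


def greedy_prefixes(values, budget):
    """Greedily choose up to `budget` prefixes by marginal benefit."""
    width = len(values[0])
    counts = Counter(values)
    # Every non-empty prefix of every value; stands in for the trie nodes.
    candidates = {v[:n] for v in counts for n in range(1, width + 1)}
    covered = {v: 0 for v in counts}  # bits covered by v's longest chosen prefix
    chosen = []

    def benefit(p):
        # (1) Extra bits covered if p were added to the dictionary now.
        return sum(c * (len(p) - covered[v])
                   for v, c in counts.items()
                   if v.startswith(p) and len(p) > covered[v])

    for _ in range(budget):
        remaining = sorted(candidates - set(chosen))  # sorted for determinism
        if not remaining:
            break
        # (2) Choose the node with the best benefit (first one wins ties).
        best = max(remaining, key=benefit)
        if benefit(best) <= 0:
            break
        chosen.append(best)
        for v in counts:
            if v.startswith(best):
                covered[v] = max(covered[v], len(best))
    return chosen
```

On the sample values `000` (four times), `001` (four times), `110`, and `111` with a budget of 2, this sketch picks `00` and then `000`, covering 20 bits in total, whereas choosing `000` and `001` would cover 24. Under this simplified benefit measure, the greedy choice is therefore a heuristic rather than a guaranteed optimum.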


Current Assignee
International Business Machines Corporation, ServiceNow Incorporated
Original Assignee
International Business Machines Corporation
Inventors
Raman, Vijayshankar; Swart, Garret; Diaz-Gutierrez, Pablo
Primary Examiner(s)
JEAN PIERRE, PEGUY

Application Number

US11/970,844
Publication Number

US 20090174583A1
Time in Patent Office

658 Days
Field of Search

341/50, 341/106, 341/65, 341/67, 341/78, 341/79
US Class Current

341/51
CPC Class Codes

H03M 7/40 Conversion to or from varia...
