Method and system for adaptively building a static Ziv-Lempel dictionary for database compression
First Claim
1. In a computer-implemented system for compressing input data consisting of sequences of source symbols selected from a source alphabet to form output data consisting of sequences of code symbols selected from a code alphabet according to a static dictionary stored in memory, said dictionary representing a static parse-tree having nodes representing said code symbols, said nodes being linked into paths representing said source symbol sequences, a method for creating said static dictionary comprising the steps of:
- (a) repeatedly performing the steps of(a.1) determining a source symbol sequence from said input data,(a.2) adding at least one node to said parse-tree responsive to said source symbol sequence, and(a.3) assigning a use count value to said at least one node responsive to the number of said source symbol sequence occurrences; and
(b) reducing said parse-tree to a first predetermined plurality of nodes by repeatedly deleting from said parse-tree one or more childless nodes having a use count value less than a predetermined use count value threshold.
1 Assignment
0 Petitions
Accused Products
Abstract
A system for creating a static data compression dictionary adapted to a hardware-based data compression architecture. A static Ziv-Lempel dictionary is created and stored in memory for use in compressing database records. No data compression occurs during dictionary construction. A fixed-size Ziv-Lempel parse-tree is adapted to database characteristics in one of two alternate ways. First, the parse-tree is overbuilt substantially and then pruned back to a static size by eliminating the least recently used (LRU) nodes having the lowest use count. Alternatively, the parse-tree is built to a static size and thereafter selected nodes are replaced with new nodes upon database sampling. This node recycling procedure chooses the least-useful nodes for replacement according to a use count and LRU strategy while exhausting the database sample. The pruned Ziv-Lempel parse-tree is then transformed to a static dictionary configuration and stored in memory for use in a hardware-based database compression procedure. Completion of the static dictionary before starting data compression eliminates the initial compression inefficiencies well-known for the Ziv-Lempel procedure. The parse-tree construction is enhanced by initializing the tree with NULL and DEFAULT sequences from database definitions before examining any data.
134 Citations
12 Claims
-
1. In a computer-implemented system for compressing input data consisting of sequences of source symbols selected from a source alphabet to form output data consisting of sequences of code symbols selected from a code alphabet according to a static dictionary stored in memory, said dictionary representing a static parse-tree having nodes representing said code symbols, said nodes being linked into paths representing said source symbol sequences, a method for creating said static dictionary comprising the steps of:
-
(a) repeatedly performing the steps of (a.1) determining a source symbol sequence from said input data, (a.2) adding at least one node to said parse-tree responsive to said source symbol sequence, and (a.3) assigning a use count value to said at least one node responsive to the number of said source symbol sequence occurrences; and (b) reducing said parse-tree to a first predetermined plurality of nodes by repeatedly deleting from said parse-tree one or more childless nodes having a use count value less than a predetermined use count value threshold. - View Dependent Claims (2)
-
-
3. In a computer-implemented system for compressing input data arranged in a data stream of one or more records consisting of sequences of source symbols selected from a source alphabet to form output data consisting of sequences of code symbols selected from a code alphabet according to a static dictionary stored in a memory, said dictionary representing a static parse-tree having nodes representing said code symbols, said nodes being linked into paths representing said source symbol sequences, a method for creating said static dictionary comprising the steps of:
-
(a) initializing a parse-tree with a plurality of said paths representing a set of said source symbol strings, each said path having at least one node with a unity use count value;
.(b) setting a current input pointer at the beginning of said data stream;
performing the steps of(c.1) determining the longest said source symbol sequence S, represented by a path P in said parse-tree, that matches a current said source symbol sequence in said data stream beginning at said current input pointer, (c.2) incrementing said use count value for all nodes in said path P, (c.3) adding a new node N having a unity use count value to the end of said path P to form a new path P'"'"' representing a new source symbol sequence S'"'"' consisting of said string S extended by at least one immediately subsequent source symbol in said data stream, and (c.4) advancing said current input pointer to immediately after said sequence S'"'"' in said data stream; (s) if said parse-tree contains less than a first predetermined plurality of nodes, repeating said performing step (c); (e) combining with its parent node one or more child nodes in said parse-tree, said child nodes each having a single-child parent node for which said use count value differs by no more than one from said use count value for each child node, thereby forming one or more new leaf nodes; (f) assembling said nodes with the associated said paths to form said static dictionary; and (g) storing said static dictionary in said memory. - View Dependent Claims (4, 5)
-
-
6. In a computer-implemented system for compressing input data arranged in a data stream of one or more records consisting of sequences of source symbols selected from a source alphabet to form output data consisting of sequences of code symbols selected from a code alphabet according to a static dictionary stored in a memory, said dictionary representing a static parse-tree having nodes representing said code symbols, said nodes being Linked into paths representing said source symbol sequences, a method for creating said static dictionary comprising the steps of:
-
(a) initializing a parse-tree with a plurality of said paths representing a set of said source symbol strings, each said path having at least one node with a unity use count value; (b) setting a current input pointer at the beginning of said data stream; (c) until said data stream is exhausted, repeatedly performing the steps of; (c.1) determining the longest said source symbol sequence S, represented by a path P in said parse-tree, that matches a current said source symbol sequence in said data stream beginning at said current input pointer, (c.2) incrementing said use count value for all nodes in said path P, (c.3) if said sequence S is not the final sequence in a record, adding a new node N having a unity use count value to said path P to form a new path P'"'"' representing a new source symbol sequence S'"'"' consisting of said sequence S extended by at least one immediately subsequent said source symbol in said data stream, (c.4) linking said new node N to the end of a LRU chain, (c.5) advancing said current input pointer to the end of said sequence S'"'"' in said data stream, and (c.6) performing the steps of (e.6.1) initializing a use count value threshold to a fourth predetermined value, and (e.6.2) discarding from said LRU chain one said node whose use count value does not exceed said use count value threshold; (d) assembling said nodes with the associated said paths to form said static dictionary; and (e) storing said static dictionary in said memory. - View Dependent Claims (7, 8)
-
-
9. The method of claim 9 wherein said linking step (c.4) comprises the steps of:
-
(c.4.1) creating a leaf chain by sequentially linking each childless node in said LRU chain; and (c.4.2) combining with its parent node one or more child nodes in said leaf chain, said child nodes each having a single-child parent node for which said use count value differs by no more than one from said use count value for said each child node, thereby forming one or more new leaf nodes.
-
-
10. A computer-implemented system for compressing input data arranged in a data stream of one or more records consisting of sequences of source symbols selected from a source alphabet to form output data consisting of sequences of code symbols selected from a code alphabet according to a static dictionary stored in a memory, said dictionary representing a static parse-tree having nodes representing said code symbols, said nodes being linked into paths representing said source symbol sequences, said system comprising
primer means for initializing a parse-tree with a plurality of said paths representing a set of said source symbol strings, each said path having at least one node with a unity use count value; -
pointer means for setting a current input pointer at the beginning of said data stream; comparator means for determining the longest said source symbol sequence S, represented by a path P in said parse-tree, that matches a current said source symbol sequence in said data stream beginning at said current input pointer; adder means for incrementing said use count value for all nodes in said path P; extender means for adding a new node N having a unity use count value to the end of said path P to form a new path P'"'"' representing a new source symbol sequence S'"'"' consisting of said string S extended by at least one immediately subsequent source symbol in said data stream; register means for advancing said current input pointer to immediately after said sequence S'"'"' in said data stream; counter means for determining when the number of said nodes in said parse-tree exceeds a first predetermined plurality of nodes; first linker means for sequentially linking each childless node in said parse tree to form a leaf node chain; node combiner means for combining with its parent node a child node in said leaf node chain having a single-child parent node for which said use count value differs by no more than one from said use count value for said child node, thereby forming a new leaf node;
transformer means for assembling said nodes with the associated said paths to form said static dictionary; andstorage means for storing said static dictionary in said memory. - View Dependent Claims (11, 12)
-
Specification