Efficient column based data encoding for large-scale data storage
Abstract
The subject disclosure relates to column based data encoding where raw data to be compressed is organized by columns, and then, as first and second layers of reduction of the data size, dictionary encoding and/or value encoding are applied to the data as organized by columns, to create integer sequences that correspond to the columns. Next, a hybrid greedy run length encoding and bit packing compression algorithm further compacts the data according to an analysis of bit savings. Synergy of the hybrid data reduction techniques in concert with the column-based organization, coupled with gains in scanning and querying efficiency owing to the representation of the compact data, results in substantially improved data compression at a fraction of the cost of conventional systems.
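The first two stages named in the abstract (column organization, then dictionary and/or value encoding into integer sequences) can be illustrated with a minimal Python sketch. The function names `to_columns`, `dictionary_encode`, and `value_encode`, and the sample rows, are hypothetical. The value encoding shown here (scale by a power of ten, then subtract the minimum) is only one plausible interpretation and is not taken from the disclosure.

```python
def to_columns(rows, fields):
    """Reorganize row records into one value sequence per data field."""
    return {f: [row[f] for row in rows] for f in fields}


def dictionary_encode(values):
    """Map each distinct value to a small integer ID, in first-seen order."""
    dictionary = {}
    ids = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        ids.append(dictionary[v])
    return ids, dictionary


def value_encode(values, decimals):
    """Assumed value encoding: scale to integers, then subtract the minimum."""
    scaled = [round(v * 10 ** decimals) for v in values]
    base = min(scaled)
    return [s - base for s in scaled], base


# Hypothetical sample data: three rows with a text field and a numeric field.
rows = [
    {"city": "Seattle", "price": 10.5},
    {"city": "Boston", "price": 12.0},
    {"city": "Seattle", "price": 10.5},
]
columns = to_columns(rows, ["city", "price"])
city_ids, city_dict = dictionary_encode(columns["city"])
price_ints, price_base = value_encode(columns["price"], decimals=1)
```

Each column ends up as a sequence of small non-negative integers, which is the form that a later run length or bit packing stage would operate on.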
20 Claims
1. A computer-implemented method for encoding data to simultaneously compact and organize the data in a manner that facilitates efficient data access operations, including:
an act of a computer system, which includes at least one processing unit, organizing the data according to a set of column based sequences of values corresponding to different data fields of the data;
an act of the computer system transforming the set of column based sequences of values to a set of column based integer sequences of values according to at least one encoding algorithm; and
an act of the computer system compressing the set of column based integer sequences according to an iterative hybrid compression algorithm, wherein the iterative hybrid compression algorithm includes performing the following, in at least one iterative encoding step:
an act of analyzing the set of column based integer sequences to determine which encoding technique from a plurality of encoding techniques to apply to compress the set of column based integer sequences by at least comparing a first computed bit savings of applying a first encoding technique against a second computed bit savings of applying a second encoding technique; and
an act of applying the first or second encoding technique on at least a portion of the set of column based integer sequences based on the analysis.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
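The comparison of computed bit savings recited in claim 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the claimed algorithm: it takes run length encoding and bit packing as the two techniques (as the abstract suggests), visits one run at a time, and only tags segments rather than writing actual bits. The constants `RAW_BITS` and `RUN_COUNT_BITS` and the function names are hypothetical, and the input is assumed to be a non-empty sequence of non-negative integers.

```python
from itertools import groupby

RAW_BITS = 32        # assumed width of an unencoded integer
RUN_COUNT_BITS = 16  # assumed width of a stored run length


def hybrid_encode(seq):
    """Choose RLE or bit packing per run by comparing computed bit savings."""
    pack_width = max(1, max(seq).bit_length())  # bits per bit-packed value
    segments = []
    for value, group in groupby(seq):
        n = len(list(group))
        # Savings versus storing the run as n raw integers.
        rle_savings = n * RAW_BITS - (pack_width + RUN_COUNT_BITS)
        pack_savings = n * (RAW_BITS - pack_width)
        if rle_savings > pack_savings:
            segments.append(("rle", value, n))
        elif segments and segments[-1][0] == "pack":
            segments[-1][1].extend([value] * n)  # grow the open packed segment
        else:
            segments.append(("pack", [value] * n))
    return pack_width, segments


def hybrid_decode(segments):
    """Expand tagged segments back into the original integer sequence."""
    out = []
    for seg in segments:
        if seg[0] == "rle":
            out.extend([seg[1]] * seg[2])
        else:
            out.extend(seg[1])
    return out


data = [7, 7, 7, 7, 7, 7, 7, 7, 3, 5, 3, 1, 1]
width, encoded = hybrid_encode(data)
```

With these assumed widths, the run of eight 7s is stored as a single run length entry, while the short runs that follow are cheaper to bit pack at 3 bits per value.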
14. A computer system, comprising:
at least one processing unit; and
one or more storage media having stored thereon executable instructions that, when executed by the at least one processing unit, carry out a method for encoding data to simultaneously compact and organize the data in a manner that facilitates efficient data access operations, including the following:
an act of the computer system organizing the data according to a set of column based sequences of values corresponding to different data fields of the data;
an act of the computer system transforming the set of column based sequences of values to a set of column based integer sequences of values according to at least one encoding algorithm; and
an act of the computer system compressing the set of column based integer sequences according to an iterative hybrid compression algorithm, wherein the iterative hybrid compression algorithm includes performing the following, in at least one iterative encoding step:
an act of analyzing the set of column based integer sequences to determine which encoding technique from a plurality of encoding techniques to apply to compress the set of column based integer sequences by at least comparing a first computed bit savings of applying a first encoding technique against a second computed bit savings of applying a second encoding technique; and
an act of applying the first or second encoding technique on at least a portion of the set of column based integer sequences based on the analysis.
Dependent claims: 15, 16, 17
18. One or more physical storage medium storing computer-executable instructions that, when executed by at least one processing unit of a computer system, carry out a method for encoding data to simultaneously compact and organize the data in a manner that facilitates efficient data access operations, including the following:
an act of the computer system organizing the data according to a set of column based sequences of values corresponding to different data fields of the data;
an act of the computer system transforming the set of column based sequences of values to a set of column based integer sequences of values according to at least one encoding algorithm; and
an act of the computer system compressing the set of column based integer sequences according to an iterative hybrid compression algorithm, wherein the iterative hybrid compression algorithm includes performing the following, in at least one iterative encoding step:
an act of analyzing the set of column based integer sequences to determine which one or more encoding technique from a plurality of encoding techniques to apply to compress the set of column based integer sequences by at least comparing a first computed bit savings of applying a first encoding technique against a second computed bit savings of applying a second encoding technique; and
an act of applying the one or more encoding technique on at least a portion of the set of column based integer sequences based on the analysis.
Dependent claims: 19, 20
Specification