EFFICIENT COLUMN BASED DATA ENCODING FOR LARGE-SCALE DATA STORAGE
First Claim
1. A method for encoding data, including:
- organizing the data according to a set of column based sequences of values corresponding to different data fields of the data;
transforming the set of column based sequences of values to a set of column based integer sequences of values according to at least one encoding algorithm; and
compressing the set of column based integer sequences according to at least one compression algorithm.
2 Assignments
0 Petitions
Accused Products
Abstract
The subject disclosure relates to column based data encoding where raw data to be compressed is organized by columns, and then, as first and second layers of reduction of the data size, dictionary encoding and/or value encoding are applied to the data as organized by columns, to create integer sequences that correspond to the columns. Next, a hybrid greedy run length encoding and bit packing compression algorithm further compacts the data according to an analysis of bit savings. Synergy of the hybrid data reduction techniques in concert with the column-based organization, coupled with gains in scanning and querying efficiency owing to the representation of the compact data, results in substantially improved data compression at a fraction of the cost of conventional systems.
-
Citations
20 Claims
-
1. A method for encoding data, including:
-
organizing the data according to a set of column based sequences of values corresponding to different data fields of the data; transforming the set of column based sequences of values to a set of column based integer sequences of values according to at least one encoding algorithm; and compressing the set of column based integer sequences according to at least one compression algorithm. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A method for encoding data, including:
-
transforming the data to integer sequences of values, each integer sequence serially representing values of a different column of the data; and analyzing the integer sequences to determine whether to apply run length encoding (RLE) compression or bit packing compression including analyzing bit savings of RLE compression relative to bit packing compression to determine where the maximum bit savings is achieved. - View Dependent Claims (8, 9, 10, 11)
-
-
12. A method for encoding data, including:
-
transforming the data to integer sequences of values, each integer sequence serially representing values of a different field of the data; analyzing the integer sequences to determine whether to apply a run length encoding (RLE) compression or a bit packing compression including analyzing bit savings of RLE compression relative to bit packing compression for groups defined over the columns including generating a histogram for the values of the integer sequences to prioritize maximum bit savings.
-
-
13. An encoder, including:
-
an organization component for organizing raw data received as a set of serialized values corresponding to different fields or columns of the raw data to form columnized sequences of data; a data encoding component that performs at least one of dictionary encoding or value encoding to uniformly represent the columnized sequences of data as integer sequences; and a compression component that determines on which portion of which of the integer sequences to perform compression next, and whether to perform the compression with run length encoding (RLE) that represents a repeated value as a run or a bit packing algorithm that seeks to minimize number of bits used to represent a portion including, for each portion defined for the integer sequences, analyzing a performance metric of RLE relative to bit packing. - View Dependent Claims (14, 15)
-
-
16. A method for encoding data, comprising:
-
receiving at least one portion of an integer sequence of values representing a column of data; reducing a number of bits used to represent each integer based on determining a minimum number of bits to use for the at least one portion of the integer sequence; removing any shared numerical powers across the values of the at least one portion of the integer sequence; and offsetting the values of the at least one portion of the integer sequence spanning a range, further reducing the number of bits. - View Dependent Claims (17, 18, 19, 20)
-
Specification