Compressing sets of integers
First Claim
1. A computer implemented method of compressing a data stream that includes a set of integers in order to produce a compressed byte stream, the method comprising the steps of:
- examining the data stream to identify a subset of integers that share one or more common leading bytes;
general a first group of bytes that represents the common leading bytes of the identified subset of integers, the first group of bytes comprising only a single instance of the common leading bytes in order to compress the overall data stream;
generating a second group of bytes that represents a truncated form of the integers of the identified subset, the truncated form resulting from removal of the common leading bytes; and
generating a compressed byte stream that includes the first and second groups of bytes.
2 Assignments
0 Petitions
Accused Products
Abstract
In one aspect, the disclosed technique detects common leading byte patterns in the integers so that these patterns need only be stored once in the encoded byte stream. Those integers that share a common leading byte pattern are stored in truncated form, without their common leading bytes. These truncated integers may themselves be further examined to determine if any of them share additional common leading bytes beyond those already detected. Thus, the technique lends itself naturally to description using the language of trees. Integers with a common leading byte pattern are stored as child nodes, their parent being the node containing the common byte pattern. Child nodes consist only of those bytes remaining after the initial byte pattern has been extracted; the greater the number of children, the greater are the efficiency gains. All the children of a given tree or subtree are similarly examined for common leading byte patterns, ignoring those bytes that are already accounted for in their ancestor nodes. In a second aspect, the disclosed technique makes use of "clustering", a second type of locality that is not reached by the interval concept. A cluster is a sequence of singleton integers that are very close together but do not form a contiguous interval. The technique recognizes that such a cluster can be compactly stored as a bitmap, in which each active bit ("1-bit") represents a member of the cluster. The choice of bitmap size (e.g., 1 byte, 2 bytes, etc.) can be calibrated to suit the clustering characteristics of the input data set.
-
Citations
26 Claims
-
1. A computer implemented method of compressing a data stream that includes a set of integers in order to produce a compressed byte stream, the method comprising the steps of:
-
examining the data stream to identify a subset of integers that share one or more common leading bytes; general a first group of bytes that represents the common leading bytes of the identified subset of integers, the first group of bytes comprising only a single instance of the common leading bytes in order to compress the overall data stream; generating a second group of bytes that represents a truncated form of the integers of the identified subset, the truncated form resulting from removal of the common leading bytes; and generating a compressed byte stream that includes the first and second groups of bytes. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A computer implemented method of compressing a data stream that includes a set of integers in order to produce a compressed byte stream, the method comprising the steps of:
-
examining the data stream to identify a subset of integers that share one or more common leading bytes that are suited for representation using a bitmap having a size equal to a selected number of bytes; generating a first group of bytes that represents the common leading bytes of the identified subset of integers, the first group of bytes comprising only a single instance of the common leading bytes in order to compress the overall data stream; generating a second group of bytes that represents a truncated form of the integers of the identified subset, the truncated form resulting from removal of the common leading bytes, comprising the steps of; selecting the lowest-value integer of the identified subset as a bitmap reference point and including the truncated form of the selected lowest-value integer in the second group of bytes; creating a bitmap having a size equal to the selected number of bytes, wherein each of the integers in the identified subset, other than the lowest-value integer, is mapped in the bitmap; and including the bitmap in the second group of bytes; and generating a compressed byte stream that includes the first and second groups of bytes. - View Dependent Claims (9, 10, 11, 12)
-
-
13. A computer program, residing on a computer readable medium, for compressing a data stream that includes a set of integers in order to produce a compressed byte stream, the computer program comprising instructions for executing the steps of:
-
examining the data stream to identify a subset of integers that share one or more common leading bytes; generating a first group of bytes that represents the common leading bytes of the identified subset of integers, the first group of bytes comprising only a single instance of the common leading bytes in order to compress the overall data stream; generating a second group of bytes that represents a truncated form of the integers of the identified subset, the truncated form resulting from removal of the common leading bytes; and generating a compressed byte stream that includes the first and second groups of bytes. - View Dependent Claims (14, 15, 16, 17, 18, 19)
-
-
20. A computer program, residing on a computer readable medium, for compressing a data stream that includes a set of integers in order to produce a compressed byte stream, the computer program comprising instructions for executing the steps of:
-
examining the data stream to identify a subset of integers that share one or more common leading bytes and that are suited for representation using a bitmap having a size equal to a selected number of bytes; generating a first group of bytes that represents the common leading bytes of the identified subset of integers, the first group of bytes comprising only a single instance of the common leading bytes in order to compress the overall data stream; generating a second group of bytes that represents a truncated form of the integers of the identified subset, the truncated form resulting from removal of the common leading bytes comprising the steps of; selecting the lowest-value integer of the identified subset as a bitmap reference point and including the truncated form of the selected lowest-value integer in the second group of bytes; creating a bitmap having a size equal to the selected number of bytes, wherein each of the integers in the identified subset, other than the lowest-value integer, is mapped in the bitmap; and including the bitmap in the second group of bytes; and generating a compressed byte stream that includes the first and second groups of bytes. - View Dependent Claims (21, 22, 23, 24)
-
-
25. A computer implemented method of compressing a set of integers to produce an encoded byte stream, the method comprising:
-
selecting a subset of integers suited for bitmap representation; selecting a reference point; and creating a bitmap of one or more bytes in which the state of a particular bit represents whether a particular integer is contained in the subset and in which the location of the particular bit in the bitmap represents the location of the integer relative to the reference point, wherein the integer whose location is relative to the reference point is given by the function I(b,r)=r+b+1, where r is the reference point and b is the bit position in the bitmap.
-
-
26. A computer program, residing on a computer readable medium, for compressing a set of integers to produce an encoded byte stream, the computer program comprising instructions for:
-
selecting a subset of integers suited for bitmap representation; selecting a reference point; and creating a bitmap of one or more bytes in which the state of a particular bit represents whether a particular integer is contained in the subset and in which the location of the particular bit in the bitmap represents the location of the integer relative to the reference point, wherein the integer whose location is relative to the reference point is given by the function I(b,r)=r+b+1, where r is the reference point and b is the bit position in the bitmap.
-
Specification