Data compression method and system
First Claim
1. A computer-implemented data compression method of compressing an input stream of bytes, each byte having a position, comprising:
- assigning a token type to each byte of the input stream, the token type assigned to each byte being selected from plural token types;
searching for a matching sequence of already processed bytes that matches a current sequence of bytes;
determining that a selected byte of the current sequence and a corresponding byte of the matching sequence have the same token type;
determining a match length indicating the number of bytes in the matching sequence;
determining a token offset indicating the number of already processed bytes of the same token type positioned between the current sequence and the matching sequence; and
appending to an encoded data stream a match code indicating the token offset and the match length.
2 Assignments
0 Petitions
Accused Products
Abstract
A method and system for compressing an input stream of data bytes into a compressed stream of data bytes using an LZ77-based compression scheme. The method and system also includes a decompressor for decompressing the compressed stream into a decompressed stream of data bytes that is identical to the input stream. The compression system encodes matches using token offsets rather than the byte offsets used by prior art LZ77-based compression schemes. The compression system uses knowledge of the internal format of the input stream to identify tokens that are used to determine the token offsets. Preferably, the method parses the input stream by dividing it into tokens and assigning a token type to each token. The method searches the input stream for a matching sequence of already processed tokens that is identical to a current sequence of tokens. If a matching sequence is found, the method determines whether the token type of a selected token, such as the first token, of the current sequence matches the token type of a corresponding token of the matching sequence. The method determines a token offset indicating the number of bytes of the matching token type occurring between the selected byte and the corresponding byte of the matching sequence. The method determines the length of the match and encodes the current sequence as a match pair that includes the token offset and the length of the match.
141 Citations
35 Claims
-
1. A computer-implemented data compression method of compressing an input stream of bytes, each byte having a position, comprising:
-
assigning a token type to each byte of the input stream, the token type assigned to each byte being selected from plural token types; searching for a matching sequence of already processed bytes that matches a current sequence of bytes; determining that a selected byte of the current sequence and a corresponding byte of the matching sequence have the same token type; determining a match length indicating the number of bytes in the matching sequence; determining a token offset indicating the number of already processed bytes of the same token type positioned between the current sequence and the matching sequence; and appending to an encoded data stream a match code indicating the token offset and the match length. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
-
-
10. A computer-implemented data compression method of compressing an input stream of data, comprising:
-
parsing the input stream by dividing the input stream into tokens and assigning a token type to each token; searching for a matching sequence of already processed tokens that matches a current sequence of tokens; determining that a selected token of the current sequence and a corresponding token of the matching sequence are each of a first token type; determining a match length indicating the number of tokens in the matching sequence; determining a token offset indicating the number of already processed tokens of the first token type between the current sequence and the matching sequence; and appending to an encoded data stream a match flag code indicating the token offset and the match length. - View Dependent Claims (11, 12, 13, 14, 15, 16)
-
-
17. A method in a computer system for compressing data, the data comprising a multiplicity of tokens, each token having a token value and one of a plurality of token types, the method comprising the steps of:
-
determining the token type of each token; finding a matching sequence of tokens having the same token values as a current sequence and having a token with the same token type as a corresponding token in the current sequence; and representing the current sequence based on a count of tokens having the same token type, the count of tokens indicating a position of the matching sequence. - View Dependent Claims (18, 19, 21)
-
-
20. The method of clam 19, further including forming the literal from less than all of the token indicated by the literal, such that when the first literal flag code is appended to the literal, the literal and first literal flag code do not include more bits than the token.
-
22. A data compression system for compressing an input stream of data, comprising:
-
a parser that parses the input stream by dividing the input stream into tokens and assigns a token type to each token; a match finder that finds a matching sequence of tokens that matches a current sequence of tokens, the matching sequence having a token with the same token type as a corresponding token in the current sequence; and an encoder that encodes the current sequence as a match code indicating a count of tokens of the same token type, the count of tokens indicating a position of the matching sequence. - View Dependent Claims (23, 24)
-
-
25. A computer storage medium for controlling a computer to compress data, the data including a plurality of tokens, each token having a token value and one of a plurality of token types, comprising:
-
computer instructions for determining the token type of each token; computer instructions for finding a matching sequence of tokens having the same token values as a current sequence and having a token with the same token type as a corresponding token in the current sequence; and computer instructions for representing the current sequence based on a count of tokens having the same token type, the count of tokens indicating a position of the matching sequence. - View Dependent Claims (26, 27, 28, 29)
-
-
30. A method in a computer system for decompressing encoded data into decoded data having a multiplicity of tokens, each token having a token value and one of a plurality of token types, the encoded data comprising an encoded match code associated with a matched sequence of tokens, the method comprising the steps of:
-
determining the token type of a selected token of the matched sequence; determining from the encoded match code a count of tokens having the same token type as the selected token of the matched sequence, the count of tokens indicating a position of a matching sequence of tokens that match the matched sequence of tokens; accessing the matching sequence using the count of tokens; and decoding the matched sequence based on the token values of the matching sequence. - View Dependent Claims (31, 32)
-
-
33. A method in a computer system for compressing a data string having a sequence of data values, each data value having one of a plurality of data value types, the method comprising:
-
identifying the data value type of a current data value in the data string; determining whether a matching data value and its data value type in the data string matches the current data value and its data value type; and encoding the current data value based on a count of data values having the same data value type as the current data value, the count of data values indicating a position of the matching data value in the data string. - View Dependent Claims (34, 35)
-
Specification