Method and system for compression and decompression using variable-sized offset and length fields

US 5,933,104 A
Filed: 11/22/1995
Issued: 08/03/1999
Est. Priority Date: 11/22/1995
Status: Expired due to Term

Abstract

A computer system includes a compression engine for compressing a decompressed sequence of data to produce a compressed sequence of data. The compression engine encodes each piece of data in the decompressed sequence of data as either a portion of a copy token or as a literal token. Tokens are grouped together into groups of up to 8 tokens and a bitmap holding 8 bits is provided to identify the respective tokens as either copy tokens or literal tokens. The copy tokens encode sub-sequences of data that have previously occurred in the decompressed data sequence. Each copy token is of a like size but includes a variable-sized offset field for encoding an offset between a current occurrence of a sub-sequence of data and a previous occurrence of a sub-sequence of data. The offset field is variable-sized to encode the offset in a minimal number of bits. The computer system also includes a decompression engine for decompressing data sequences that have been compressed using the compression engine.
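
The abstract's central idea, a fixed-size copy token whose offset field takes only as many bits as the current position requires, can be illustrated with a small sketch. The 16-bit token size, the 4-bit floor, and the names `offset_bits` and `length_bits` are assumptions made for illustration; the abstract states only that tokens are of like size and that the offset field is variable-sized.

```python
TOKEN_BITS = 16       # assumed fixed copy-token size; the abstract only says "like size"
MIN_OFFSET_BITS = 4   # assumed floor for the offset field, not taken from the patent text


def offset_bits(position: int) -> int:
    """Bits needed to encode any backward offset from `position` in a block.

    Valid offsets are 1..position (back to the start of the block). Storing
    offset - 1 therefore needs enough bits to represent position - 1.
    """
    if position < 1:
        raise ValueError("position 0 has no earlier data to copy from")
    return max(MIN_OFFSET_BITS, (position - 1).bit_length())


def length_bits(position: int) -> int:
    """Bits left over for the length field in the fixed-size token."""
    return TOKEN_BITS - offset_bits(position)


if __name__ == "__main__":
    # Early positions spend few bits on the offset and more on the length.
    for pos in (1, 16, 17, 256, 257, 4096):
        print(pos, offset_bits(pos), length_bits(pos))
```

Under these assumptions, a token near the start of a block can describe a long match with a short offset, while a token near the end of a 4096-byte block trades length range for offset range.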

34 Claims

1. In a computer system, a method for compressing a sequence of data, comprising:
- (a) dividing the sequence of data into a series of blocks;
  
  (b) identifying a pattern of data located at a given location in a block that also occurs earlier in the data at a previous location in the block;
  
  (c) encoding the pattern of data at the given location in the block as a copy token having a fixed number of bits, wherein said copy token includes an offset field that identifies an offset between the pattern of data at the given location in the block and the pattern of data at the previous location in the block at which the pattern of data also occurred and wherein how many bits that are included in the offset field depends upon the offset between the given location in the block and the previous location in the block for the pattern of data.
- Dependent Claims (2, 3, 4, 5, 6)
- - 2. The method of claim 1 wherein the number of bits that are included in the offset field are at least equal to a minimum number of bits required to encode an offset from a start of the given location in the block to a start of the data in the block.
  - 3. The method of claim 1 wherein the pattern of data at the previous location in the block must have at least a minimum number of bytes in order for the pattern of data at the given location in the block to be encoded as a copy token.
  - 4. The method of claim 1 wherein the copy token further includes a length field that encodes a length of the pattern of data at the previous location in the block.
  - 5. The method of claim 1 wherein the pattern of data at the previous location in the block includes multiple bytes of data.
  - 6. The method of claim 1 wherein identifying the pattern of data at the given location in the block further comprises:
    - (a) calculating a hash value of the pattern of data using a hash function;
      
      (b) using the hash value as an index to a hash table to locate a hash table entry; and
      
      (c) examining a pointer in the hash table entry to locate the occurrence of the pattern of data at the previous location in the block.
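
Steps (a) through (c) of claim 6 can be sketched as a hash-table lookup that maps each short pattern to the most recent position at which it occurred. The hash function, the table size, the 3-byte pattern length, and the names `hash3` and `find_previous` are hypothetical choices for illustration; the claim does not specify any of them.

```python
MIN_MATCH = 3  # assumed pattern length used for hashing


def hash3(block: bytes, i: int, table_size: int) -> int:
    """Hypothetical hash of the 3 bytes starting at index i."""
    return (block[i] * 961 + block[i + 1] * 31 + block[i + 2]) % table_size


def find_previous(block: bytes, table_size: int = 4096):
    """Return (position, previous_position) pairs for repeated 3-byte patterns."""
    table = [None] * table_size  # each entry holds a pointer (index) into the block
    matches = []
    for i in range(len(block) - MIN_MATCH + 1):
        h = hash3(block, i, table_size)          # (a) calculate the hash value
        prev = table[h]                          # (b) use it as a hash table index
        # (c) follow the pointer and verify, since different patterns can collide
        if prev is not None and block[prev:prev + MIN_MATCH] == block[i:i + MIN_MATCH]:
            matches.append((i, prev))
        table[h] = i                             # remember the latest occurrence
    return matches
```

For example, `find_previous(b"abcxabc")` reports that the pattern at position 4 previously occurred at position 0. Verifying the bytes after the lookup is a design choice in this sketch so that a hash collision never yields a false match.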

7. In a computer system, a method of compressing a sequence of data into chunks of compressed data, comprising:
- (a) dividing the sequence of data into a series of blocks;
  
  (b) processing a first portion of the data at a location in a block to identify at least one sub-sequence of data in the first portion to compress;
  
  (c) determining another sub-sequence of data to be compressed at another location in a second portion of the block;
  
  (d) determining whether at least part of the other sub-sequence of data in the second portion of the block matches at least part of the sub-sequence of data in the first portion of the block;
  
  (e) where at least part of the other sub-sequence of data in the second portion of the block does not match at least part of the sub-sequence of data in the first portion of the block, encoding the not matched part of the other sub-sequence of data in the second portion of the block as a literal token, said literal token being added to a chunk of compressed data that is associated with the block; and
  
  (f) where at least part of the other sub-sequence of data in the second portion of the block does match at least part of the sub-sequence of data in the first portion of the block, encoding the matched part of the other sub-sequence of data in the second portion of the block into a copy token with a fixed number of bits that is added to the chunk of compressed data associated with the block the copy token including an offset field that identifies an offset between the other location of the matched part of the other sub-sequence of data in the second portion of the block and the location of the matched part of the sub-sequence of data in the first portion of the block, and a length field that identifies a length of the matched part of the sub-sequence of data in the first portion of the block, wherein how many bits are in the offset field depends on the position of the matched part at the other location of the other sub-sequence of data in the second portion of the block.
- Dependent Claims (8, 9, 10, 11)
- - 8. The method of claim 7 wherein a number of bits in the offset field equals a minimum number of bits required to encode an offset from a start of the matched part of the other sub-sequence of data in the second portion of the block to a start of the matched part of the sub-sequence of data in the first portion of the block.
  - 9. The method of claim 7 wherein the matched part of the other sub-sequence of data in the second portion of the block includes at least a threshold quantity of data in order for the matched part of the other sub-sequence of data in the second portion of the block to be encoded as the copy token.
  - 10. The method of claim 9 wherein the matched part of the other sub-sequence of data in the second portion of the block includes multiple bytes of data.
  - 11. The method of claim 7 wherein the sub-sequence of data in the first portion of the block is a contiguous sub-sequence of data.
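
The literal-versus-copy decision in steps (e) and (f), together with the threshold of claim 9, can be sketched as a greedy tokenizer. This is a simplified illustration, not the claimed implementation: it uses a brute-force search, represents tokens as Python tuples rather than packed bits, permits a match to overlap the current position (common in LZ77-style coders, but an assumption here), and the name `tokenize` and the 3-byte threshold are hypothetical.

```python
def tokenize(block: bytes, min_match: int = 3):
    """Split a block into ("lit", byte) and ("copy", offset, length) tokens."""
    tokens = []
    pos = 0
    while pos < len(block):
        best_len, best_off = 0, 0
        # Brute-force search of every earlier start position in the block.
        for start in range(pos):
            length = 0
            while (pos + length < len(block)
                   and block[start + length] == block[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - start
        if best_len >= min_match:
            # Matched part meets the threshold: emit a copy token.
            tokens.append(("copy", best_off, best_len))
            pos += best_len
        else:
            # No sufficiently long match: emit the byte as a literal token.
            tokens.append(("lit", block[pos]))
            pos += 1
    return tokens
```

For the input `b"abcabcabc"`, this sketch emits three literal tokens followed by a single copy token with offset 3 and length 6.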

12. In a computer system, a method of compressing a file comprising pieces of data, comprising:
- (a) dividing the file into blocks of data;
  
  (b) separately compressing each block of data into a chunk of compressed data by performing the following;
  
  (i) sequentially examining each sub-sequence of data in each block of data;
  
  (ii) encoding each subsequent sub-sequence of data in a block that is over a minimum threshold length and that has occurred previously in the block as a copy token of a predetermined number of bits, each copy token including an offset field that specifies an offset between the subsequent occurrence of the sub-sequence of data at a location in the block and another location in the block that the sub-sequence of data first occurred and a length field that specifies a length of the first occurrence of the sub-sequence of data at the other location in the block, wherein how many bits are used in the offset field depends on the location of the subsequent occurrence of the sub-sequence of data within the block; and
  
  (iii) encoding each sub-sequence of data in the block, that is not encoded as a copy token, as a literal token in the compressed chunk.
- Dependent Claims (13, 14, 15)
- - 13. The method of claim 12, further comprising:
    - (a) for each chunk of compressed data, aggregating the copy tokens and the literal tokens into at least one group of sequentially contiguous tokens, each group of sequentially continuous tokens including at most a predetermined number of the copy tokens and the literal tokens; and
      
      (b) adding a mask to each chunk of compressed data for each group of sequentially contiguous tokens wherein the mask is associated with the group and identifies each token in the group as a literal token or a copy token.
  - 14. The method of claim 13 wherein each literal token and each copy token for each group has a corresponding bit in the mask that is associated with the group.
  - 15. The method of claim 13 wherein each associated mask is at least a byte in length.
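
The grouping and mask described in claims 13 through 15 can be sketched as follows. The group size of 8 follows the abstract; the bit ordering (least significant bit first), the convention that a set bit marks a copy token, the tuple token representation, and the name `group_tokens` are assumptions for illustration.

```python
def group_tokens(tokens, group_size: int = 8):
    """Return (mask, group) pairs, one mask per group of at most group_size tokens.

    Bit i of the mask is 1 when token i of the group is a copy token and 0
    when it is a literal token (assumed convention).
    """
    groups = []
    for i in range(0, len(tokens), group_size):
        group = tokens[i:i + group_size]
        mask = 0
        for bit, token in enumerate(group):
            if token[0] == "copy":
                mask |= 1 << bit
        groups.append((mask, group))
    return groups
```

With three literal tokens followed by one copy token, the single group's mask is `0b00001000`. A decoder reading the mask first knows how many bytes each following token occupies without any per-token tag.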

16. In a computer system, a method of compressing a sequence of blocks of data, comprising the computer-implemented steps of:
- (a) compressing a first sub-sequence of data at a first location in a block of data by encoding the first sub-sequence of data as a first copy token having a fixed number of bits, said first copy token including an offset field that has a first number of bits and that encodes an offset between the first location and a previous occurrence of the first sub-sequence of data in the block of data; and
  
  (b) compressing a second sub-sequence of data at a second location in the block of data as a second copy token having the fixed number of bits, said second copy token having another offset field that has a second number of bits that differs from the first number of bits and that encodes an offset between the second location in the block of data and a previous occurrence of the second sub-sequence of data in the block of data.
- Dependent Claims (17, 18)
- - 17. The method of claim 16 wherein the first number of bits in the offset field of the first copy token equals at least a minimum number of bits required to produce an offset between the first sub-sequence of data and a beginning of the block of data.
  - 18. The method of claim 16 wherein the second number of bits in the other offset field of the second copy token equals at least a minimum number of bits required to encode an offset between the second sub-sequence of data and a beginning of the block of data.
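
Claim 16 describes two copy tokens of the same fixed size whose offset fields differ in width. A minimal sketch of such packing is shown below, assuming a 16-bit token, a 4-bit floor on the offset field, a minimum match length of 3, and blocks of at most 4096 bytes; the field layout (offset in the high bits, length in the low bits) and the names `pack_copy` and `unpack_copy` are hypothetical.

```python
TOKEN_BITS = 16  # assumed fixed token size
MIN_MATCH = 3    # assumed minimum match length


def offset_bits(position: int) -> int:
    """Offset-field width for a copy token emitted at `position` (assumed rule)."""
    return max(4, (position - 1).bit_length())


def pack_copy(position: int, offset: int, length: int) -> int:
    """Pack (offset, length) into one fixed-size token; the split depends on position."""
    len_bits = TOKEN_BITS - offset_bits(position)
    if not 1 <= offset <= position:
        raise ValueError("offset must point inside the block")
    if not MIN_MATCH <= length < MIN_MATCH + (1 << len_bits):
        raise ValueError("length does not fit in the length field")
    return ((offset - 1) << len_bits) | (length - MIN_MATCH)


def unpack_copy(position: int, token: int):
    """Recover (offset, length); the decoder derives the split from position alone."""
    len_bits = TOKEN_BITS - offset_bits(position)
    offset = (token >> len_bits) + 1
    length = (token & ((1 << len_bits) - 1)) + MIN_MATCH
    return offset, length


first = pack_copy(10, 3, 6)         # 4 offset bits, 12 length bits
second = pack_copy(1000, 900, 20)   # 10 offset bits, 6 length bits
```

Both `first` and `second` fit in 16 bits even though their offset fields are 4 and 10 bits wide respectively. Because the width is a function of position, no extra bits are spent signalling where the fields divide.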

19. In a computer system, a method of decompressing an item in a chunk of a compressed file having a number of separate chunks, comprising:
- (a) identifying which chunk of the compressed file holds the item, said identified chunk including like-sized copy tokens and also including literal tokens;
  
  (b) decompressing the chunk of the compressed file that has been identified as holding the item while keeping other chunks compressed, said decompressing comprising;
  
  (i) identifying a first of the copy tokens that encodes a current sub-sequence of data that includes the item, said first copy token including an offset field that specifies an offset between the current sub-sequence of data and a previous occurrence of the sub-sequence of data that is included in a sequence of literal tokens;
  
  (ii) identifying how many bits are in the offset field by identifying a location of a first of the sequence of literal tokens that encode the previous occurrence of the sub-sequence of data; and
  
  (iii) decompressing the first copy token by replacing the first copy token with the previous occurrence of the sub-sequence of data that the sequence of literal tokens encode.

20. In a computer system, a method of decompressing a sequence of chunks of compressed data containing copy tokens and literal tokens, wherein the copy tokens each contain an offset field and a number of bits in the offset field that depends upon a location in a sequence of data prior to compressing the sequence of data into the sequence of chunks of compressed data, said method comprising:
- for each of the copy tokens,
  
  (i) identifying a number of bits in the offset field by determining a location of a sub-sequence of data prior to encoding by the copy token;
  
  (ii) using the identified offset field to locate the sub-sequence of data that is encoded by the copy token;
  
  (iii) replacing the copy token with the sub-sequence of data that is encoded by the copy token; and
  
  for each literal token, keeping the literal token in the sequence.
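
The decompression loop of claim 20 can be sketched as a decoder that infers each offset-field width from how much data it has already produced. The chunk layout used here (a flag byte followed by up to 8 tokens, one-byte literal tokens, 16-bit little-endian copy tokens, a set flag bit meaning "copy") is an assumption for illustration, as are the 4-bit floor, the minimum match of 3, and the name `decompress_chunk`.

```python
TOKEN_BITS = 16  # assumed fixed copy-token size
MIN_MATCH = 3    # assumed minimum match length


def offset_bits(position: int) -> int:
    """Offset-field width implied by the current output position (assumed rule)."""
    return max(4, (position - 1).bit_length())


def decompress_chunk(chunk: bytes) -> bytes:
    """Decode one chunk of flag bytes, literal tokens, and copy tokens."""
    out = bytearray()
    i = 0
    while i < len(chunk):
        flags = chunk[i]
        i += 1
        for bit in range(8):
            if i >= len(chunk):
                break
            if flags & (1 << bit):
                # Copy token: the field split depends on how much is decoded so far.
                token = chunk[i] | (chunk[i + 1] << 8)
                i += 2
                len_bits = TOKEN_BITS - offset_bits(len(out))
                offset = (token >> len_bits) + 1
                length = (token & ((1 << len_bits) - 1)) + MIN_MATCH
                if offset > len(out):
                    raise ValueError("copy token points before the block start")
                for _ in range(length):
                    out.append(out[-offset])  # byte-wise copy allows overlap
            else:
                # Literal token: the byte is kept as-is.
                out.append(chunk[i])
                i += 1
    return bytes(out)


# Hand-built chunk: three literals, then a copy token (offset 3, length 6).
example = bytes([0b00001000]) + b"abc" + bytes([0x03, 0x20])
```

Under the assumed layout, `decompress_chunk(example)` reproduces `b"abcabcabc"` from six bytes of compressed data.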

21. A computer system comprising:
- (a) a storage for storing a sequence of data; and
  
  (b) a compression engine for compressing the sequence of data into at least one chunk of compressed data that includes a sequence of copy tokens and literal tokens, each copy token encoding a sub-sequence of data as a copy of a like sub-sequence of data that has previously occurred in the sequence of data and each literal token encoding a literal piece of data wherein each copy token is of like size and includes a variable-sized offset field having a number of bits that is based on the location of the sub-sequence encoded by the copy token in the sequence of data.
- Dependent Claims (22)
- - 22. The computer system of claim 21 wherein the compression engine further comprises:
    - (a) a token grouper for grouping sequentially occurring literal tokens and copy tokens into groups in the compressed block; and
      
      (b) a mask generator for generating a mask for each group, each bit in each mask corresponding to one of the literal tokens or the copy tokens in the group and identifying each corresponding token as a literal token or a copy token.

23. A computer system comprising:
- (a) a storage for storing at least one chunk of compressed data, each chunk of compressed data including copy tokens that encode copies of previously occurring sub-sequences of data and literal tokens that literally encode pieces of data, wherein the copy tokens are all of a like size and include a variable-length offset field that encodes an offset to a previous occurrence of the sub-sequence of data; and
  
  (b) a decompression engine for decompressing each chunk of compressed data into a decompressed sequence of data, said decompression engine including a copy token decompressor for decompressing the copy tokens, said copy token decompressor identifying how many bits are in the offset fields of each copy token based on a location of the previous occurrence of the sub-sequence of data prior to encoding as the copy token.

24. A computer-readable storage media comprising:
- a compression engine for compressing a sequence of data into at least one chunk of compressed data, each chunk including a sequence of copy tokens and literal tokens, each copy token encoding a sub-sequence of data as a copy of an identical sub-sequence of data that has previously occurred in the decompressed sequence of data and each literal token encoding a literal piece of data wherein each copy token is of like size and includes a variable-sized offset field having a number of bits that is based on a location in the sequence of data of a subsequent occurrence of the identical sub-sequence of data prior to the encoding of the subsequent occurrence of the identical sub-sequence as the copy token.

25. A computer-readable storage media comprising:
- a decompression engine for decompressing a chunk of compressed data into a block of data, said decompression engine including a copy token decompressor for decompressing the copy tokens, said copy token decompressor identifying how many bits are in the offset fields of each copy token based on a previous location in the data of an identical sub-sequence of data prior to the encoding of a subsequent occurrence of the identical sub-sequence of data as the copy token.

26. In a computer system, a method for compressing a sequence of data, comprising:
- (a) dividing the sequence of the data into a series of consecutive blocks, each block having a predetermined amount of data;
  
  (b) compressing each block of data into a chunk of compressed data, the compression of a block of data comprising;
  
  (i) sequentially examining data in the block to identify a pattern of data;
  
  (ii) encoding a copy token to represent each subsequent occurrence of each pattern in the block of data that has initially occurred at a previous location in the block of data;
  
  (iii) encoding a literal token to represent each initial occurrence of each pattern in the block of data; and
  
  (iv) producing a sequence of each literal token and each copy token.
- Dependent Claims (27, 28, 29, 30, 31, 32, 33, 34)
- - 27. The method of claim 26, wherein the dividing of the sequence of the data into a series of consecutive blocks, further comprises selecting a size for the predetermined amount of data in each block so that the compression of the sequence of data is optimized.
  - 28. The method of claim 26, further comprises decompressing each chunk of compressed data, comprising:
    - (a) replacing each literal token with the initial occurrence of the pattern in the block of data that is represented by the literal token; and
      
      (b) replacing each copy token with the subsequent occurrence of the pattern in the block of data that is represented by the copy token, so that the sequence of data is reproduced from the chunks of compressed data.
  - 29. The method of claim 28, wherein decompressing each chunk of compressed data, further comprises enabling decompression of each chunk of compressed data that includes an item and not decompressing each chunk of compressed data that does not include the item, so that random access to the item in the compressed data is provided to a user.
  - 30. The method of claim 29, wherein the dividing of the sequence of the data into a series of consecutive blocks, further comprises selecting a size for the predetermined amount of data in each block so that random access to the item is optimized.
  - 31. The method of claim 26, wherein the subsequent occurrence of the pattern in the block of data that has initially occurred at the previous location in the block of data has a length of at least three bytes.
  - 32. The method of claim 26, wherein the copy token further comprises an offset field that identifies an offset between the subsequent occurrence of the pattern in the block of data and the initial occurrence of the pattern in the block of data.
  - 33. The method of claim 32, wherein the offset field contains the minimum amount of bits required to encode the offset between a start of the subsequent occurrence of the pattern in the block of data and a start of the block of data.
  - 34. The method of claim 26, wherein sequentially examining data in the block to identify a pattern of data further comprises:
    - (a) calculating a hash value of the pattern using a hash function;
      
      (b) using the hash value as an index to a hash table to locate a hash table entry; and
      
      (c) examining a pointer in the hash table entry to locate the occurrence of the pattern at the previous location in the block of data.
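
The random-access behaviour of claims 29 and 30 follows from compressing each block independently: only the chunk that holds the requested item needs to be decompressed. The sketch below shows that access pattern only. It substitutes Python's standard `zlib` for the per-block compressor, which is not the token scheme claimed here, and the 4096-byte block size and the names `compress_blocks` and `read_byte` are assumptions.

```python
import zlib

BLOCK_SIZE = 4096  # assumed predetermined amount of data per block


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress each block separately into its own chunk (zlib as a stand-in)."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def read_byte(chunks, index: int, block_size: int = BLOCK_SIZE) -> int:
    """Return the byte at `index`, decompressing only the chunk that holds it."""
    chunk_no, within = divmod(index, block_size)
    block = zlib.decompress(chunks[chunk_no])  # other chunks stay compressed
    return block[within]


data = bytes(range(256)) * 64          # 16384 bytes, so four blocks
chunks = compress_blocks(data)
value = read_byte(chunks, 5000)        # touches only the second chunk
```

Smaller blocks reduce the work per lookup but give the compressor less history to match against, which is the trade-off behind selecting the block size in claims 27 and 30.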

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Kimura, Gary D.
Primary Examiner(s): Hoff, Marc
Assistant Examiner(s): JEAN PIERRE, PEGUY
Application Number: US08/827,926
Time in Patent Office: 1,350 Days
Field of Search: 341/50, 341/51, 341/87, 341/65, 341/67, 341/106, 341/107
US Class Current: 341/87
CPC Class Codes: H03M 7/3086 employing a sliding window,...
