Text file compression system utilizing word terminators
First Claim
1. A method for compressing a text file representing a character-based document, the text file comprising a succession of character bytes, wherein each character byte is a collection of bits having a binary value representing a character, the method comprising the steps of:
- identifying some types of said characters bytes as word terminators such that said text file may be treated as a sequence of words, wherein each word is a sequence of character bytes beginning other than with a word terminator and including one or more word terminators only as ending characters thereof;
generating a main dictionary comprising a plurality of entries, each main dictionary entry containing a unique dictionary word such that for each word of the text file there is a main dictionary entry containing a matching main dictionary word; and
generating data identifying a sequence of said main dictionary entries matching said sequence of word.
7 Assignments
0 Petitions
Accused Products
Abstract
A system for compressing an ASCII or similarly encoded text file creates an alphabetically ordered main dictionary listing all unique words appearing in the text file. A text file "word" is defined as a sequence of characters ending with one or more "word terminators" such as spaces, commas, periods and carriage returns. The compression system also creates a common word dictionary referencing words most often encountered in the text file. The sequence of words forming the text file is represented by a word index, a list of one byte and two byte references to common and main dictionary words, respectively. The system compresses the main dictionary using three complementary techniques. First, leading characters of each dictionary word matching leading characters of a next preceding dictionary word are represented by data indicating the number of matching characters. Second, commonly encountered dictionary word suffixes are represented by data referencing entries of a small suffix dictionary. Third, remaining characters of main dictionary words are represented by bytes encoded to represent commonly encountered characters and groups of characters. The system also compresses style data structures often included in word processing text files.
372 Citations
35 Claims
-
1. A method for compressing a text file representing a character-based document, the text file comprising a succession of character bytes, wherein each character byte is a collection of bits having a binary value representing a character, the method comprising the steps of:
-
identifying some types of said characters bytes as word terminators such that said text file may be treated as a sequence of words, wherein each word is a sequence of character bytes beginning other than with a word terminator and including one or more word terminators only as ending characters thereof; generating a main dictionary comprising a plurality of entries, each main dictionary entry containing a unique dictionary word such that for each word of the text file there is a main dictionary entry containing a matching main dictionary word; and generating data identifying a sequence of said main dictionary entries matching said sequence of word. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
-
-
13. A method for compressing a text file representing a character-based document, the text file comprising a succession of character bytes, wherein each character byte is a collection of bits having a binary value representing a character, the method comprising the steps of:
-
identifying some types of said characters bytes as word terminators such that said text file may be treated as a sequence of words, wherein each word is a sequence of character bytes beginning other than with a word terminator and including one or more word terminators only as ending characters thereof; generating a main dictionary comprising a plurality of entries, each main dictionary entry containing a unique dictionary word such that for each word of the text file there is main dictionary entry containing a matching main dictionary word, the main dictionary entries being ordered so as to maximize a number of leading character bytes a word of each main dictionary entry, other than a first main dictionary entry, has in common with a word contained in its next preceding main dictionary entry; generating a common word dictionary comprising a plurality of common word entries, each common word entry containing a reference to a separate one of said main dictionary entries; generating a word index comprising a sequence of references to common word dictionary entries and to main dictionary entries, wherein each reference to a main dictionary entry consists of upper and lower bytes having a collective value identifying a main dictionary entry and wherein each reference to a common word dictionary entry consists of one byte having a value identifying a common word dictionary entry wherein upper bytes of all references to main dictionary entries have values within a first set of values, wherein one byte references to all common word dictionary entries have values within a second set of values, and wherein said first and second sets of values are non-overlapping. - View Dependent Claims (14, 15, 16, 17, 18)
-
-
19. A method for compressing a text file representing a character-based document, the text file including a succession of character bytes, wherein each character byte is a collection of bits having a binary value representing a character, the text file also including a first style data structure comprising a list of corresponding position and style data values, wherein each position data value indicates a number of characters from a first character of said document at which a character style change occurs and wherein the corresponding style data value indicates a character style to which a change is made, the method comprising the steps of:
-
generating data identifying a sequence of said main dictionary entries matching said sequence of words; generating a style dictionary comprising a plurality of entries, each style data dictionary entry containing a unique style data value such that for each style data value of the first style data structure there is style data dictionary entry containing a matching style data value; and generating a second style data structure comprising a list of corresponding distance and style index data values derived from the position and style data values of the first style data structure, wherein each distance data value indicates a number of characters between one style change and a next style change in said document and wherein its corresponding style index data value references a style dictionary entry. - View Dependent Claims (20)
-
-
21. A method for transmitting a text file representing a character-based document from a first computer to a second computer, the text file comprising a succession of character bytes, wherein each character byte is a collection of bits having a binary value representing a character, the method comprising the steps of:
-
said first computer identifying some types of said characters bytes as word terminators such that said text file may be treated as a sequence of words, wherein each word is a sequence of character bytes beginning other than with a word terminator and including one or more word terminators only as ending characters thereof; said first computer generating a main dictionary comprising a plurality of main dictionary entries, each main dictionary entry containing a unique dictionary word such that for each word of the text file there is main dictionary entry containing a matching main dictionary word; said first computer generating a common word dictionary comprising a plurality of common word entries, each common word entry containing a reference to a separate one of said main dictionary entries; said first computer generating a word index including references to said main dictionary entries and to said common dictionary entries; said first computer generating a compressed main dictionary wherein for each main dictionary entry containing a dictionary word there is a compressed main dictionary entry containing data representing the dictionary word in a more compact form than the dictionary word itself; said first computer transmitting said compressed main dictionary, said common word dictionary, and said word index to said second computer; and said second computer recreating said text file in response to said compressed main dictionary, said common word dictionary, and word index. - View Dependent Claims (22, 23, 24, 25, 26)
-
-
27. An apparatus for compressing a text file comprising
means for creating a main dictionary listing all unique words of the text file, means for creating a common word dictionary referencing the most commonly encountered words in the text file, wherein each word is a sequence of character bytes beginning other than with a word terminator and including one or more word terminators only as ending characters thereof, and means for creating a word index listing references to common and main dictionary words.
-
31. A method for generating a compressed data file representing a text file in more compact form, the text file comprising a first sequence of words, each word formed by at least one text character, the method comprising the steps of:
-
generating a dictionary comprising a plurality of entries, each dictionary entry defining a unique word of the text file; storing in said compressed data file a first type code and a first length code, storing said dictionary in said compressed data file following said first type code and said first length code, wherein said first type code indicates said dictionary follows, and wherein said first length code indicates a length of said dictionary; generating a word index comprising a second sequence of reference numbers, a reference number at each position of said second sequence referencing a dictionary entry defining a correspondingly positioned word of said first sequence; storing in said compressed data file a second type code and a second length code; and storing said word index in said compressed data file following said second type code and said second length code, wherein said second type code indicates that said word index follows, and wherein said second length code indicates a length of said word index. - View Dependent Claims (32, 33, 34, 35)
-
Specification