Text file compression system utilizing word terminators

US 5,999,949 A
Filed: 03/14/1997
Issued: 12/07/1999
Est. Priority Date: 03/14/1997
Status: Expired due to Term

Abstract

A system for compressing an ASCII or similarly encoded text file creates an alphabetically ordered main dictionary listing all unique words appearing in the text file. A text file "word" is defined as a sequence of characters ending with one or more "word terminators" such as spaces, commas, periods and carriage returns. The compression system also creates a common word dictionary referencing words most often encountered in the text file. The sequence of words forming the text file is represented by a word index, a list of one byte and two byte references to common and main dictionary words, respectively. The system compresses the main dictionary using three complementary techniques. First, leading characters of each dictionary word matching leading characters of a next preceding dictionary word are represented by data indicating the number of matching characters. Second, commonly encountered dictionary word suffixes are represented by data referencing entries of a small suffix dictionary. Third, remaining characters of main dictionary words are represented by bytes encoded to represent commonly encountered characters and groups of characters. The system also compresses style data structures often included in word processing text files.
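
The word model in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the terminator set is an assumption drawn from the examples the abstract names, the function names are hypothetical, and a run of terminators at the very start of the text is kept as its own token so that the round trip stays lossless.

```python
# Assumed terminator set: the abstract names spaces, commas, periods
# and carriage returns; line feeds are added here for convenience.
TERMINATORS = " ,.\r\n"


def split_words(text, terminators=TERMINATORS):
    """Split text into 'words': a run of non-terminators plus its trailing terminators."""
    words = []
    start = 0
    i = 0
    n = len(text)
    while i < n:
        # Body of the word: characters that are not terminators.
        while i < n and text[i] not in terminators:
            i += 1
        # Tail of the word: the terminators that end it.
        while i < n and text[i] in terminators:
            i += 1
        words.append(text[start:i])
        start = i
    return words


def build_dictionary_and_index(text):
    """Return (alphabetically ordered unique words, list of dictionary positions)."""
    words = split_words(text)
    main_dictionary = sorted(set(words))
    position = {word: i for i, word in enumerate(main_dictionary)}
    word_index = [position[word] for word in words]
    return main_dictionary, word_index


def rebuild(main_dictionary, word_index):
    """Recreate the original text from the dictionary and the word index."""
    return "".join(main_dictionary[i] for i in word_index)
```

Because every word carries its own terminators, concatenating the referenced dictionary words reproduces the text exactly, with no separate record of spacing or punctuation.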

372 Citations

35 Claims

1. A method for compressing a text file representing a character-based document, the text file comprising a succession of character bytes, wherein each character byte is a collection of bits having a binary value representing a character, the method comprising the steps of:
- identifying some types of said character bytes as word terminators such that said text file may be treated as a sequence of words, wherein each word is a sequence of character bytes beginning other than with a word terminator and including one or more word terminators only as ending characters thereof;
  
  generating a main dictionary comprising a plurality of entries, each main dictionary entry containing a unique dictionary word such that for each word of the text file there is a main dictionary entry containing a matching main dictionary word; and
  
  generating data identifying a sequence of said main dictionary entries matching said sequence of words.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
- - 2. The method in accordance with claim 1 further comprising the step of generating a common word dictionary comprising a plurality of common word entries, each common word entry containing a reference to a separate one of said main dictionary entries.
  - 3. The method in accordance with claim 2 wherein the step of generating data identifying said sequence of main dictionary entries comprises the step of generating a word index comprising a sequence of references to common word dictionary entries and to main dictionary entries.
  - 4. The method in accordance with claim 3 wherein each reference to a main dictionary entry consists of an upper byte and a lower byte having a collective value identifying a main dictionary entry and wherein each reference to a common word dictionary entry consists of one byte having a value identifying a common word dictionary entry.
  - 5. The method in accordance with claim 4 wherein upper bytes of all references to main dictionary entries have values within a first set of values, wherein one byte references to all common word dictionary entries have values within a second set of values, and wherein said first and second sets of values are non-overlapping.
  - 6. The method in accordance with claim 1 further comprising the step of generating for each main dictionary entry containing a dictionary word, a compressed main dictionary entry containing data representing the dictionary word in a more compact form than is represented by the main dictionary entry.
  - 7. The method in accordance with claim 1 further comprising the step of ordering said main dictionary entries so as to maximize a number of leading character bytes a word of each main dictionary entry, other than a first main dictionary entry, has in common with a word contained in its next preceding main dictionary entry.
  - 8. The method in accordance with claim 7 further comprising the step of generating a separate compressed dictionary entry corresponding to each main dictionary entry, wherein each entry of the compressed dictionary contains first data indicating a number of character bytes a word contained in a main dictionary preceding the corresponding main dictionary entry has in common with a word contained in the corresponding main dictionary entry.
  - 9. The method in accordance with claim 8 wherein each entry of the compressed dictionary also contains second data indicating whether a word contained in the corresponding main dictionary entry includes one of a limited set of common suffixes, wherein a suffix is a sequence of character bytes in a word immediately preceding its one or more word terminators.
  - 10. The method in accordance with claim 1 further comprising the step of generating a separate compressed dictionary entry corresponding to each main dictionary entry, wherein each of the compressed dictionary entries contains a sequence of data values representing the word contained in its corresponding main dictionary entry, wherein a portion of the data values are encoded to represent individual character bytes and others are encoded to represent sequences of character bytes.
  - 11. The method in accordance with claim 1 further comprising the steps of:
    - ordering said main dictionary entries so as to maximize a number of leading character bytes a word of each main dictionary entry, other than a first main dictionary entry, has in common with a word contained in its next preceding main dictionary entry; and
      
      generating a separate compressed dictionary entry corresponding to each main dictionary entry, wherein entries of the compressed dictionary comprise;
      
      a first data value indicating a number of character bytes a word contained in a main dictionary preceding the corresponding main dictionary entry has in common with a word contained in the corresponding main dictionary entry;
      
      a second data value indicating whether a word contained in the corresponding main dictionary entry includes one of a limited set of common suffixes, wherein a suffix is a sequence of character bytes in a word immediately preceding its one or more word terminators; and
      
      third data values each encoded to represent individual character bytes of a word contained in a corresponding main dictionary entry.
  - 12. The method in accordance with claim 11 wherein entries of the compressed main dictionary further comprise fourth data values each encoded to represent sequences of character bytes included in a word contained in a corresponding main dictionary entry.
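
The "first data value" of claims 8 and 11 (the number of leading character bytes shared with the preceding entry) resembles what is commonly called front coding. The sketch below is a minimal illustration under stated assumptions: the dictionary is simply sorted, as the abstract's "alphabetically ordered" suggests, and the function names and the `(shared_count, remainder)` tuple layout are hypothetical rather than taken from the patent.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_code(sorted_words):
    """Replace each word's shared leading characters with a count.

    Each entry is (shared_count, remainder), where shared_count is the number
    of leading characters in common with the preceding word.
    """
    entries = []
    previous = ""
    for word in sorted_words:
        shared = common_prefix_len(previous, word)
        entries.append((shared, word[shared:]))
        previous = word
    return entries


def front_decode(entries):
    """Rebuild the word list from (shared_count, remainder) entries."""
    words = []
    previous = ""
    for shared, remainder in entries:
        word = previous[:shared] + remainder
        words.append(word)
        previous = word
    return words
```

Sorting places words with long shared prefixes next to each other, which is why the ordering step and the prefix count are claimed together.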

13. A method for compressing a text file representing a character-based document, the text file comprising a succession of character bytes, wherein each character byte is a collection of bits having a binary value representing a character, the method comprising the steps of:
- identifying some types of said character bytes as word terminators such that said text file may be treated as a sequence of words, wherein each word is a sequence of character bytes beginning other than with a word terminator and including one or more word terminators only as ending characters thereof;
  
  generating a main dictionary comprising a plurality of entries, each main dictionary entry containing a unique dictionary word such that for each word of the text file there is a main dictionary entry containing a matching main dictionary word, the main dictionary entries being ordered so as to maximize a number of leading character bytes a word of each main dictionary entry, other than a first main dictionary entry, has in common with a word contained in its next preceding main dictionary entry;
  
  generating a common word dictionary comprising a plurality of common word entries, each common word entry containing a reference to a separate one of said main dictionary entries;
  
  generating a word index comprising a sequence of references to common word dictionary entries and to main dictionary entries, wherein each reference to a main dictionary entry consists of upper and lower bytes having a collective value identifying a main dictionary entry and wherein each reference to a common word dictionary entry consists of one byte having a value identifying a common word dictionary entry wherein upper bytes of all references to main dictionary entries have values within a first set of values, wherein one byte references to all common word dictionary entries have values within a second set of values, and wherein said first and second sets of values are non-overlapping.
- View Dependent Claims (14, 15, 16, 17, 18)
- - 14. The method in accordance with claim 13 further comprising the step of generating a separate compressed dictionary entry corresponding to each main dictionary entry, wherein each entry of the compressed dictionary contains first data indicating a number of character bytes a word contained in a main dictionary preceding the corresponding main dictionary entry has in common with a word contained in the corresponding main dictionary entry.
  - 15. The method in accordance with claim 14 wherein each entry of the compressed dictionary also contains second data indicating whether a word contained in the corresponding main dictionary entry includes one of a set of suffixes, wherein a suffix is a sequence of character bytes in a word immediately preceding its one or more word terminators.
  - 16. The method in accordance with claim 13 further comprising the step of generating a separate compressed dictionary entry corresponding to each main dictionary entry, wherein each of the compressed dictionary entries contains a sequence of data values representing the word contained in its corresponding main dictionary entry, wherein at least one data value of at least one compressed dictionary entry is encoded to represent sequences of character bytes.
  - 17. The method in accordance with claim 13 further comprising the steps of:
    - generating a separate compressed dictionary entry corresponding to each main dictionary entry, wherein each entry of the compressed dictionary contains a sequence of data values representing the word contained in the corresponding main dictionary entry, wherein said data values of at least one compressed dictionary entry comprise;
      
      a first data value indicating a number of character bytes a word contained in a main dictionary preceding the corresponding main dictionary entry has in common with a word contained in the corresponding main dictionary entry,
      
      a second data value indicating whether a word contained in the corresponding main dictionary entry includes one of a limited set of common suffixes, wherein a suffix is a sequence of character bytes in a word immediately preceding its one or more word terminators, and
      
      third data values each encoded to represent individual character bytes of a word contained in a corresponding main dictionary entry.
  - 18. The method in accordance with claim 17 wherein said data values of said one compressed dictionary entry further comprise fourth data values each encoded to represent sequences of character bytes included in a word contained in a corresponding main dictionary entry.
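
The mixed one-byte and two-byte word index of claims 4, 5 and 13 relies on the two value sets not overlapping, so a decoder can tell from the first byte how long each reference is. The sketch below assumes one particular split that the claims do not specify: byte values 0x00 to 0x7F for common word references and 0x80 to 0xFF for the upper byte of main dictionary references. All names are hypothetical.

```python
from collections import Counter

COMMON_LIMIT = 128  # assumed: common word references use byte values 0x00..0x7F


def build_common(word_index, limit=COMMON_LIMIT):
    """Common word dictionary: main dictionary references of the most frequent words."""
    if limit > COMMON_LIMIT:
        raise ValueError("limit exceeds the one-byte reference range assumed here")
    return [ref for ref, _ in Counter(word_index).most_common(limit)]


def encode_index(word_index, common):
    """Emit one byte for a common word, two bytes for any other main dictionary word."""
    slot = {ref: i for i, ref in enumerate(common)}
    out = bytearray()
    for ref in word_index:
        if ref in slot:
            out.append(slot[ref])           # second set: 0x00..0x7F
        else:
            if ref >= 0x8000:
                raise ValueError("main dictionary too large for a 15-bit reference")
            out.append(0x80 | (ref >> 8))   # first set: upper byte 0x80..0xFF
            out.append(ref & 0xFF)          # lower byte, any value
    return bytes(out)


def decode_index(data, common):
    """Recover main dictionary references; the first byte decides the reference length."""
    refs = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            refs.append(common[first])
            i += 1
        else:
            refs.append(((first & 0x7F) << 8) | data[i + 1])
            i += 2
    return refs
```

Under this assumed split, the 128 most frequent words cost one byte per occurrence and up to 32,768 main dictionary entries remain addressable with two bytes.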

19. A method for compressing a text file representing a character-based document, the text file including a succession of character bytes, wherein each character byte is a collection of bits having a binary value representing a character, the text file also including a first style data structure comprising a list of corresponding position and style data values, wherein each position data value indicates a number of characters from a first character of said document at which a character style change occurs and wherein the corresponding style data value indicates a character style to which a change is made, the method comprising the steps of:
- generating data identifying a sequence of said main dictionary entries matching said sequence of words;
  
  generating a style dictionary comprising a plurality of entries, each style data dictionary entry containing a unique style data value such that for each style data value of the first style data structure there is a style data dictionary entry containing a matching style data value; and
  
  generating a second style data structure comprising a list of corresponding distance and style index data values derived from the position and style data values of the first style data structure, wherein each distance data value indicates a number of characters between one style change and a next style change in said document and wherein its corresponding style index data value references a style dictionary entry.
- View Dependent Claims (20)
- - 20. The method in accordance with claim 19 further comprising the steps of:
    - identifying some types of said character bytes as word terminators such that said text file may be treated as a sequence of words, wherein each word is a sequence of character bytes beginning other than with a word terminator and including one or more word terminators only as ending characters thereof; and
      
      generating a main dictionary comprising a plurality of entries, each main dictionary entry containing a unique dictionary word such that for each word of the text file there is a main dictionary entry containing a matching main dictionary word.
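
The style conversion in claim 19 turns absolute (position, style) pairs into (distance, style index) pairs backed by a style dictionary. A minimal sketch follows, assuming that styles can be represented as hashable values (strings here) and that the first distance is measured from the start of the document; the function names are hypothetical.

```python
def compress_styles(style_runs):
    """Convert (position, style) pairs to a style dictionary plus (distance, index) pairs.

    style_runs must be ordered by position. Distances are counted from the
    previous style change, or from position 0 for the first entry (an assumption).
    """
    style_dictionary = []
    slot = {}
    compact = []
    previous = 0
    for position, style in style_runs:
        if style not in slot:
            slot[style] = len(style_dictionary)
            style_dictionary.append(style)
        compact.append((position - previous, slot[style]))
        previous = position
    return style_dictionary, compact


def expand_styles(style_dictionary, compact):
    """Rebuild the original (position, style) list by accumulating distances."""
    runs = []
    position = 0
    for distance, index in compact:
        position += distance
        runs.append((position, style_dictionary[index]))
    return runs
```

Distances between changes are usually much smaller than absolute positions, and a repeated style value collapses to a small index, which is where the saving would come from.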

21. A method for transmitting a text file representing a character-based document from a first computer to a second computer, the text file comprising a succession of character bytes, wherein each character byte is a collection of bits having a binary value representing a character, the method comprising the steps of:
- said first computer identifying some types of said character bytes as word terminators such that said text file may be treated as a sequence of words, wherein each word is a sequence of character bytes beginning other than with a word terminator and including one or more word terminators only as ending characters thereof;
  
  said first computer generating a main dictionary comprising a plurality of main dictionary entries, each main dictionary entry containing a unique dictionary word such that for each word of the text file there is a main dictionary entry containing a matching main dictionary word;
  
  said first computer generating a common word dictionary comprising a plurality of common word entries, each common word entry containing a reference to a separate one of said main dictionary entries;
  
  said first computer generating a word index including references to said main dictionary entries and to said common dictionary entries;
  
  said first computer generating a compressed main dictionary wherein for each main dictionary entry containing a dictionary word there is a compressed main dictionary entry containing data representing the dictionary word in a more compact form than the dictionary word itself;
  
  said first computer transmitting said compressed main dictionary, said common word dictionary, and said word index to said second computer; and
  
  said second computer recreating said text file in response to said compressed main dictionary, said common word dictionary, and word index.
- View Dependent Claims (22, 23, 24, 25, 26)
- - 22. The method in accordance with claim 21 wherein each reference in said word index to a main dictionary entry consists of upper and lower bytes having a collective value identifying a main dictionary entry and wherein each reference in said word index to a common word dictionary entry consists of one byte having a value identifying a common word dictionary entry, wherein upper bytes of all references to main dictionary entries have values within a first set of values, wherein one byte references to all common word dictionary entries have values within a second set of values, and wherein said first and second sets of values are non-overlapping.
  - 23. The method in accordance with claim 21 further comprising the step of:
    - said first computer ordering said main dictionary entries so as to maximize a number of leading character bytes a word of each main dictionary entry, other than a first main dictionary entry, has in common with a word contained in its next preceding main dictionary entry, wherein each entry of the compressed dictionary contains first data indicating a number of character bytes a word contained in a main dictionary preceding the corresponding main dictionary entry has in common with a word contained in the corresponding main dictionary entry.
  - 24. The method in accordance with claim 23 wherein each entry of the compressed dictionary also contains second data indicating whether a word contained in the corresponding main dictionary entry includes one of a limited set of common suffixes, wherein a suffix is a sequence of character bytes in a word immediately preceding its one or more word terminators.
  - 25. The method in accordance with claim 24 wherein compressed dictionary entries contain data values representing words contained in their corresponding main dictionary entries, wherein at least one of said data values is encoded to represent a sequence of character bytes.
  - 26. The method in accordance with claim 21 further comprising the step of said first computer transmitting a decompression program to said second computer with said compressed main dictionary, said common word dictionary, and said word index, said second computer executing said decompression program to carry out the step of recreating said text file in response to said compressed dictionary, said common word dictionary, and word index.

27. An apparatus for compressing a text file comprising means for creating a main dictionary listing all unique words of the text file, means for creating a common word dictionary referencing the most commonly encountered words in the text file, wherein each word is a sequence of character bytes beginning other than with a word terminator and including one or more word terminators only as ending characters thereof, and means for creating a word index listing references to common and main dictionary words.
- View Dependent Claims (28, 29, 30)
- - 28. The apparatus in accordance with claim 27 further comprising means for generating a compressed main dictionary wherein for each main dictionary word, there is a corresponding compressed main dictionary entry representing the word in a more compact form than as represented in the main dictionary.
  - 29. The apparatus in accordance with claim 28 wherein at least one compressed main dictionary entry represents leading characters of a main dictionary word matching leading characters of a next preceding main dictionary word with data indicating the number of matching characters, and represents main dictionary word suffixes with data referencing entries in a suffix dictionary.
  - 30. The apparatus in accordance with claim 29 wherein said compressed main dictionary entry represents a sequence of characters with a single data value.
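
The suffix dictionary of claim 29 (and the "second data value" of claims 9 and 11) can be pictured as splitting each word into a stem, an optional reference into a small suffix list, and the trailing terminators. The suffix list and terminator set below are illustrative assumptions, as are the function names; the patent's actual lists are not given in this section.

```python
SUFFIXES = ["ing", "ed", "ly", "s"]   # hypothetical suffix dictionary
TERMINATORS = " ,.\r\n"               # assumed terminator set


def split_suffix(word, suffixes=SUFFIXES, terminators=TERMINATORS):
    """Split a word into (stem, suffix_ref, terminator_tail).

    The suffix is the character run immediately preceding the word's
    terminators; suffix_ref is None when no listed suffix applies.
    """
    body = word.rstrip(terminators)
    tail = word[len(body):]
    for ref, suffix in enumerate(suffixes):
        # Require a non-empty stem so a word is never reduced to a bare suffix.
        if len(body) > len(suffix) and body.endswith(suffix):
            return body[:-len(suffix)], ref, tail
    return body, None, tail


def join_suffix(stem, suffix_ref, tail, suffixes=SUFFIXES):
    """Inverse of split_suffix."""
    suffix = "" if suffix_ref is None else suffixes[suffix_ref]
    return stem + suffix + tail
```

Checking the suffix before the terminators matches the claims' definition of a suffix as the bytes "immediately preceding" the word terminators.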

31. A method for generating a compressed data file representing a text file in more compact form, the text file comprising a first sequence of words, each word formed by at least one text character, the method comprising the steps of:
- generating a dictionary comprising a plurality of entries, each dictionary entry defining a unique word of the text file;
  
  storing in said compressed data file a first type code and a first length code, storing said dictionary in said compressed data file following said first type code and said first length code, wherein said first type code indicates said dictionary follows, and wherein said first length code indicates a length of said dictionary;
  
  generating a word index comprising a second sequence of reference numbers, a reference number at each position of said second sequence referencing a dictionary entry defining a correspondingly positioned word of said first sequence;
  
  storing in said compressed data file a second type code and a second length code; and
  
  storing said word index in said compressed data file following said second type code and said second length code, wherein said second type code indicates that said word index follows, and wherein said second length code indicates a length of said word index.
- View Dependent Claims (32, 33, 34, 35)
- - 32. The method in accordance with claim 31 wherein the step of generating a dictionary comprises the substeps of:
    - generating an ordered list of unique words appearing in the text file, and generating an entry of the dictionary corresponding to each word of the ordered list, the entry containing data defining its corresponding word.
  - 33. The method in accordance with claim 32 wherein each entry of said dictionary includes data indicating a number of characters the word the entry defines has in common with a word defined by a preceding entry of said dictionary.
  - 34. The method in accordance with claim 32 further comprising the steps of:
    - storing a third data type code and a third length code in said compressed data file;
      
      storing a word suffix list in said compressed data file after said third type code and said third length code, said word suffix list containing a plurality of entries, each containing a word suffix, wherein said third data type code indicates that said suffix list follows, wherein said third length code indicates a length of said suffix list, and wherein an entry of said dictionary represents a suffix of its defined word by referencing an entry of said word suffix list.
  - 35. The method in accordance with claim 32 further comprising the steps of:
    - storing a fourth data type code and a fourth length code in said compressed data file;
      
      storing a character list in said compressed data file after said fourth type code and said fourth length code, said character list including a plurality of entries, each referencing a unique sequence of text characters;
      
      wherein said fourth data type code indicates that said character translation list follows, wherein said fourth length code indicates a length of said character code list, and wherein entries of said dictionary represent a sequence of text characters by referencing an entry of said character list.
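
The file layout in claims 31 to 35 is a sequence of sections, each preceded by a type code and a length code. A minimal sketch of such a container follows. The field widths (a one-byte type code and a four-byte big-endian length) and the numeric type values are assumptions made for illustration; the claims do not fix them.

```python
import struct

# Hypothetical type codes; the claims only require that each code
# identifies what kind of section follows.
TYPE_DICTIONARY = 1
TYPE_WORD_INDEX = 2
TYPE_SUFFIX_LIST = 3
TYPE_CHARACTER_LIST = 4

HEADER = ">BI"                       # assumed: 1-byte type, 4-byte big-endian length
HEADER_SIZE = struct.calcsize(HEADER)


def pack_sections(sections):
    """Serialize (type_code, payload_bytes) pairs as type, length, payload."""
    out = bytearray()
    for type_code, payload in sections:
        out += struct.pack(HEADER, type_code, len(payload))
        out += payload
    return bytes(out)


def unpack_sections(data):
    """Parse a byte string produced by pack_sections back into sections."""
    sections = []
    offset = 0
    while offset < len(data):
        type_code, length = struct.unpack_from(HEADER, data, offset)
        offset += HEADER_SIZE
        sections.append((type_code, data[offset:offset + length]))
        offset += length
    return sections
```

Because each section announces its own length, a reader can skip section types it does not recognize, and optional sections such as the suffix list can appear or be omitted without changing the rest of the layout.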

Current Assignee: Infinite Ink Corp. (Fidelio Acquisition Co LLC), Intertrust Technologies Corporation (Fidelio Acquisition Co LLC)
Original Assignee: Infinite Ink Corp. (Fidelio Acquisition Co LLC)
Inventors: Crandall, Gary E.
Primary Examiner(s): Hong, Stephen S.
Assistant Examiner(s): Bourque, Robert D
Application Number: US08/818,765
Time in Patent Office: 998 Days
Field of Search: 707/530, 707/531, 707/532, 707/534, 707/101, 707/100, 341/65, 341/59, 382/232, 382/233, 382/244, 382/245, 382/246
US Class Current: 715/234
CPC Class Codes:
H03M 7/3088 employing the use of a dict...
Y10S 707/99942 Manipulating data structure...
