System and method for providing lossless compression of n-gram language models in a real-time decoder

US 6,092,038 A
Filed: 02/05/1998
Issued: 07/18/2000
Est. Priority Date: 02/05/1998
Status: Expired due to Fees

First Claim

Patent Images

1. A method for losslessly compressing an n-gram language model for storage in a storage device, the n-gram language model comprising a plurality of n-gram records generated from a training vocabulary, each n-gram record comprising an n-gram in the form of a series of "n-tuple" words (w1, w2, . . . wn), a count and a probability associated therewith, each n-gram having a history represented by the initial n-1 words of the n-gram, said method comprising the steps of:

splitting said plurality of n-gram records into (i) a set of common history records comprising subsets of n-tuple words having a common history and (ii) sets of hypothesis records that are associated with the common history records, each set of hypothesis records including at least one hypothesis record comprising a word record-probability record pair;

partitioning said common history records into at least a first group and a second group, said first group comprising each common history record having a single hypothesis record associated therewith, said second group comprising each common history record having more than one hypothesis record associated therewith;

storing said hypothesis records associated with said second group of common history records in said storage device; and

storing, in an index portion of said storage device, (i) each common history record of said second group together with an address that points to a location in said storage device having corresponding hypothesis records and (ii) each common history record of said first group together with its corresponding single hypothesis record.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

System and methods for compressing (losslessly) n-gram language models for use in real-time decoding, whereby the size of the model is significantly reduced without increasing the decoding time of the recognizer. Lossless compression is achieved using various techniques. In one aspect, n-gram records of an N-gram language model are split into (i) a set of common history records that include subsets of n-tuple words having a common history and (ii) sets of hypothesis records that are associated with the common history records. The common history records are separated into a first group of common history records each having only one hypothesis record associated therewith and a second group of common history records each having more than one hypothesis record associated therewith. The first group of common history records are stored together with their corresponding hypothesis record in an index portion of a memory block comprising the N-gram language model and the second group of common history records are stored in the index together with addresses pointing to a memory location having the corresponding hypothesis records. Other compression techniques include, for instance, mapping word records of the hypothesis records into word numbers and storing a difference value between subsequent word numbers; segmenting the addresses and storing indexes to the addresses in each segment to multiples of the addresses; storing word records and probability records as fractions of bytes such that each pair of word-probability records occupies a multiple of bytes and storing flags indicating the length; and storing the probability records as indexes to sorted count values that are used to compute the probability on the run.

Citations

32 Claims

1. A method for losslessly compressing an n-gram language model for storage in a storage device, the n-gram language model comprising a plurality of n-gram records generated from a training vocabulary, each n-gram record comprising an n-gram in the form of a series of "n-tuple" words (w1, w2, . . . wn), a count and a probability associated therewith, each n-gram having a history represented by the initial n-1 words of the n-gram, said method comprising the steps of:
- splitting said plurality of n-gram records into (i) a set of common history records comprising subsets of n-tuple words having a common history and (ii) sets of hypothesis records that are associated with the common history records, each set of hypothesis records including at least one hypothesis record comprising a word record-probability record pair;
  
  partitioning said common history records into at least a first group and a second group, said first group comprising each common history record having a single hypothesis record associated therewith, said second group comprising each common history record having more than one hypothesis record associated therewith;
  
  storing said hypothesis records associated with said second group of common history records in said storage device; and
  
  storing, in an index portion of said storage device, (i) each common history record of said second group together with an address that points to a location in said storage device having corresponding hypothesis records and (ii) each common history record of said first group together with its corresponding single hypothesis record.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
- - 2. The method of claim 1, wherein n=3, wherein the step of splitting comprises the step of generating common history records and hypothesis records for each w2 word of the plurality of n-grams, and wherein the hypothesis records comprise (i) bigram records having a common history w2 and (ii) trigram records having a common history w1 w2, the method further comprising the step of generating an index block for each w2 word, the index block comprising a plurality of entries, each entry (i) corresponding to a different w1 word having a common history with the w2 word and (ii) having an address that points to a block of trigram records having different w3 words but a common history w1 and w2.
  - 3. The method of claim 1, further comprising the steps of:
    - mapping words of said n-gram records into word numbers based on a frequency of occurrence of said words in said training vocabulary;
      
      sorting the word-probability records in each of said sets of hypothesis records based on the word numbers corresponding to the word records;
      
      calculating a difference between subsequent ones of the sorted word numbers of said word records; and
      
      storing said differences to represent said word records.
  - 4. The method of claim 3, wherein the step of mapping comprises the steps of (i) assigning an integer value to each unique word in the training vocabulary comprising N words from 1 to N such that a most frequently occurring word in the training vocabulary is assigned a integer value of 1 and the least frequently occurring word in the training vocabulary is assigned value N, and (ii) sorting said word numbers in descending frequency from integer value 1 to integer value N, wherein said word-probability records for each subset of hypothesis records are sorted in an increasing order of the integer values of the word numbers corresponding to said word records.
  - 5. The method of claim 1, further comprising the steps of:
    - partitioning said addresses in said index portion into i segments;
      
      mapping each address in each of the i segments to an index number; and
      
      storing said index numbers in said index portion to represent said addresses.
  - 6. The method of claim 5, wherein each of the i segments comprise no more than 65,536 addresses, and wherein the index numbers of the i-th segment are equal to A-(i1) * 65,536 where A represents the actual address in the i-th segment.
  - 7. The method of claim 6, further comprising the steps of storing a table of multiples comprising values that are used for determining the actual address value of an index number of an i-th segment, wherein the actual address value of the index number of the i-th segment is determined by adding a value equal to i*65,536 to said index number.
  - 8. The method of claim 1, further comprising the steps of:
    - computing a count value for each count associated with the n-gram records, said count value being equal to log10(count);
      
      sorting each unique count value in descending order and generating an index to the sorted count values; and
      
      storing an index to said count values in said hypothesis records to represent said probability records.
  - 9. The method of claim 8, further comprising the step of determining a series of lambda weights for said probability records on the run from said stored count values.
  - 10. The method of claim 8, further comprising the steps of computing a probability value on the run for a given hypothesis record by accessing the count values associated with said stored index and computing the difference between the log10 count values for the corresponding probability record.
  - 11. The method of claim 1, further comprising the steps of:
    - splitting each of said sets of hypothesis records into a first set and a second set, said first set containing the first n records of the set of hypothesis records and said second set containing the remainder of the set of hypothesis records;
      
      storing an A flag in the index portion of said storage device for each of said first sets of n records to mark a byte length for each of said n records; and
      
      storing a B flag in the word records and probability records of each second set of remaining hypothesis records to mark a byte length for each of said remaining records.
  - 12. The method of claim 11, wherein said B flags are stored such that the length of said B flags plus said word records and probability records occupy a multiple of bytes.
  - 13. The method of claim 1, wherein each set of hypothesis records corresponds to at least one of the subsets of n-tuple words having a common history, and wherein each common history record is stored together with its corresponding sets of hypothesis records and a unique index block having the addresses that point to locations of the corresponding hypothesis records, in a contiguous block of memory locations of the storage device.
  - 14. The method of claim 1, further comprising the step of storing at the beginning of the index portion of said storage device, a first parameter and a second parameter for determining, respectively, (i) a block size of said index portion comprising said first group of common history records and said addresses and (ii) a block size of said index portion comprising said second group of common history records and corresponding hypothesis records.
  - 15. The method of claim 1, wherein at least one of the pairs of word-probability records have the word record and probability record stored in fractions of bytes such that the at least one hypothesis record is stored as a multiple of a byte.
  - 16. The method of claim 15, further comprising the step of storing a flag for the at least one hypothesis record to indicate its byte length.

17. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform method steps for losslessly compressing an n-gram language model for storage in a storage device, the n-gram language model comprising a plurality of n-gram records generated from a training vocabulary, each n-gram record comprising an n-gram in the form of a series of "n-tuple" words (w1, w2, . . . wn), a count and a probability associated therewith, each n-gram having a history represented by the initial n-1 words of the n-gram, said method comprising the steps of:
- splitting said plurality of n-gram records into (i) a set of common history records comprising subsets of n-tuple words having a common history and (ii) sets of hypothesis records that are associated with the common history records, each set of hypothesis records including at least one hypothesis record comprising a word record-probability record pair;
  
  partitioning said common history records into at least a first group and a second group, said first group comprising each common history record having a single hypothesis record associated therewith, said second group comprising each common history record having more than one hypothesis record associated therewith;
  
  storing said hypothesis records associated with said second group of common history records in said storage device; and
  
  storing, in an index portion of said storage device, (i) each common history record of said second group together with an address that points to a location in said storage device having corresponding hypothesis records and (ii) each common history record of said first group together with its corresponding single hypothesis record.
- View Dependent Claims (18, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32)
- - 18. The program storage device of claim 17, wherein n=3, wherein the instructions for performing the step of splitting comprise instructions for performing the step of generating common history records and hypothesis records for each w2 word of the plurality of n-grams, and wherein the hypothesis records comprise (i) bigram records having a common history w2 and (ii) trigram records having a common history w1 w2, the program storage device further comprising instructions for performing the step of generating an index block for each w2 word, the index block comprising a plurality of entries, each entry (i) corresponding to a different w1 word having a common history with the w2 word and (ii) having an address that points to a block of trigram records having different w3 words but a common history w1 and w2.
  - 21. The program storage device of claim 17, further comprising instructions for performing the steps of:
    - partitioning said addresses in said index portion into i segments;
      
      mapping each address in each of the i segments to an index number; and
      
      storing said index numbers in said index portion to represent said addresses.
  - 22. The program storage device of claim 21, wherein each of the i segments comprise no more than 65,536 addresses, and wherein the index numbers of the i-th segment are equal to A-(i-1) * 65,536 where A represents the actual address in the i-th segment.
  - 23. The program storage device of claim 22, further comprising instructions for performing the step of storing a table of multiples comprising values that are used for determining an actual address value of an index number of an i-th segment, wherein the actual address value of the index number of the i-th segment is determined by adding a value equal to i*65,536 to said index number.
  - 24. The program storage device claim 17, further comprising instructions for performing the steps of:
    - computing a count value for each count associated with the n-gram records, said count value being equal to log10(count);
      
      sorting each unique count value in descending order and generating an index to the sorted count values; and
      
      storing an index to said count values in said hypothesis records to represent said probability records.
  - 25. The program storage device of claim 24, further comprising instructions for performing the steps of computing a probability value on the run for a given hypothesis record by accessing the count values associated with said stored index and computing the difference between the log10 count values for the corresponding probability record.
  - 26. The program storage device of claim 24, further comprising instructions for performing the step of determining a series of lambda weights for said probability records on the run from said stored count values.
  - 27. The program storage device of claim 17, further comprising instructions for performing the steps of:
    - splitting each of said sets of hypothesis records into a first set and a second set, said first set containing the first n records of the set of hypothesis records and said second set containing the remainder of the set of hypothesis records;
      
      storing an A flag in the index portion of said storage device for each of said first sets of n records to mark a byte length for each of said n records; and
      
      storing a B flag in the word records and probability records of each second set of remaining hypothesis records to mark a byte length for each of said remaining records.
  - 28. The program storage device of claim 27, wherein said B flags are stored such that the length of said B flags plus said word records and probability records occupy a multiple of bytes.
  - 29. The program storage device of claim 17, wherein each set of hypothesis records corresponds to at least one of the subsets of n-tuple words having a common history, and wherein each common history record is stored together with its corresponding sets of hypothesis records and a unique index block having the addresses that point to locations of the corresponding hypothesis record, in a contiguous block of memory locations of the storage device.
  - 30. The program storage device of claim 17, further comprising instructions for performing the step of storing at the beginning of the index portion of said storage device, a first parameter and a second parameter for determining, respectively, (i) a block size of said index portion comprising said first group of common history records and said addresses and (ii) a block size of said index portion comprising said second group of common history records and corresponding hypothesis records.
  - 31. The program storage device of claim 17, wherein at least one of the pairs of word- probability records have the word record and probability record stored in fractions of bytes such that the at least one hypothesis record is stored as a multiple of a byte.
  - 32. The program storage device of claim 31, further comprising instructions for performing the step of storing a flag for the at least one hypothesis record to indicate its byte length.

19. The program storage device of 17 further comprising instructions for performing the steps of:
- mapping words of said n-gram records into word numbers based on a frequency of occurrence of said words in said training vocabulary;
  
  sorting the word-probability records in each of said sets of hypothesis records based on the word numbers corresponding to the word records;
  
  calculating a difference between subsequent ones of the sorted word numbers of said word records; and
  
  storing said differences to represent said word records.
- View Dependent Claims (20)
- - 20. The program storage device of claim 19, wherein the instructions for performing the step of mapping comprise instructions for performing the steps of (i) assigning an integer value to each unique word in the training vocabulary comprising N words from 1 to N such that a most frequently occurring word in the training vocabulary is assigned a integer value of 1 and the least frequently occurring word in the training vocabulary is assigned value N, and (ii) sorting said word numbers in descending frequency from integer value 1 to integer value N, wherein said word-probability records for each subset of hypothesis records are sorted in an increasing order of the integer values of the word numbers corresponding to said word records.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Rao, Srinivasa Patibandla, Kanevsky, Dimitri
Primary Examiner(s)
Thomas, Joseph

Application Number

US09/019,012
Time in Patent Office

894 Days
Field of Search

704/9, 704/10, 704/1, 704/251, 704/255, 704/256, 704/257, 704/243, 704/270, 704/277, 707/530, 707/531, 707/532, 707/533, 707/534, 707/104
US Class Current

704/9
CPC Class Codes

H03M 7/30 Compression speech analysis...

System and method for providing lossless compression of n-gram language models in a real-time decoder

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

Citations

32 Claims

Specification

Solutions

Use Cases

Quick Links

System and method for providing lossless compression of n-gram language models in a real-time decoder

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

32 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links