Apparatus and method for forming a filtered inflected language model for automatic speech recognition

US 6,073,091 A
Filed: 08/06/1997
Issued: 06/06/2000
Est. Priority Date: 08/06/1997
Status: Expired due to Fees

Abstract

A method of forming a language model for a language having a selected vocabulary of word forms comprises: (a) mapping the word forms into integer vectors in accordance with frequencies of word form occurrence; (b) partitioning the integer vectors into subsets, the subsets respectively having ranges of frequencies of word form occurrence associated therewith, the subsets being arranged in a descending order of frequency ranges; (c) respectively assigning maps to the subsets; (d) filtering a textual corpora using the maps assigned to the subsets in order to generate indexed integers; (e) determining n-gram statistics for the indexed integers; and (f) estimating n-gram language model probabilities from the n-gram statistics to form the language model.
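
Steps (a) through (d) can be illustrated with a minimal Python sketch. It is not the patented implementation: the two-subset split, the `head_size` and `base` parameters, the function names, and the representation of an "indexed integer" as a `(subset, position, value)` tuple are all assumptions made for illustration.

```python
from collections import Counter


def build_maps(tokens, head_size, base):
    """Steps (a)-(c): rank word forms by frequency, partition, assign maps."""
    counts = Counter(tokens)
    # Descending frequency; ties broken alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    head_map = {}  # subset 0: most frequent word forms -> one-component vectors
    tail_map = {}  # subset 1: remaining word forms -> two-component vectors
    for number, word in enumerate(ranked):
        if number < head_size:
            head_map[word] = (number,)
        else:
            offset = number - head_size
            tail_map[word] = divmod(offset, base)  # (quotient, residue)
    return [head_map, tail_map]


def filter_corpus(tokens, maps):
    """Step (d): replace each word form by the indexed components of its vector.

    Word forms that appear in no map are skipped (an assumption of this sketch).
    """
    indexed = []
    for word in tokens:
        for subset, word_map in enumerate(maps):
            if word in word_map:
                for position, value in enumerate(word_map[word]):
                    indexed.append((subset, position, value))
                break
    return indexed


corpus = "the cat saw the dog and the cat ran".split()
maps = build_maps(corpus, head_size=2, base=2)
print(filter_corpus(corpus, maps)[:4])
```

Tagging each integer with its subset and position keeps the filtered stream unambiguous, since the same integer value can occur in several maps.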

103 Citations

27 Claims

1. A method of forming a language model for a language having a selected vocabulary of word forms, the method comprising the steps of:
- (a) mapping the word forms into integer vectors in accordance with frequencies of word form occurrence;
  
  (b) partitioning the integer vectors into subsets, the subsets respectively having ranges of frequencies of word form occurrence associated therewith, the subsets being arranged in a descending order of ranges;
  
  (c) respectively assigning maps to the subsets;
  
  (d) filtering a textual corpora using the maps assigned to the subsets in order to generate indexed integers;
  
  (e) determining n-gram statistics for the indexed integers;
  
  (f) estimating n-gram language model probabilities from the n-gram statistics to form the language model; and
  
  (g) determining a probability of a word sequence uttered by a speaker, using said language model.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
  - 2. The method of claim 1, wherein the mapping step further includes attributing word numbers to the word forms and then mapping the word numbers into integer vectors including integer quotients and integer residues.
  - 3. The method of claim 1, wherein the mapping step further includes applying integer quotient and integer division to a last component of an integer vector thereby generating a longer integer vector.
  - 4. The method of claim 1, wherein the mapping step further includes:
    - splitting the word forms into stems and endings;
      
      numerating the stems and endings in accordance with their respective frequencies of occurrence; and
      
      assigning the split word forms to corresponding stem and ending numbers.
  - 5. The method of claim 1, wherein the mapping step further includes:
    - clustering the word forms into classes, the classes having members associated therewith;
      
      numerating the classes and the members in the classes; and
      
      assigning the clustered word forms to corresponding class and member numbers.
  - 6. The method of claim 1, wherein the partitioning step further includes determining the range of frequencies from frequency scores estimated from a textual corpora.
  - 7. The method of claim 6, wherein a first subset having a highest frequency score includes word forms mapped into one-dimensioned vectors.
  - 8. The method of claim 7, wherein a second subset having a next highest frequency score includes word forms mapped into two-dimensional vectors.
  - 9. The method of claim 8, wherein an n-th subset having an n-th highest frequency score includes word forms mapped into n-dimensional vectors.
  - 10. The method of claim 1, wherein if a word in the textual corpora has a high frequency of occurrence, then the word is mapped to a word number, else if the word has a relatively low frequency of occurrence, then the word is split and mapped to two word numbers.
  - 11. The method of claim 1, wherein the determining step further includes determining the n-gram statistics for bigram and trigram tuples of indexed integers.
  - 12. The method of claim 1, wherein the determining step further includes determining the n-gram statistics for n-tuples of indexed vectors.
  - 13. The method of claim 1, wherein the estimating step further includes generating a probability score for a tuple of word forms as a product of corresponding unigram, bigram and trigram tuples of indexed integers.
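
The quotient-and-residue mapping of claims 2 and 3 can be sketched as follows. The claims do not say which divisors are used, so the `bases` tuple, the function names, and the choice of a smaller divisor at each step are assumptions; a smaller divisor is used because dividing a residue again by the same divisor would always give a zero quotient.

```python
def split_last(vector, base):
    """Claim 3 idea: replace the last component by its quotient and residue."""
    *head, last = vector
    quotient, residue = divmod(last, base)
    return (*head, quotient, residue)


def to_vector(word_number, bases):
    """Claim 2 idea: turn a word number into an integer vector.

    Each base in `bases` lengthens the vector by one component.
    """
    vector = (word_number,)
    for base in bases:
        vector = split_last(vector, base)
    return vector


def from_vector(vector, bases):
    """Inverse mapping, showing that no information is lost."""
    vector = list(vector)
    for base in reversed(bases):
        residue = vector.pop()
        quotient = vector.pop()
        vector.append(quotient * base + residue)
    return vector[0]


print(to_vector(1234, (100, 10)))
```

With an empty `bases` tuple the word number stays a one-component vector, which corresponds to the highest-frequency subset described in claim 7.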

14. Apparatus for forming a language model for a language having a selected vocabulary of word forms, the apparatus comprising:
- means for mapping the word forms into integer vectors in accordance with frequencies of word form occurrence;
  
  means for partitioning the integer vectors into subsets, the subsets respectively having ranges of frequencies of word form occurrence associated therewith, the subsets being arranged in a descending order of ranges;
  
  means for respectively assigning maps to the subsets;
  
  means for filtering a textual corpora using the maps assigned to the subsets in order to generate indexed integers;
  
  means for determining n-gram statistics for the indexed integers;
  
  means for estimating n-gram language model probabilities from the n-gram statistics to form the language model; and
  
  means for determining a probability of a word sequence uttered by a speaker, using said language model.
- Dependent claims (15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26)
  - 15. The apparatus of claim 14, wherein the mapping means further includes means for attributing word numbers to the word forms and then mapping the word numbers into integer vectors including integer quotients and integer residues.
  - 16. The apparatus of claim 14, wherein the mapping means further includes means for applying integer quotient and integer division to a last component of an integer vector thereby generating a longer integer vector.
  - 17. The apparatus of claim 14, wherein the mapping means further includes:
    - means for splitting the word forms into stems and endings;
      
      means for numerating the stems and endings in accordance with their respective frequencies of occurrence; and
      
      means for assigning the split word forms to corresponding stem and ending numbers.
  - 18. The apparatus of claim 14, wherein the mapping means further includes:
    - means for clustering the word forms into classes, the classes having members associated therewith;
      
      means for numerating the classes and the members in the classes; and
      
      means for assigning the clustered word forms to corresponding class and member numbers.
  - 19. The apparatus of claim 14, wherein the partitioning means further includes means for determining the range of frequencies from frequency scores estimated from a textual corpora.
  - 20. The apparatus of claim 19, wherein a first subset having a highest frequency score includes word forms mapped into one-dimensional vectors.
  - 21. The apparatus of claim 20, wherein a second subset having a next highest frequency score includes word forms mapped into two-dimensional vectors.
  - 22. The apparatus of claim 21, wherein an n-th subset having an n-th highest frequency score includes word forms mapped into n-dimensional vectors.
  - 23. The apparatus of claim 14, wherein if a word in the textual corpora has a high frequency of occurrence, then the word is mapped to a word number, else if the word has a relatively low frequency of occurrence, then the word is split and mapped to two word numbers.
  - 24. The apparatus of claim 14, wherein the determining means further includes means for determining the n-gram statistics from bigram and trigram tuples of indexed integers.
  - 25. The apparatus of claim 14, wherein the determining means further includes means for determining the n-gram statistics for n-tuples of indexed vectors.
  - 26. The apparatus of claim 14, wherein the estimating means further includes means for generating a probability score for a tuple of word forms as a product of corresponding unigram, bigram and trigram tuples of indexed integers.
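
The stem-and-ending mapping of claims 17 and 23 (and method claims 4 and 10) might look like the sketch below. The claims do not specify how a word form is split, so the toy `ENDINGS` list, the `min_count` threshold, and every function name here are hypothetical.

```python
from collections import Counter

# Hypothetical toy list of endings, tried in order (longest first).
ENDINGS = ("ami", "ach", "y", "a", "u")


def split_word(word):
    """Split a word form into (stem, ending); ending is '' if none matches."""
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) > len(ending):
            return word[:-len(ending)], ending
    return word, ""


def number_by_frequency(items):
    """Assign 0 to the most frequent item, 1 to the next, and so on."""
    counts = Counter(items)
    ranked = sorted(counts, key=lambda x: (-counts[x], x))
    return {item: n for n, item in enumerate(ranked)}


def map_tokens(tokens, min_count):
    """Frequent words get one word number; rare words get (stem, ending) numbers."""
    word_counts = Counter(tokens)
    frequent = [w for w in tokens if word_counts[w] >= min_count]
    rare_splits = [split_word(w) for w in tokens if word_counts[w] < min_count]
    word_no = number_by_frequency(frequent)
    stem_no = number_by_frequency(stem for stem, _ in rare_splits)
    ending_no = number_by_frequency(ending for _, ending in rare_splits)
    mapping = {}
    for word in word_counts:
        if word in word_no:
            mapping[word] = (word_no[word],)
        else:
            stem, ending = split_word(word)
            mapping[word] = (stem_no[stem], ending_no[ending])
    return mapping


tokens = ["zena", "zena", "zena", "zeny", "zenu", "ruka", "ruku"]
print(map_tokens(tokens, min_count=2))
```

Rare inflected forms that share a stem or an ending then share a component number, so their counts are pooled when n-gram statistics are gathered.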

27. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for forming a language model for a language having a selected vocabulary of word forms, the method comprising the steps of:
- (a) mapping the word forms into integer vectors in accordance with frequencies of word form occurrence;
  
  (b) partitioning the integer vectors into subsets, the subsets respectively having ranges of frequencies of word form occurrence associated therewith, the subsets being arranged in a descending order of ranges;
  
  (c) respectively assigning maps to the subsets;
  
  (d) filtering a textual corpora using the maps assigned to the subsets in order to generate indexed integers;
  
  (e) determining n-gram statistics for the indexed integers;
  
  (f) estimating n-gram language model probabilities from the n-gram statistics to form the language model; and
  
  (g) determining a probability of a word sequence uttered by a speaker, using said language model.
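
Steps (e) through (g) can be sketched with plain relative-frequency estimates. This is only an illustration under stated assumptions: the claims do not name an estimator, so the unsmoothed maximum-likelihood estimate, the trigram limit, and the function names below are choices made for this example.

```python
from collections import Counter


def ngram_counts(indexed, max_n=3):
    """Step (e): count every n-gram of indexed integers for n = 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(indexed) - n + 1):
            counts[tuple(indexed[i:i + n])] += 1
    return counts


def conditional(counts, total, ngram):
    """Step (f): relative-frequency estimate of P(last item | preceding items)."""
    if len(ngram) == 1:
        return counts[ngram] / total
    history = counts[ngram[:-1]]
    return counts[ngram] / history if history else 0.0


def sequence_probability(counts, total, sequence):
    """Step (g): product of unigram, bigram, and trigram conditional estimates."""
    prob = 1.0
    for i in range(len(sequence)):
        # Use up to two preceding items as history.
        ngram = tuple(sequence[max(0, i - 2):i + 1])
        prob *= conditional(counts, total, ngram)
    return prob


indexed = [0, 1, 2, 0, 1, 3]
counts = ngram_counts(indexed)
print(sequence_probability(counts, len(indexed), [0, 1, 2]))
```

Without smoothing, any sequence containing an unseen n-gram scores zero, so a practical model would add smoothing or back-off on top of these counts.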

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Kanevsky, Dimitri; Monkowski, Michael Daniel; Sedivy, Jan
Primary Examiner(s): Isen, Forester W.
Assistant Examiner(s): Edouard, Patrick Nestor
Application Number: US08/906,812
Time in Patent Office: 1,035 Days
Field of Search: 704/1, 704/9, 704/240, 704/243, 704/255, 704/256, 704/257, 707/530, 707/531
US Class Current: 704/9
CPC Class Codes: G10L 15/197 Probabilistic grammars, e.g...
