Phoneme based speech recognition

US 5,390,278 A
Filed: 10/08/1991
Issued: 02/14/1995
Est. Priority Date: 10/08/1991
Status: Expired due to Term

Abstract

A flexible vocabulary speech recognition system is provided for recognizing speech transmitted via the public switched telephone network. The flexible vocabulary recognition (FVR) system is phoneme based: the phonemes are modelled as hidden Markov models, and the vocabulary is represented as concatenated phoneme models. The phoneme models are trained using Viterbi training enhanced by substituting the covariance matrices of given phonemes with those of others, and by applying energy level thresholds and voiced, unvoiced, silence labelling constraints during Viterbi training. Specific vocabulary members, such as digits, are represented by allophone models. A* searching of the lexical network is facilitated by a reduced network that provides estimate scores used to evaluate the recognition path through the lexical network. Joint recognition and rejection of out-of-vocabulary words are provided by using both cepstrum and LSP parameter vectors.
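As a rough illustration of a vocabulary "represented as concatenated phoneme models", the sketch below builds a word-level state sequence from per-phoneme state lists. The phoneme labels, state names, and the `build_word_model` helper are hypothetical and are not taken from the patent.

```python
def build_word_model(word, lexicon, phoneme_models):
    """Concatenate per-phoneme state lists into one word-level state list.

    lexicon maps a word to its phoneme transcription; phoneme_models maps a
    phoneme label to the ordered states of its HMM (both are illustrative).
    """
    states = []
    for phoneme in lexicon[word]:
        states.extend(phoneme_models[phoneme])
    return states


# Hypothetical inventory: two states per phoneme, ASCII labels.
PHONEME_MODELS = {
    "k": ["k_1", "k_2"],
    "a": ["a_1", "a_2"],
    "t": ["t_1", "t_2"],
}
LEXICON = {"cat": ["k", "a", "t"], "tack": ["t", "a", "k"]}

word_states = build_word_model("tack", LEXICON, PHONEME_MODELS)
```

Under this arrangement, adding a vocabulary member only requires a new transcription in the lexicon; no word-specific acoustic model is trained.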

198 Citations

28 Claims

1. A method of preparing phoneme models for recognition of speech received via telephone lines comprising the steps of:
- a) analyzing a training word to generate a frame sequence of acoustic parameter vectors representative thereof and obtaining a phoneme model sequence of the training word;
  
  b) providing a first set of model parameters representative of a sequence of state-transition models corresponding to each phoneme of the phoneme model sequence in the training word, the parameters including a mean vector and weighting factor for each transition and a covariance matrix for each model;
  
  c) computing a set of observation probabilities for each phoneme of the phoneme model sequence for the training word and the first set of model parameters;
  
  d) aligning the frame sequence of acoustic parameter vectors with the sequence of state-transition models to provide a mapping therebetween representative of a path through the sequence of state-transition models having a highest likelihood;
  
  e) accumulating statistics for a plurality of utterances of said training word using the mapping of step d);
  
  f) generating a second set of model parameters representative of the sequence of state-transition models corresponding to each phoneme of the phoneme model sequence in the training word;
  
  g) repeating step d) for the second set of model parameters;
  
  h) comparing the likelihood of the first and second sets of model parameters; and
  
  i) repeating step b) through h), replacing the first set of model parameters by the second set of model parameters when the second set of parameters provides at least a predetermined improvement in likelihood; and
  
  j) substituting the covariance matrix of a first model with the covariance matrix of a second model to provide a smooth covariance matrix thereby improving recognition accuracy for the first model.
  - 2. A method as claimed in claim 1 wherein the first model represents a left silence phoneme ({) and the second model represents a phoneme (f).
  - 3. A method as claimed in claim 1 wherein the first model represents a right silence phoneme (}) and the second model represents a phoneme (f).
  - 4. A method as claimed in claim 1 wherein the first model represents a phoneme selected from the group consisting of , , , and the second model represents a phoneme ( ).
  - 5. A method as claimed in claim 1 wherein the first model represents a phoneme (ε before r) and the second model represents a phoneme ( ).
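The training loop of claim 1 (align, accumulate, re-estimate, compare likelihoods, repeat) followed by the covariance substitution of step j) can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: frames are scalars, each state has a single Gaussian with a fixed variance, transition weights are ignored, only the means are re-estimated, and every utterance has at least as many frames as states. The names `viterbi_align`, `train`, and `substitute_covariance` are hypothetical.

```python
import math

NEG_INF = float("-inf")


def log_gauss(x, mean, var):
    """Log density of a scalar Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_align(log_obs):
    """Best left-to-right path (stay or advance one state per frame).

    log_obs[t][s] is the log observation probability of frame t in state s.
    The path starts in state 0 and ends in the last state.
    """
    n_frames, n_states = len(log_obs), len(log_obs[0])
    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_obs[0][0]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            advance = score[t - 1][s - 1] if s > 0 else NEG_INF
            if advance > stay:
                best, back[t][s] = advance, s - 1
            else:
                best, back[t][s] = stay, s
            score[t][s] = best + log_obs[t][s]
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, score[-1][-1]


def train(utterances, means, var=1.0, min_gain=1e-3, max_iters=20):
    """Viterbi training: keep new parameters only if likelihood improves enough."""
    def align_all(current):
        paths, total = [], 0.0
        for frames in utterances:
            log_obs = [[log_gauss(x, mu, var) for mu in current] for x in frames]
            path, ll = viterbi_align(log_obs)
            paths.append(path)
            total += ll
        return paths, total

    paths, best = align_all(means)
    for _ in range(max_iters):
        sums, counts = [0.0] * len(means), [0] * len(means)
        for frames, path in zip(utterances, paths):  # accumulate statistics
            for x, s in zip(frames, path):
                sums[s] += x
                counts[s] += 1
        new_means = [sums[s] / counts[s] if counts[s] else means[s]
                     for s in range(len(means))]
        new_paths, new_ll = align_all(new_means)
        if new_ll - best < min_gain:  # improvement below the preset margin
            break
        means, paths, best = new_means, new_paths, new_ll
    return means, best


def substitute_covariance(covariances, first, second):
    """Return a copy in which model `first` uses the covariance of `second`."""
    smoothed = dict(covariances)
    smoothed[first] = covariances[second]
    return smoothed
```

For example, `substitute_covariance(covs, "{", "f")` mirrors claim 2, where the left silence model borrows the covariance matrix of the (f) model.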

6. A method of preparing phoneme models for recognition of speech received via telephone lines comprising the steps of:
- a) analyzing a training word to generate a frame sequence of acoustic parameter vectors representative thereof, having a parameter indicative of energy level for each frame of the frame sequence and obtaining a phoneme model sequence of the training word;
  
  b) providing a first set of model parameters representative of a sequence of state-transition models corresponding to each phoneme of the phoneme model sequence in the training word;
  
  c) computing a set of observation probabilities by, for each frame of the frame sequence and each model of the sequence of state-transition models, comparing the energy level of the frame with a predetermined, relative to noise on the telephone lines, energy threshold for the model, and if the energy level is below the energy threshold, setting the observation probability for the frame to zero, otherwise computing the observation probability for the frame;
  
  d) aligning the frame sequence of acoustic parameter vectors with the sequence of state-transition models to provide a mapping therebetween representative of a path through the sequence of state-transition models having a highest likelihood;
  
  e) accumulating statistics for a plurality of utterances of said training word using the mapping of step d);
  
  f) generating a second set of model parameters representative of the sequence of state-transition models corresponding to the training word;
  
  g) repeating step d) for the second set of model parameters;
  
  h) comparing the likelihood of the first and second sets of model parameters; and
  
  i) repeating step b) through h), replacing the first set of model parameters by the second set of model parameters when the second set of parameters provides at least a predetermined improvement in likelihood.
  - 7. A method as claimed in claim 6 wherein the model represents a vowel other than ( ) and if the model is one of four first models of the sequence of state-transition models, the energy threshold is 10 dB above background noise; otherwise, if the sequence of state-transition models is greater than 10 models and the model is one of four last models of the sequence, the energy threshold is 1 dB above background noise; otherwise, the energy is 6 dB above background noise.
  - 8. A method as claimed in claim 6 wherein the model represents a vowel ( ) and if the model is one of four first models of the sequence of state-transition models, the energy threshold is 6 dB above background noise; otherwise, if the sequence of state-transition models is greater than 10 models and the model is one of four last models of the sequence, the energy threshold is 1 dB above background noise; otherwise, the energy is 3 dB above background noise.
  - 9. A method as claimed in claim 6 wherein the model represents a phoneme selected from the group consisting of l, r, j, w, ∫, and and if the sequence of state-transition models is greater than 10 models and the model is one of four last models of the sequence, the energy threshold is 1 dB above background noise; otherwise, the energy is 3 dB above background noise.
  - 10. A method as claimed in claim 6 wherein the model represents a phoneme selected from the group consisting of f, v, φ, , and h and the model is one of four first models, the energy threshold is 1 dB above background noise.
  - 11. A method as claimed in claim 6 wherein the model represents a phoneme selected from the group consisting of s, z, n, m, and , the energy threshold is 1 dB above background noise.
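The position-dependent thresholds of claims 7 and 8 and the gating of step c) in claim 6 can be illustrated with a short sketch. The function names are hypothetical, positions are assumed to be zero-based indices into the model sequence, and the vowel whose symbol did not survive in the text above is represented only by a boolean flag rather than guessed.

```python
def energy_threshold_db(position, n_models, is_claim8_vowel=False):
    """Threshold in dB above background noise for a vowel model.

    position is the zero-based index of the model in the sequence of
    n_models models. is_claim8_vowel selects the claim 8 values for the
    one vowel singled out there (its symbol is missing from the text).
    """
    if is_claim8_vowel:
        initial, default = 6.0, 3.0
    else:
        initial, default = 10.0, 6.0
    if position < 4:  # one of the four first models
        return initial
    if n_models > 10 and position >= n_models - 4:  # one of the four last
        return 1.0
    return default


def gated_observation_prob(frame_energy_db, noise_db, threshold_db, compute_prob):
    """Return 0.0 when the frame is too quiet for the model, else the real score.

    compute_prob is a zero-argument callable so the (possibly costly)
    observation probability is only evaluated when the frame passes the gate.
    """
    if frame_energy_db < noise_db + threshold_db:
        return 0.0
    return compute_prob()
```

Setting the probability to zero removes the frame-to-model pairing from every candidate path, so the alignment in step d) cannot map low-energy frames onto models that are expected to be loud.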

12. A method of preparing phoneme models for recognition of speech received via telephone lines comprising the steps of:
- a) analyzing a training word to generate a frame sequence of acoustic parameter vectors representative thereof and to label each frame of the frame sequence as voiced, unvoiced or silence;
  
  b) providing a first set of model parameters representative of a sequence of state-transition models corresponding to the training word, including the step of labelling each model as voiced, unvoiced or silence in dependence upon a phoneme represented by the model and a relative position of the model in the sequence;
  
  c) computing a set of observation probabilities by, for each frame of the frame sequence and each model of the model sequence of state-transition models, comparing a voiced-unvoiced-silence (VUS) label of the frame with a VUS label of the model and if the labels do not match, setting the observation probability for the frame to zero, otherwise computing the observation probability for the frame for the training word and the first set of model parameters;
  
  d) aligning the frame sequence of acoustic parameter vectors with the sequence of state-transition models to provide a mapping therebetween representative of a path through the sequence of state-transition models having a highest likelihood;
  
  e) accumulating statistics for a plurality of utterances of said training word using the mapping of step d);
  
  f) generating a second set of model parameters representative of the sequence of state-transition models corresponding to the training word;
  
  g) repeating step d) for the second set of model parameters;
  
  h) comparing the likelihood of the first and second sets of model parameters; and
  
  i) repeating step b) through h), replacing the first set of model parameters by the second set of model parameters when the second set of parameters provides at least a predetermined improvement in likelihood.
  - 13. A method as claimed in claim 12 wherein the phoneme represented is intervocalic silence and wherein the step of labelling maps both unvoiced and silence labels to the model.
  - 14. A method as claimed in claim 12 wherein the phoneme represented is (j) and wherein the step of labelling maps both unvoiced and voiced labels to the model.
  - 15. A method as claimed in claim 12 wherein the phoneme represented is selected from the group consisting of (n, m, and ) and wherein if the sequence is greater than 10 models and the model is one of four last models, then the step of labelling maps voiced, unvoiced, and silence labels to the phoneme, otherwise the step of labelling maps both voiced and unvoiced labels to the model.
  - 16. A method as claimed in claim 12 wherein the phoneme represented is selected from the group consisting of (vowel, l, r, and w) and wherein if the model is in an initial position or is one of four last models in a sequence of greater than 10 models then the step of labelling maps both voiced and unvoiced labels to the model, otherwise maps a voiced label to the model.
  - 17. A method as claimed in claim 16 wherein the model represents the phoneme (r) when followed by a vowel and preceded by (t) or (f) and the step of labelling maps both voiced and unvoiced labels to the model.
  - 18. A method as claimed in claim 16 wherein the model represents the phoneme (w) when preceded by (k) and the step of labelling maps both voiced and unvoiced labels to the model.
  - 19. A method as claimed in claim 16 wherein the model represents the phoneme (i) when followed by inter-word silence and the step of labelling maps both voiced and unvoiced labels to the model.
  - 20. A method as claimed in claim 16 wherein the model represents the phoneme (I) when preceded by (d) or inter-word silence and the step of labelling maps both voiced and unvoiced labels to the model.
  - 21. A method as claimed in claim 16 wherein the model represents the phoneme (ε) when preceded by inter-word silence and the step of labelling maps both voiced and unvoiced labels to the model.
  - 22. A method as claimed in claim 16 wherein the model represents the phoneme (u) when preceded by (j) and the step of labelling maps both voiced and unvoiced labels to the model.
  - 23. A method as claimed in claim 16 wherein the model represents the phoneme ( ) when preceded by (s), (∫), (r), (d), or inter-word silence and the step of labelling maps both voiced and unvoiced labels to the model.
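The voiced-unvoiced-silence (VUS) constraint of claim 12 and the label sets of claims 13 to 16 can be sketched as a lookup followed by a membership test. The class names, the zero-based position convention, and the permissive default for phonemes not covered by these claims are assumptions; the context-dependent exceptions of claims 17 to 23 are not modelled.

```python
V, U, S = "voiced", "unvoiced", "silence"


def allowed_vus(phoneme_class, position, n_models):
    """Set of frame labels a model accepts (claims 13 to 16 only).

    phoneme_class is an illustrative grouping: "intervocalic_silence", "j",
    "nasal" (claim 15) or "vowel_l_r_w" (claim 16).
    """
    in_last_four_of_long = n_models > 10 and position >= n_models - 4
    if phoneme_class == "intervocalic_silence":
        return {U, S}
    if phoneme_class == "j":
        return {U, V}
    if phoneme_class == "nasal":
        return {V, U, S} if in_last_four_of_long else {V, U}
    if phoneme_class == "vowel_l_r_w":
        if position == 0 or in_last_four_of_long:
            return {V, U}
        return {V}
    # Assumption: phonemes not covered above are left unconstrained.
    return {V, U, S}


def vus_gated_prob(frame_label, model_labels, compute_prob):
    """Zero the observation probability when the frame label is not allowed."""
    if frame_label not in model_labels:
        return 0.0
    return compute_prob()
```

Treating a "match" as set membership is one reading of how a model that carries several labels is compared with a frame that carries exactly one.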

24. A method of speech recognition for speech received via telephone lines comprising the steps of:
- a) analyzing an unknown utterance to generate a frame sequence of acoustic parameter vectors representative thereof;
  
  b) providing a first network representing a recognition vocabulary, wherein each branch of the first network is a model representing a phoneme and each complete path through the first network is a sequence of models representing a word in the recognition vocabulary;
  
  c) providing a second network derived from the first network, in which all sequences of three consecutive phonemes present in the first network are present;
  
  d) computing transitional probabilities for each node of the second network given the frame sequence of acoustic parameter vectors;
  
  e) searching the second network to determine optimal cumulative probabilities for each node of the second network for all frames of the frame sequence;
  
  f) storing the cumulative probabilities as estimate scores for estimating partial paths in the first network;
  
  g) computing point scores for all phonemes in dependence upon the frame sequence of acoustic parameter vectors;
  
  h) determining a complete path through the first network by evaluating successive one phoneme extensions of partial paths using the estimate scores for the nodes of the second network to find the partial path to extend;
  
  wherein the step of determining includes the step of initiating a search through the first network by;
  
  establishing a stack for recording all paths from the end of the first network;
  
  looking along all branches of the network two phonemes;
  
  obtaining estimate scores for each node of the second network corresponding to each two phoneme branch of the first network;
  
  entering each estimate score into the stack that arranges the scores in descending order;
  
  wherein the step of determining includes the steps of expanding a top entry in the stack by;
  
  obtaining point scores for the first phoneme of the two phoneme branch closest to the end of the first network;
  
  for every exit time in the stack entry and for all possible entrance times for the phoneme determine total actual probability by adding exit scores from the stack entry to point scores for the first phoneme;
  
  computing a new stack entry by adding estimate scores (P_est) for a next two phoneme node of the second network to the total actual probability (P_act) for all possible entrance times, selecting n best total probabilities (P_act +P_est), where n is an integer, and storing the total actual probabilities P_act and frame times for each, together with the best total probability (P_act +P_est) and a phoneme sequence as the new stack entry;
  
  wherein the step of expanding a top entry includes the steps of;
  
  a) storing a least number, q, of frames used in the estimate score for the top entry of the stack;
  
  b) prior to expanding any top entry of the stack, determining the number of frames, r, used in its estimate score; and
  
  c) discarding, from the stack, the top entry when r is greater than the greater of (q+75 and q+s/2) where s is the length of the unknown utterance in frames.
  - 25. A method as claimed in claim 24 wherein the steps are completed using both cepstral parameters and LSP parameters, the step of providing the word in the recognition vocabulary providing word 1 with a cumulative probability L1(word 1) for the cepstral parameter and word 2 with a cumulative probability L2(word 2) for the LSP parameters and wherein a joint recognition includes the further steps of:
    - if word 1 and word 2 are the same, providing word 1 as the speech recognition output, otherwise;

      determining a cumulative probability for word 1 using the LSP parameters (L2(word 1)) and a cumulative probability for word 2 using the cepstral parameters (L1(word 2));

      if L1(word 1) × L2(word 1) is greater than L1(word 2) × L2(word 2), providing word 1 as the speech recognition output, otherwise;

      providing word 2 as the speech recognition output.
  - 26. A method as claimed in claim 24 wherein the steps are completed using both cepstral parameters and LSP parameters, the step of providing the word in the recognition vocabulary providing word 1 with a cumulative probability L1(word 1) for the cepstral parameter and word 2 with a cumulative probability L2(word 2) for the LSP parameters and wherein a joint recognition includes the further steps of:
    - if word 1 and word 2 are the same;

      if the word length is less than 7 phonemes and (L1(word 1)+L2(word 1)) is less than a first threshold T1, reject the unknown utterance as out-of-vocabulary;

      otherwise, if the word length is between 6 phonemes and 15 phonemes and (L1(word 1)+L2(word 1)) is less than a second threshold T2, reject the unknown utterances as out-of-vocabulary;

      otherwise, if the word length is greater than 14 phonemes and (L1(word 1)+L2(word 1)) is less than a third threshold T3, reject the unknown utterance as out-of-vocabulary;

      otherwise, if word 1 and word 2 are different, reject the unknown utterance as out-of-vocabulary.
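The stack discard rule at the end of claim 24 and the joint decision logic of claims 25 and 26 reduce to a few comparisons, sketched below. The function names are hypothetical, the scorers `l1` and `l2` stand in for the cepstral and LSP recognizers, and "between 6 phonemes and 15 phonemes" is read as 7 to 14 inclusive, which is the reading that makes the three length bands non-overlapping.

```python
def should_discard(r, q, s):
    """Claim 24 rule: discard the top stack entry when its estimate score used
    r frames and r exceeds the greater of q + 75 and q + s / 2, where q is the
    least number of frames seen so far and s is the utterance length in frames.
    """
    return r > max(q + 75, q + s / 2)


def joint_recognition(word1, word2, l1, l2):
    """Claim 25: pick between the cepstral choice (word1) and LSP choice (word2).

    l1 and l2 are callables returning the cumulative probability of a word
    under the cepstral and LSP parameters respectively.
    """
    if word1 == word2:
        return word1
    if l1(word1) * l2(word1) > l1(word2) * l2(word2):
        return word1
    return word2


def reject_as_out_of_vocabulary(word1, word2, n_phonemes, score_sum, t1, t2, t3):
    """Claim 26: reject on disagreement, or on a low combined score.

    score_sum is L1(word 1) + L2(word 1); t1, t2, t3 are the thresholds for
    short, medium, and long words.
    """
    if word1 != word2:
        return True
    if n_phonemes < 7:
        return score_sum < t1
    if n_phonemes < 15:
        return score_sum < t2
    return score_sum < t3
```

Claim 25 always returns one of the two candidates, whereas claim 26 treats any disagreement between the two recognizers as grounds for rejection.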

27. Apparatus for speech recognition, comprising:
- a) means for analyzing an unknown utterance to generate a frame sequence of acoustic parameter vectors representative thereof;
  
  b) means for providing a first network representing a recognition vocabulary, wherein each branch of the first network is a model representing a phoneme and each complete path through the first network is a sequence of models representing a word in the recognition vocabulary;
  
  c) means for providing a second network derived from the first network, in which all sequences of three consecutive phonemes present in the first network are present;
  
  d) means for computing transitional probabilities for each node of the second network given the frame sequence of acoustic parameter vectors;
  
  e) means for searching the second network to determine optimal cumulative probabilities for each node of the second network for all frames of the frame sequence;
  
  f) means for storing the cumulative probabilities as estimate scores for estimating partial paths in the first network;
  
  g) means for computing point scores for all phonemes in the second network in dependence upon the frame sequence of acoustic parameter vectors; and
  
  h) means for determining a complete path through the first network by evaluating successive one phoneme extensions of partial paths using the estimate scores for the nodes of the second network to find the partial path to extend.
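The second network of claims 24 and 27 contains every sequence of three consecutive phonemes found in the first network. One way to picture its node set is to collect the phoneme trigrams of each vocabulary entry, as in the sketch below. The flat word-to-transcription lexicon, the ASCII phoneme labels, and the `phoneme_trigrams` helper are assumptions; the patent's actual network construction is not reproduced here.

```python
def phoneme_trigrams(lexicon):
    """Collect every run of three consecutive phonemes in the vocabulary.

    lexicon maps each word to its phoneme transcription (a list of labels).
    Words shorter than three phonemes contribute nothing in this sketch.
    """
    trigrams = set()
    for phonemes in lexicon.values():
        for i in range(len(phonemes) - 2):
            trigrams.add(tuple(phonemes[i:i + 3]))
    return trigrams


# Hypothetical two-word vocabulary sharing most of its phonemes.
LEXICON = {"cat": ["k", "a", "t"], "cats": ["k", "a", "t", "s"]}
reduced_nodes = phoneme_trigrams(LEXICON)
```

Because many words share trigrams, this set grows much more slowly than the vocabulary, which is what makes an exhaustive search of the second network affordable as a source of estimate scores for the search of the first.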

28. Apparatus for providing information via a telephone network, comprising:
- means for accepting a call from an operator via the telephone network;
  
  means for prompting the operator to request information;
  
  means for recognizing speech from the operator to identify a member of a recognition vocabulary, the recognition vocabulary being represented by a first network, wherein each branch represents a phoneme and each complete path through the first network is a sequence of models representing a member of the recognition vocabulary and wherein a second network is derived from the first network, in which all sequences of three consecutive phonemes present in the first network are present;
  
  means for accessing a computerized information source to request information from the computerized information source in dependence upon the member of the recognition vocabulary identified; and
  
  means for synthesizing speech to provide the accessed information to the operator in verbal form.
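The flow of claim 28 (prompt, recognize, look up, speak) can be pictured as a small pipeline. Every callable below is a hypothetical stand-in supplied by the caller, and the prompt wording is invented for illustration.

```python
def handle_call(audio, prompt, recognize, lookup, synthesize):
    """Run one information request over an accepted call.

    prompt, recognize, lookup, and synthesize are injected callables that
    stand in for the prompting, recognition, information access, and speech
    synthesis means named in the claim.
    """
    prompt("Please say the name of the item you want.")  # illustrative wording
    member = recognize(audio)        # member of the recognition vocabulary
    information = lookup(member)     # query the computerized information source
    return synthesize(information)   # deliver the answer in verbal form
```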

Current Assignee: Bell Canada (BCE Incorporated)
Original Assignee: Bell Canada (BCE Incorporated)
Inventors: Kenny, Patrick J.; Lennig, Matthew; Toulson, Christopher K.; Gupta, Vishwa N.
Primary Examiner(s): Knepper, David D.
Application Number: US07/772,903
Time in Patent Office: 1,225 Days
Field of Search: 381/43, 381/41, 381/42, 379/88, 395/2, 395/2.49, 395/2.5, 395/2.52, 395/2.63-2.65
US Class Current: 704/243
CPC Class Codes:
G10L 15/142   Hidden Markov Models [HMMs]
G10L 15/144   Training of HMMs
G10L 25/24   the extracted parameters be...
