Method and apparatus for context-dependent estimation of multiple probability distributions of phonetic classes with multilayer perceptrons in a speech recognition system

US 5,317,673 A
Filed: 06/22/1992
Issued: 05/31/1994
Est. Priority Date: 06/22/1992
Status: Expired due to Term

Abstract

In a hidden Markov model-based speech recognition system, multilayer perceptrons (MLPs) are used in context-dependent estimation of a plurality of state-dependent observation probability distributions of phonetic classes. Estimation is obtained by the Bayesian factorization of the observation likelihood in terms of posterior probabilities of phone classes assuming the context and the input speech vector. The context-dependent estimation is employed as the state-dependent observation probabilities needed as parameter input to a hidden Markov model speech processor to identify the word sequence representing the unknown speech input of input speech vectors. Within the speech processor, models are provided which employ the observation probabilities in the recognition process. The number of context-dependent nets is reduced to a single net by sharing the units of the input layer and the hidden layer and the weights connecting them in the multilayer perceptron while providing one output layer for each relevant context. Each output layer is trained as an independent network on the specific examples of the corresponding context it represents. Training may be optimized at an intermediate set of weights between the context-independent-associated weights and the context-dependent associated weights to which training would normally converge.
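
The weight sharing described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `SharedHiddenMLP`, the layer sizes, the context labels, the random initialization, and the use of sigmoid hidden units with softmax outputs are all assumptions made for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(zs):
    # Subtract the maximum for numerical stability.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def affine(weights, bias, vec):
    # weights holds one row per output unit.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


class SharedHiddenMLP:
    """Hypothetical net: one input layer, one hidden layer, one output layer per context."""

    def __init__(self, n_in, n_hidden, n_classes, contexts, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        # Single set of input-to-hidden weights, shared by every context.
        self.w_ih = mat(n_hidden, n_in)
        self.b_h = [0.0] * n_hidden
        # One set of hidden-to-output weights per context.
        self.w_ho = {c: mat(n_classes, n_hidden) for c in contexts}
        self.b_o = {c: [0.0] * n_classes for c in contexts}

    def forward(self, y):
        # The hidden activations are computed once per speech vector.
        h = [sigmoid(a) for a in affine(self.w_ih, self.b_h, y)]
        # Each context-specific output layer reuses the same hidden activations.
        return {c: softmax(affine(self.w_ho[c], self.b_o[c], h))
                for c in self.w_ho}


net = SharedHiddenMLP(n_in=4, n_hidden=3, n_classes=5,
                      contexts=["ctx_a", "ctx_b"])
posteriors = net.forward([0.2, -0.1, 0.4, 0.0])
```

Here `posteriors` maps each context label to a list of class activations that sum to one, which stands in for the per-context posterior estimates p(q_j|Y,c_k).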

20 Claims

1. In a speech recognition apparatus having a hidden Markov model speech recognizer, a method for using a multilayer perceptron (MLP) for recognizing speech by context-dependent estimation of a plurality of state-dependent observation probability distributions of phonetic (phone) classes which has weights that have been obtained based on a training set of speech vectors, wherein said training set of said speech vectors has been used to create context-dependent phone classes for use in said method, said speech vectors being characterized by phone classes, the method comprising the steps of:
- applying input speech vectors containing unknown data to a single input layer of a multilayer perceptron, said multilayer perceptron having a single input layer, a single hidden layer, a single set of weights between said input layer and said hidden layer, and a plurality of output layers with an associated plurality of sets of weights between said hidden layer and said output layers, each one of said output layers having a plurality of output units for storing a plurality of probability values;
  
  forward propagating each input speech vector through said multilayer perceptron to produce an activation level representative of a probability value at each output unit within each one of said output layers;
  
  determining likelihood of observing each said input speech vector, assuming a specific state of a hidden Markov model by factoring, according to Bayes rule, said likelihood of observing being in terms of posterior probabilities of phone classes of the speech vector assuming context and the input speech vector, thereby obtaining values representative of context-dependent estimation; and
  
  employing as input to said hidden Markov model speech recognizer said values representative of context-dependent estimation as state-dependent observation probabilities to identify a specific estimated word sequence from said input speech vectors.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The method according to claim 1 wherein said determining step is factoring according to Bayes rule given by:
    - p(Y|q_j,c_k) = p(q_j|Y,c_k) * p(Y|c_k) / p(q_j|c_k)
      where p(Y|c_k) can be factored as
      p(Y|c_k) = p(c_k|Y) * p(Y) / p(c_k)
      where p(Y) is the probability of speech vector Y (which is a term which can be disregarded in the computation, as it will be identical for all recognition paths being compared), p(c_k) is the prior probability of the context class c_k computed by counting in the training set, p(Y|c_k) is the conditional probability of the speech vector Y given context c_k, p(c_k|Y) is the conditional probability of the context c_k given the speech vector Y, p(q_j|Y,c_k) is the context-dependent posterior probability of class j, given speech vector Y and context k, p(Y|q_j,c_k) is the observation likelihood of speech vector Y given class q_j and context c_k, p(q_j|c_k) is the conditional probability of the class q_j given context c_k.
  - 3. The method according to claim 1 further including training said multilayer perceptron to optimize said weights to smoothed values, said smoothed values being intermediate of context-independent posterior probability distributions and context-dependent posterior probability distributions using a cross-validation set.
  - 4. The method according to claim 3 wherein said training step comprises:
    - initializing input-to-hidden layer weights of said multilayer perceptron with input-to-hidden layer weights from a corresponding context-independent network;
      
      initializing each set of context specific hidden-to-output layer weights of said multilayer perceptron with hidden-to-output layer weights from a corresponding context-independent network; and
      
      applying the iterative Backpropagation Algorithm with relative entropy criterion to said context-dependent network by presenting training examples from a training set to said input layer, while forward propagating values representing activations only to that context-specific output layer corresponding to the context of the input speech vector and backpropagating and adjusting only those hidden-to-output layer weights corresponding to the context of the input speech vector until said cross-validation set indicates a localized minimum error of classification of speech vectors into phonetic classes.
  - 5. The method according to claim 3 wherein states of the hidden Markov model directly correspond to units of separate output layers associated with the separate states of each phone, wherein said employing step further comprises stepping through a sequence for each individual phone while referencing different output layers, said different output layers being organized as a sequence.
  - 6. The method according to claim 5 wherein said separate output layers are trained only with input speech vectors in the training set which are aligned with the corresponding state position within a phone so that a sequence of probabilities can be used to represent a phone and that the training of the MLP can be based on discrimination between phones without being based on discrimination between states within a single phone.
  - 7. The method according to claim 6 wherein each phone has a first state, a middle state and a last state, wherein said first state is constrained only by a predecessor context and said last state is constrained only by a following context.
  - 8. The method according to claim 3, wherein said determining step includes scaling said posterior probabilities of phone classes to convert the activation levels of the output units from smoothed context-dependent posterior probabilities to smoothed context-dependent scaled observation likelihoods.
  - 9. The method according to claim 8, wherein the scaling is according to the following relationship:
    - p(Y|q_j,c_k) = p(q_j|Y,c_k) * K^k_j,
      where K^k_j is the scaling factor for class q_j given context c_k, p(q_j|Y,c_k) is the context-dependent posterior probability of class q_j, given speech vector Y and context c_k, and p(Y|q_j,c_k) is the observation likelihood of speech vector Y given class q_j and context c_k.
  - 10. The method according to claim 9, wherein the scaling factor K^k_j from said posterior probabilities to said observation likelihoods is given by:
    - K^k_j = α^k_j / p(q_j) + (1 - α^k_j) * p(c_k|Y) / (p(q_j|c_k) * p(c_k)),
      where
      α^k_j = N_ci(j) / (N_ci(j) + b * N_cd(j,k)),
      p(q_j) being the prior probability of the phone class q_j computed by counting in the training set, p(c_k|Y) being the posterior probability of context class c_k given the input speech vector Y, p(q_j|c_k) being the conditional probability of phone class q_j given the context c_k computed by counting in the training set, p(c_k) being the prior probability of the context class c_k computed by counting in the training set, N_ci(j) being the number of examples of phone class q_j, N_cd(j,k) being the number of examples of phone class q_j given the context c_k, and b being a constant optimized on an independent development set.
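
The scaling recited in claims 8 to 10 can be illustrated with a short numeric sketch. The function names and every number below are hypothetical, chosen only to make the arithmetic easy to follow. The constant p(Y) is dropped, as claim 2 permits, so the result is a scaled likelihood rather than a true density.

```python
def smoothing_weight(n_ci, n_cd, b):
    """alpha^k_j = N_ci(j) / (N_ci(j) + b * N_cd(j,k))."""
    return n_ci / (n_ci + b * n_cd)


def scaling_factor(alpha, p_q, p_c_given_y, p_q_given_c, p_c):
    """K^k_j = alpha / p(q_j) + (1 - alpha) * p(c_k|Y) / (p(q_j|c_k) * p(c_k))."""
    context_independent = 1.0 / p_q
    context_dependent = p_c_given_y / (p_q_given_c * p_c)
    return alpha * context_independent + (1.0 - alpha) * context_dependent


def scaled_likelihood(posterior, k_factor):
    """p(Y|q_j,c_k), up to the constant p(Y), from the posterior p(q_j|Y,c_k)."""
    return posterior * k_factor


# Hypothetical counts and probabilities for one phone class j and one context k.
alpha = smoothing_weight(n_ci=1000, n_cd=250, b=4)  # 1000 / (1000 + 1000)
k = scaling_factor(alpha, p_q=0.1, p_c_given_y=0.3, p_q_given_c=0.2, p_c=0.25)
scaled = scaled_likelihood(posterior=0.4, k_factor=k)
```

With these made-up inputs, `alpha` is 0.5, so the context-independent term 1/p(q_j) = 10 and the context-dependent term 0.3/(0.2 * 0.25) = 6 are averaged to give a scaling factor of about 8, and the scaled likelihood is about 3.2. As written in the formula, a larger context-dependent count N_cd(j,k) lowers α^k_j and shifts weight toward the context-dependent term.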

11. A speech recognition apparatus comprising:
- a hidden Markov model speech recognizer means;
  
  a multilayer perceptron means (MLP), said MLP comprising;
  
  a single input layer for receiving a plurality of input speech vectors from a source of speech vectors, a single hidden layer, a single set of weights between said input layer and said hidden layer, and a plurality of output layers with an associated plurality of sets of weights between said hidden layer and said output layers, each one of said output layers having a plurality of output units for storing a plurality of probability values; and
  
  means for forward propagating each input speech vector through said multilayer perceptron means to produce an activation level representative of a probability value at each output unit within each one of said output layers;
  
  means coupled to said MLP for determining likelihood of observing each speech vector assuming a specific state of a hidden Markov model by factoring, according to Bayes rule, said likelihood of observing being in terms of posterior probabilities of phone classes of the speech vector assuming context and the input speech vector, thereby obtaining values representative of context-dependent estimation; and
  
  wherein said hidden Markov model speech recognizer means employs said values representative of context-dependent estimation as state-dependent observation probabilities to identify a specific estimated word sequence from said input speech vectors, for recognizing speech by context-dependent estimation of a plurality of state-dependent observation probability distributions of phone classes which has weights that have been obtained based on a training set of speech vectors.
- Dependent claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
  - 12. The apparatus according to claim 11 wherein said determining means comprises means for factoring according to Bayes rule given by:
    - p(Y|q_j,c_k) = p(q_j|Y,c_k) * p(Y|c_k) / p(q_j|c_k)
      where p(Y|c_k) can be factored as
      p(Y|c_k) = p(c_k|Y) * p(Y) / p(c_k)
      where p(Y) is the probability of speech vector Y (which is a term which can be disregarded in the computation, as it will be identical for all recognition paths being compared), p(c_k) is the prior probability of the context class c_k computed by counting in the training set, p(Y|c_k) is the conditional probability of the speech vector Y given context c_k, p(c_k|Y) is the conditional probability of the context c_k given the speech vector Y, p(q_j|Y,c_k) is the context-dependent posterior probability of class j, given speech vector Y and context k, p(Y|q_j,c_k) is the observation likelihood of speech vector Y given class q_j and context c_k, p(q_j|c_k) is the conditional probability of the class q_j given context c_k.
  - 13. The apparatus according to claim 11 further including means for training said multilayer perceptron to optimize said weights to smoothed values, said smoothed values being intermediate of context-independent posterior probability distributions and context-dependent posterior probability distributions using a cross-validation set.
  - 14. The apparatus according to claim 13 wherein said training means comprises:
    - means for initializing input-to-hidden layer weights of said multilayer perceptron with input-to-hidden layer weights from a corresponding context-independent network;
      
      means for initializing each set of context specific hidden-to-output layer weights of said multilayer perceptron with hidden-to-output layer weights from a corresponding context-independent network; and
      
      means for applying the iterative Backpropagation Algorithm with relative entropy criterion to said context-dependent network by presenting training examples from a training set to said input layer, forward propagating values representing activations only to that context-specific output layer corresponding to the context of the input speech vector, and backpropagating and adjusting only those hidden-to-output layer weights corresponding to the context of the input speech vector until said cross-validation set indicates a localized minimum error of classification of speech vectors into phonetic classes.
  - 15. The apparatus according to claim 13 wherein states of the hidden Markov model directly correspond to units of separate output layers associated with the separate states of each phone, wherein said employing means further comprises means for stepping through a sequence for each individual phone while referencing different output layers, said different output layers being organized as a sequence.
  - 16. The apparatus according to claim 15 wherein said separate output layers are trained only with input speech vectors in the training set which are aligned with the corresponding state position within a phone so that a sequence of probabilities can be used to represent a phone and that the training of the MLP can be based on discrimination between phones without being based on discrimination between states within a single phone.
  - 17. The apparatus according to claim 16 wherein each phone has a first state, a middle state and a last state, wherein said first state is constrained only by a predecessor context and said last state is constrained only by a following context.
  - 18. The apparatus according to claim 13, wherein said determining means includes means for scaling said posterior probabilities of phone classes to convert the activation levels of the output units from smoothed context-dependent posterior probabilities to smoothed context-dependent observation likelihoods.
  - 19. The apparatus according to claim 18, wherein the scaling is according to the following relationship:
    - p(Y|q_j,c_k) = p(q_j|Y,c_k) * K^k_j,
      where K^k_j is the scaling factor for class q_j given context c_k, p(q_j|Y,c_k) is the context-dependent posterior probability of class q_j, given speech vector Y and context c_k, and p(Y|q_j,c_k) is the observation likelihood of speech vector Y given class q_j and context c_k.
  - 20. The apparatus according to claim 19, wherein the scaling factor K^k_j from said posterior probabilities to said observation likelihoods is given by:
    - K^k_j = α^k_j / p(q_j) + (1 - α^k_j) * p(c_k|Y) / (p(q_j|c_k) * p(c_k)),
      where
      α^k_j = N_ci(j) / (N_ci(j) + b * N_cd(j,k)),
      p(q_j) being the prior probability of the phone class q_j computed by counting in the training set, p(c_k|Y) being the posterior probability of context class c_k given the input speech vector Y, p(q_j|c_k) being the conditional probability of phone class q_j given the context c_k computed by counting in the training set, p(c_k) being the prior probability of the context class c_k computed by counting in the training set, N_ci(j) being the number of examples of phone class q_j, N_cd(j,k) being the number of examples of phone class q_j given the context c_k, and b being a constant optimized on an independent development set.
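
The training procedure of claims 4 and 14 can be sketched in the same spirit. This is an illustration under stated assumptions, not the patented training code: the helper `train_step`, the softmax output layer, the learning rate, and all weights and activations are hypothetical. With one-hot targets the relative entropy criterion reduces to the cross-entropy used here. Cross-validation stopping is omitted for brevity.

```python
import math


def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(hidden, w_ho, context, target, lr=0.1):
    """One gradient step on the output layer of `context` only.

    hidden  -- hidden-layer activations for one speech vector, produced by the
               shared input-to-hidden weights, which stay fixed in this sketch
    w_ho    -- dict mapping context label to a weight matrix (one row per phone class)
    target  -- index of the correct phone class for this speech vector
    Returns the cross-entropy loss measured before the update.
    """
    rows = w_ho[context]
    probs = softmax([sum(w * h for w, h in zip(row, hidden)) for row in rows])
    for j, row in enumerate(rows):
        # Gradient of cross-entropy with respect to the pre-softmax activation.
        err = probs[j] - (1.0 if j == target else 0.0)
        for i, h in enumerate(hidden):
            row[i] -= lr * err * h
    return -math.log(probs[target])


# Hypothetical context-independent hidden-to-output weights (2 classes, 3 hidden units).
ci_rows = [[0.1, -0.2, 0.05], [-0.1, 0.3, 0.0]]

# Every context-specific output layer starts as a copy of the context-independent layer.
w_ho = {c: [row[:] for row in ci_rows] for c in ("ctx_a", "ctx_b")}

# A training example whose context is ctx_a updates only the ctx_a weights.
hidden = [0.5, 0.2, 0.9]
losses = [train_step(hidden, w_ho, "ctx_a", target=1) for _ in range(20)]
```

After the loop, the `ctx_b` weights still equal the context-independent starting point, while the `ctx_a` weights have moved and the loss on the example has decreased. In a fuller version, training would stop when a held-out cross-validation set stops improving, which is what keeps the weights between the context-independent and fully context-dependent solutions.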

Current Assignee: SRI International, Inc.
Original Assignee: SRI International, Inc.
Inventors: Cohen, Michael H.; Franco, Horacio E.
Primary Examiner(s): MacDonald, Allen R.
Assistant Examiner(s): KIM, RICHARD
Application Number: US07/901,716
Time in Patent Office: 708 Days
Field of Search: 381/42, 381/41, 395/2.41
US Class Current: 704/232
CPC Class Codes: G10L 15/144 Training of HMMs
