Methods and apparatus for discriminative training and adaptation of pronunciation networks

US 6,076,053 A
Filed: 05/21/1998
Issued: 06/13/2000
Est. Priority Date: 05/21/1998
Status: Expired due to Fees

Abstract

A speech recognition method comprises the steps of using given speech data and the N-best algorithm to generate alternative pronunciations and then merging the obtained pronunciations into a pronunciation networks structure; using additional parameters to characterize a pronunciation network for a particular word; optimizing the parameters of the pronunciation networks using a minimum classification error criterion that maximizes a discrimination between different pronunciation networks; and adapting parameters of the pronunciation networks by, first, adjusting probabilities of the possible pronunciations that may be generated by the pronunciation network for a word claimed to be a true one and, second, to correct weights for all of the pronunciation networks by using the adjusted probabilities.

20 Claims

1. A speech recognition method, comprising:
- using given speech data and the N-best algorithm to generate alternative pronunciations and then merging the obtained pronunciations into a pronunciation network structure containing pronunciation networks for words in the given speech data;
  
  using additional parameters to characterize the pronunciation network for a particular word;
  
  optimizing the parameters used to characterize the pronunciation network using a minimum classification error criterion that maximizes a discrimination between pronunciation networks for different words;
  
  adapting parameters used to characterize the pronunciation network by, first, adjusting probabilities of the possible pronunciations that may be generated by the pronunciation network for a word claimed to be a true one and, second, to correct weights for all of the pronunciation networks in the pronunciation network structure by using the adjusted probabilities.
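
The claim names a minimum classification error (MCE) criterion but does not spell out a loss function. As a rough illustration only, the sketch below uses the generic MCE formulation from the speech recognition literature: a misclassification measure that compares the true word's score with a soft maximum over the rival words, passed through a sigmoid. The function names and the parameters `eta` and `gamma` are assumptions for this example and do not come from the patent.

```python
import math


def misclassification_measure(scores, true_idx, eta=1.0):
    """Generic MCE measure: d_k = -g_k + (1/eta) * log(mean_j!=k exp(eta * g_j)).

    `scores` holds one discriminant score (e.g. a log-likelihood) per word
    network; `true_idx` is the index of the word claimed to be the true one.
    """
    rivals = [g for j, g in enumerate(scores) if j != true_idx]
    if not rivals:
        raise ValueError("need at least one competing network")
    # Log-mean-exp over the rivals, shifted by the maximum for stability.
    m = max(rivals)
    mean_exp = sum(math.exp(eta * (g - m)) for g in rivals) / len(rivals)
    return -scores[true_idx] + m + math.log(mean_exp) / eta


def mce_loss(scores, true_idx, eta=1.0, gamma=1.0):
    """Smooth 0/1 error: close to 0 when the true network wins clearly."""
    d = misclassification_measure(scores, true_idx, eta)
    return 1.0 / (1.0 + math.exp(-gamma * d))
```

Minimizing this loss by gradient descent over the network parameters pushes the true word's score above its rivals, which is one way to read "maximizes a discrimination between pronunciation networks for different words".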

2. A method for generating alternative pronunciations from given speech data contained in a training sample $O^{(k)}$, the method comprising:
- performing the N-best algorithm on training sample $O^{(k)}$, transforming it from a feature space $X$ into a discrete string space $L$;
  
  ##EQU11## where $l_n^{(k)}$ $(1 \le n \le N)$ is a set of N-best pronunciations for the sample $O^{(k)}$, and where the strings $l_n^{(k)}$ $(2 \le n \le N)$ cover the space in the vicinity of best decoded string $l_1^{(k)}$;
  
  defining the score for best decoded string $l_1^{(k)}$ as $\rho(l_1^{(k)})$ and defining a score interval $\Delta\varepsilon$;
  
  detecting strings satisfying the following condition;
  
  $$[\rho(l_1^{(k)}) - n \cdot \Delta\varepsilon;\ \rho(l_1^{(k)}) - (n-1) \cdot \Delta\varepsilon], \quad 1 \le n \le N$$
  
  using a backward search, selecting those strings, the scores of which fall into score intervals that have not yet been occupied by already grown strings;
  
  merging the obtained N pronunciation strings into a pronunciation network.
- - 3. A method according to claim 2, wherein there are multiple training samples $O^{(k)}$ $(1 \le r \le R)$, and wherein the step of merging the obtained N pronunciation strings into a pronunciation network includes the step of using a clusterization procedure for all the N-best candidates taken from a common list of the N-best strings for all the R training samples.
  - 4. A method according to claim 3, wherein the clusterization procedure comprises:
    - obtaining N pronunciations for each of the R training samples $O^{(k)}$ and pooling all of them to create a common list of pronunciations, keeping count of any identical strings;
      
      defining a distance between lexical strings;
      
      using a clustering procedure to get M strings that are the cluster centroids; and
      
      merging the M strings-centroids into a pronunciation network, each arc of the pronunciation network corresponding to a subword unit.
  - 5. A method according to claim 4, wherein the step of defining a distance between lexical strings includes using a Levinstein distance.
  - 6. A method according to claim 4, wherein the step of defining a distance between lexical strings includes using a distance obtained as the results of a Viterbi decoding.
  - 7. A method according to claim 4, wherein the step of using a clustering procedure includes using a K-means clustering procedure.
  - 8. A method according to claim 2, further including the following step:
    - using additional parameters to characterize a pronunciation network for a particular word.
  - 9. A method according to claim 8, wherein the step of using additional parameters to characterize a pronunciation network for a particular word includes assigning a score $\rho_j^{(k)}$ to arc $j$ for the word and obtaining a modified score $g_j^{(k)}$ using the following formula;
    
    $$g_j^{(k)} = u_j^{(k)} \cdot \rho_j^{(k)} + c_j^{(k)}.$$
  - 10. A method according to claim 9, wherein score $\rho_j^{(k)}$ is a logarithm of the likelihood.
  - 11. A method according to claim 8, wherein the step of using additional parameters to characterize a pronunciation network for a particular word includes weighting the state scores for each subword HMM assigned to a specific arc $j$ of the pronunciation network for the $k$-th word, by obtaining a modified score $g_j^{(k)}$ as follows:
    - ##EQU12## where $w_{js}^{(k)}$ is a state weight multiplicative term for the $s$-th state of the $j$-th arc HMM in the pronunciation network for the $k$-th word, $\rho_{js}^{(k)}$ is a corresponding score for the $s$-th state, $c_{js}^{(k)}$ is a state weight additive term, $S_j^{(k)}$ is a total number of HMM states for the subword unit assigned to the $j$-th arc of the pronunciation network for the $k$-th word.
  - 12. A method according to claim 8, wherein the step of using additional parameters to characterize a pronunciation network for a particular word includes:
    - using estimates of the probabilities $P(l_n^{(k)} \mid \Lambda)$ for all the $N^{(k)}$ phonemes strings $l_n^{(k)}$ $(1 \le n \le N)$ which may be generated by the $k$-th word pronunciation network.
  - 13. A method according to claim 12, including the following step:
    - upon initialization, evaluating the pronunciation network parameters $P(l_n^{(k)} \mid \Lambda)$ by counting the number of strings of subwords assigned to the $n$-th cluster of the $k$-th word pronunciation network, $l_n^{(k)}$ being the centroid for the cluster, $P(l_n^{(k)} \mid \Lambda)$ being modified during an adaptation of the $k$-th word pronunciation network if that word is supposed to be the true one.
  - 14. A method according to claim 9 or 11, wherein the step of using additional parameters to characterize a pronunciation network for a particular word includes:
    - defining a phone duration weighting, in which the phone HMMs are semi-Markov models;
      
      $$G_j^{(k)} = g_j^{(k)} + z_j^{(k)} \cdot \varphi(T_j^{(k)}) + x_j^{(k)}$$
      
      where $G_j^{(k)}$ is a modified score for the $j$-th arc of the $k$-th word pronunciation network, $z_j^{(k)}$ is a multiplicative term for the duration weighting, $x_j^{(k)}$ is a corresponding additive term for the duration weighting defining a phone insertion penalty, $T_j^{(k)}$ is a duration for the semi-Markov HMM assigned to the $j$-th arc of the $k$-th word pronunciation network, and $\varphi(T_j^{(k)})$ is a log probability to obtain duration $T_j$.
  - 15. A method according to claim 8, wherein the step of using additional parameters to characterize a pronunciation network for a particular word includes:
    - using a discriminative minimum classification error to optimize the parameters of the pronunciation network.
  - 16. A method according to claim 8, wherein the step of using additional parameters to characterize a pronunciation network including adapting the parameters that describe the pronunciation networks as follows:
    - determining a class $k$ that represents a current adaptation sample $O$;
      
      $$O \quad O^{(k)}$$
      
      adjusting the estimates $P(l_n^{(k)} \mid \Lambda)$ for all pronunciation string probabilities $(1 \le n \le N^{(k)})$ based on a set of a posteriori probability estimates for the adaptation sample $O^{(k)}$ consisting of $N^{(k)}$ probability estimates $P(O^{(k)} \mid \Lambda, l_n^{(k)})$;
      
      using the adjusted values for $P(l_n^{(k)} \mid \Lambda)$ to reevaluate the distances $G(S^{(r)}; S^{(k)})$ from all of the pronunciation networks $S^{(r)}$ $(1 \le r \le K)$ to the specified network $S^{(k)}$;
      
      using the reevaluated distances $G(S^{(r)}; S^{(k)})$ to adapt the parameters for all pronunciation networks.
  - 17. A method according to claim 16, wherein the step of adjusting the estimates $P(l_n^{(k)} \mid \Lambda)$ for all pronunciation string probabilities $(1 \le n \le N^{(k)})$ based on a set of a posteriori probability estimates for the adaptation sample $O^{(k)}$ consisting of $N^{(k)}$ probability estimates $P(O^{(k)} \mid \Lambda, l_n^{(k)})$, includes adjusting the estimates $P(l_n^{(k)} \mid \Lambda)$ for a new adaptation sample $O^{(k)}$ representing the $k$-th pronunciation network as follows;
    - estimating the a posteriori probabilities;
      
      $$P(O^{(k)} \mid \Lambda, l_n^{(k)}) = \exp[g(O^{(k)} \mid \Lambda, l_n^{(k)})], \quad 1 \le n \le N^{(k)}$$
      
      where $g(O^{(k)} \mid \Lambda, l_n^{(k)})$ is a likelihood score obtained after Viterbi decoding of the sample $O^{(k)}$ versus phone string $l_n^{(k)}$, and where $N^{(k)}$ is the total number of such phone strings for the network $k$;
      
      estimating $P(l_n^{(k)} \mid \Lambda, O^{(k)})$ as follows;
      
      ##EQU13##
      
      estimating an adapted value $P_{adapt}(l_n^{(k)} \mid \Lambda)$ by combining the value of $P(l_n^{(k)} \mid \Lambda)$ and the adaptation-conditioned value of $P(l_n^{(k)} \mid \Lambda, O^{(k)})$ as follows;
      
      $$P_{adapt}(l_n^{(k)} \mid \Lambda) = L \cdot P(l_n^{(k)} \mid \Lambda) + (1 - L) \cdot P(l_n^{(k)} \mid \Lambda, O^{(k)}), \quad 1 \le n \le N^{(k)}$$
      
      where $0 < L < 1$, and where $L$ is a constant or a function that is dependent on time;
      
      normalizing the value of $P_{adapt}(l_n^{(k)} \mid \Lambda)$ as follows;
      
      ##EQU14##
      
      assigning normalized values $P_{adapt}^{norm}(l_n^{(k)} \mid \Lambda)$ the corresponding adjusted values of $P(l_n^{(k)} \mid \Lambda)$ as follows;
      
      $$P(l_n^{(k)} \mid \Lambda) = P_{adapt}^{norm}(l_n^{(k)} \mid \Lambda), \quad 1 \le n \le N^{(k)}.$$
  - 18. A method according to claim 16, wherein the step of using the reevaluated distances $G(S^{(r)}; S^{(k)})$ to adapt the parameters for all pronunciation networks includes using a minimum classification error criterion.
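
The probability update recited in claim 17 can be sketched in a few lines. Two of its equations (EQU13 and EQU14) are not reproduced in this text, so the sketch assumes the usual forms: Bayes' rule for the posterior over pronunciation strings, and division by the sum for the normalization. The function name and the default interpolation constant are likewise assumptions made for illustration.

```python
import math


def adapt_pronunciation_probs(priors, log_scores, L=0.9):
    """Update P(l_n | Lambda) for one word network from one adaptation sample.

    priors     -- current P(l_n | Lambda), one value per pronunciation string
    log_scores -- Viterbi log-likelihood scores g(O | Lambda, l_n)
    L          -- interpolation constant, 0 < L < 1 (a constant here, although
                  the claim also allows a time-dependent function)
    """
    if not 0.0 < L < 1.0:
        raise ValueError("L must lie strictly between 0 and 1")
    # P(O | Lambda, l_n) = exp(g). Subtracting the maximum score avoids
    # underflow and cancels out in the posterior below.
    m = max(log_scores)
    joint = [p * math.exp(g - m) for p, g in zip(priors, log_scores)]
    # Assumed form of EQU13: posterior via Bayes' rule.
    total = sum(joint)
    posteriors = [j / total for j in joint]
    # Interpolate the old estimate with the sample-conditioned estimate.
    adapted = [L * p + (1.0 - L) * q for p, q in zip(priors, posteriors)]
    # Assumed form of EQU14: renormalize so the values sum to one.
    norm = sum(adapted)
    return [a / norm for a in adapted]
```

With priors that already sum to one, the interpolated values also sum to one, so under these assumptions the normalization mainly guards against accumulated rounding error.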

19. A speech recognition system, comprising:
- a speech data input;
  
  a digital speech sampler for digitally sampling the speech data input;
  
  an acoustic signal processor for processing the digitally sampled data;
  
  a speech recognition stage for recognizing subwords and words in the digitally sampled and processed data by comparing the data with a pronunciation network structure, the pronunciation network structure being generated by the following method;
  
  using given speech data and the N-best algorithm to generate alternative pronunciations and then merging the obtained pronunciations into a pronunciation network structure containing pronunciation networks for words in the given speech data;
  
  using additional parameters to characterize the pronunciation network for a particular word;
  
  optimizing the parameters used to characterize the pronunciation network using a minimum classification error criterion that maximizes a discrimination between pronunciation networks for different words;
  
  adapting parameters used to characterize the pronunciation network by, first, adjusting probabilities of the possible pronunciations that may be generated by the pronunciation network for a word claimed to be a true one and, second, to correct weights for all of the pronunciation networks in the pronunciation network structure by using the adjusted probabilities.
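
The merging step above is elaborated in claims 3 to 7 as a clusterization of pooled N-best strings using a string distance. The following is a minimal sketch of that idea, under stated assumptions: it uses the standard edit (Levenshtein) distance over phone sequences, and because a mean of strings is not defined it substitutes a k-medoids-style update (each centroid is the member string with the lowest count-weighted distance to its cluster) for the K-means procedure named in claim 7. All function names and the initialization by most frequent strings are hypothetical.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute = 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def cluster_pronunciations(pooled, m, iters=10):
    """Reduce a pooled list of phone strings (tuples) to at most m centroids."""
    counts = Counter(pooled)          # keeps count of identical strings
    unique = list(counts)
    # Assumed initialization: start from the m most frequent strings.
    centroids = [s for s, _ in counts.most_common(m)]
    for _ in range(iters):
        clusters = {c: [] for c in centroids}
        for s in unique:
            nearest = min(centroids, key=lambda c: levenshtein(s, c))
            clusters[nearest].append(s)
        # Medoid update: pick the member with the lowest weighted distance sum.
        updated = [
            min(members,
                key=lambda c: sum(counts[s] * levenshtein(s, c) for s in members))
            for members in clusters.values()
        ]
        if set(updated) == set(centroids):
            break
        centroids = updated
    return centroids
```

For example, pooling three copies of `("t", "ah", "m", "ey", "t", "ow")`, two of `("t", "ah", "m", "aa", "t", "ow")` and one of `("t", "ah", "m", "ey", "t", "ah")` with `m=2` keeps the two frequent variants as centroids, which would then be merged into a network with one arc per subword unit.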

20. A speech recognition system, comprising:
- a speech data input;
  
  a digital speech sampler for digitally sampling the speech data input;
  
  an acoustic signal processor for processing the digitally sampled data;
  
  a speech recognition stage for recognizing subwords and words in the digitally sampled and processed data by comparing the data with stored pronunciation networks, the stored pronunciation networks generated from given speech data contained in a training sample $O^{(k)}$ using the following method;
  
  (a) performing the N-best algorithm on training sample $O^{(k)}$, transforming it from a feature space $X$ into a discrete string space $L$;
  
  ##EQU15## where $l_n^{(k)}$ $(1 \le n \le N)$ is a set of N-best pronunciations for the sample $O^{(k)}$, and where the strings $l_n^{(k)}$ $(2 \le n \le N)$ cover the space in the vicinity of best decoded string $l_1^{(k)}$;
  
  (b) defining the score for best decoded string $l_1^{(k)}$ as $\rho(l_1^{(k)})$ and defining a score interval $\Delta\varepsilon$;
  
  (c) detecting strings satisfying the following condition;
  
  $$[\rho(l_1^{(k)}) - n \cdot \Delta\varepsilon;\ \rho(l_1^{(k)}) - (n-1) \cdot \Delta\varepsilon], \quad 1 \le n \le N$$
  
  using a backward search, selecting those strings, the scores of which fall into score intervals that have not yet been occupied by already grown strings; and
  
  (d) merging the obtained N pronunciation strings into a pronunciation network.
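
Steps (b) and (c) above amount to keeping at most one string per score interval of width Δε below the best score. The sketch below illustrates only that selection rule. It assumes the candidate strings and their scores are already available as a list, whereas the claim obtains them through a backward search that is not modeled here; the function and parameter names are hypothetical.

```python
import math


def select_by_score_interval(candidates, delta, n_best):
    """Keep at most one candidate per score interval below the best score.

    candidates -- list of (string, score) pairs, higher score is better
    delta      -- width of one score interval
    n_best     -- number of intervals N to consider

    Interval n (1 <= n <= N) spans [best - n*delta, best - (n-1)*delta].
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    best = max(score for _, score in candidates)
    occupied = {}
    # Visit candidates from best to worst so the top string fills interval 1.
    for string, score in sorted(candidates, key=lambda c: -c[1]):
        # A score on a shared boundary is assigned to the higher interval.
        n = max(1, math.ceil((best - score) / delta))
        if n <= n_best and n not in occupied:
            occupied[n] = (string, score)
    return [occupied[n] for n in sorted(occupied)]
```

Spacing the retained strings by score in this way favors alternatives that differ meaningfully from the best decoded string over near-duplicates that score almost the same.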

Current Assignee
Lucent Technologies, Inc. (Nokia Corporation)
Original Assignee
Lucent Technologies, Inc. (Nokia Corporation)
Inventors
Korkmazskiy, Filipp E., Juang, Biing-Hwang
Primary Examiner(s)
Hudspeth, David R.
Assistant Examiner(s)
Lerner, Martin

Application Number

US09/082,854
Time in Patent Office

754 Days
Field of Search

704/236, 704/243, 704/244, 704/251, 704/256, 704/231, 704/255
US Class Current

704/236
CPC Class Codes

G10L 15/06   Creation of reference templ...

G10L 15/16   using artificial neural net...

G10L 2015/0635   updating or merging of old ...
