Methods and apparatus for discriminative training and adaptation of pronunciation networks
Abstract
A speech recognition method comprises the steps of: using given speech data and the N-best algorithm to generate alternative pronunciations and then merging the obtained pronunciations into a pronunciation network structure; using additional parameters to characterize the pronunciation network for a particular word; optimizing the parameters of the pronunciation networks using a minimum classification error criterion that maximizes the discrimination between different pronunciation networks; and adapting the parameters of the pronunciation networks by, first, adjusting the probabilities of the possible pronunciations that may be generated by the pronunciation network for a word claimed to be the true one and, second, correcting the weights for all of the pronunciation networks using the adjusted probabilities.
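As an illustration of the minimum classification error criterion named in the abstract, the following is a minimal sketch of the generic MCE formulation (a misclassification measure smoothed over rival classes, passed through a sigmoid loss). It is not taken from this patent's specification; the function names, the per-word score dictionary, and the smoothing parameters `eta` and `gamma` are assumptions made for the example.

```python
import math


def misclassification_measure(scores, true_word, eta=1.0):
    """Generic MCE misclassification measure d (illustrative, not the patent's exact form).

    scores: dict mapping each word to the score of its pronunciation network
            (higher means a better match; hypothetical input format).
    Returns d = -g_true + (1/eta) * log(mean over rivals of exp(eta * g_rival)).
    d < 0 suggests a correct decision, d > 0 a misclassification.
    """
    rivals = [g for word, g in scores.items() if word != true_word]
    if not rivals:
        raise ValueError("at least one rival word is required")
    g_true = scores[true_word]
    # Subtract the maximum before exponentiating to keep the sum numerically stable.
    m = max(rivals)
    mean_exp = sum(math.exp(eta * (g - m)) for g in rivals) / len(rivals)
    return -g_true + m + math.log(mean_exp) / eta


def mce_loss(d, gamma=1.0):
    """Sigmoid loss: a smooth, differentiable stand-in for the 0/1 error count."""
    z = gamma * d
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Hypothetical log-likelihood scores of three word networks for one utterance.
    utterance_scores = {"cat": -10.0, "cap": -12.0, "cut": -15.0}
    d = misclassification_measure(utterance_scores, "cat")
    print(d, mce_loss(d))
```

Because the loss is differentiable in the network scores, its gradient can drive parameter updates that push the true word's score up and the rivals' scores down, which is the sense in which the criterion increases discrimination between networks.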
20 Claims
1. A speech recognition method, comprising:
using given speech data and the N-best algorithm to generate alternative pronunciations and then merging the obtained pronunciations into a pronunciation network structure containing pronunciation networks for words in the given speech data;
using additional parameters to characterize the pronunciation network for a particular word;
optimizing the parameters used to characterize the pronunciation network using a minimum classification error criterion that maximizes a discrimination between pronunciation networks for different words;
adapting parameters used to characterize the pronunciation network by, first, adjusting probabilities of the possible pronunciations that may be generated by the pronunciation network for a word claimed to be a true one and, second, to correct weights for all of the pronunciation networks in the pronunciation network structure by using the adjusted probabilities.
2. A method for generating alternative pronunciations from given speech data contained in a training sample $O^{(k)}$, the method comprising:
performing the N-best algorithm on training sample $O^{(k)}$, transforming it from a feature space $X$ into a discrete string space $L$:
##EQU11##
where $l_n^{(k)}$ ($1 \le n \le N$) is a set of N-best pronunciations for the sample $O^{(k)}$, and where the strings $l_n^{(k)}$ ($2 \le n \le N$) cover the space in the vicinity of best decoded string $l_1^{(k)}$;
defining the score for best decoded string $l_1^{(k)}$ as $\rho(l_1^{(k)})$ and defining a score interval $\Delta\varepsilon$;
detecting strings satisfying the following condition:
$$[\rho(l_1^{(k)}) - n \cdot \Delta\varepsilon;\ \rho(l_1^{(k)}) - (n-1) \cdot \Delta\varepsilon], \quad 1 \le n \le N$$
using a backward search, selecting those strings, the scores of which fall into score intervals that have not yet been occupied by already grown strings;
merging the obtained N pronunciation strings into a pronunciation network.
Dependent claims: 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
19. A speech recognition system, comprising:
a speech data input;
a digital speech sampler for digitally sampling the speech data input;
an acoustic signal processor for processing the digitally sampled data;
a speech recognition stage for recognizing subwords and words in the digitally sampled and processed data by comparing the data with a pronunciation network structure, the pronunciation network structure being generated by the following method:
using given speech data and the N-best algorithm to generate alternative pronunciations and then merging the obtained pronunciations into a pronunciation network structure containing pronunciation networks for words in the given speech data;
using additional parameters to characterize the pronunciation network for a particular word;
optimizing the parameters used to characterize the pronunciation network using a minimum classification error criterion that maximizes a discrimination between pronunciation networks for different words;
adapting parameters used to characterize the pronunciation network by, first, adjusting probabilities of the possible pronunciations that may be generated by the pronunciation network for a word claimed to be a true one and, second, to correct weights for all of the pronunciation networks in the pronunciation network structure by using the adjusted probabilities.
20. A speech recognition system, comprising:
a speech data input;
a digital speech sampler for digitally sampling the speech data input;
an acoustic signal processor for processing the digitally sampled data;
a speech recognition stage for recognizing subwords and words in the digitally sampled and processed data by comparing the data with stored pronunciation networks, the stored pronunciation networks generated from given speech data contained in a training sample $O^{(k)}$ using the following method:
(a) performing the N-best algorithm on training sample $O^{(k)}$, transforming it from a feature space $X$ into a discrete string space $L$:
##EQU15##
where $l_n^{(k)}$ ($1 \le n \le N$) is a set of N-best pronunciations for the sample $O^{(k)}$, and where the strings $l_n^{(k)}$ ($2 \le n \le N$) cover the space in the vicinity of best decoded string $l_1^{(k)}$;
(b) defining the score for best decoded string $l_1^{(k)}$ as $\rho(l_1^{(k)})$ and defining a score interval $\Delta\varepsilon$;
(c) detecting strings satisfying the following condition:
$$[\rho(l_1^{(k)}) - n \cdot \Delta\varepsilon;\ \rho(l_1^{(k)}) - (n-1) \cdot \Delta\varepsilon], \quad 1 \le n \le N$$
using a backward search, selecting those strings, the scores of which fall into score intervals that have not yet been occupied by already grown strings; and
(d) merging the obtained N pronunciation strings into a pronunciation network.
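The score-interval selection recited in claims 2 and 20 can be illustrated with a small sketch. It assumes a list of candidate pronunciation strings with scores is already available (the claims grow strings with a backward search, which is not reproduced here), that a higher score is better, and that at most one string is kept per interval. The function name `select_by_score_interval` and the input format are hypothetical.

```python
import math


def select_by_score_interval(candidates, delta_eps, n_best):
    """Keep at most one candidate per score interval below the best score.

    candidates: list of (pronunciation_string, score) pairs; higher score is better
                (hypothetical input format, not specified by the claims).
    delta_eps:  width of each score interval (must be positive).
    n_best:     number of intervals N considered below the best score.

    Interval n (1 <= n <= N) is
        [best - n * delta_eps, best - (n - 1) * delta_eps].
    """
    if delta_eps <= 0:
        raise ValueError("delta_eps must be positive")
    if not candidates:
        return []

    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_score = ranked[0][1]

    occupied = set()   # interval indices that already hold a selected string
    selected = []
    for string, score in ranked:
        # Index of the interval containing this score; the best string lands in n = 1.
        n = max(1, math.ceil((best_score - score) / delta_eps))
        if n > n_best:
            break      # scores only decrease from here on, so nothing else qualifies
        if n in occupied:
            continue   # interval already occupied by a better-scoring string
        occupied.add(n)
        selected.append((string, score))
    return selected


if __name__ == "__main__":
    # Hypothetical phone strings and log-likelihood scores for one training sample.
    hyps = [("k ae t", -10.0), ("k ah t", -10.4), ("k eh t", -11.5),
            ("g ae t", -11.7), ("k ae d", -13.2)]
    for string, score in select_by_score_interval(hyps, delta_eps=1.0, n_best=4):
        print(string, score)
```

Spreading the selected strings across distinct intervals keeps near-duplicates of the best string from crowding out alternatives that score lower but differ more, which appears to be the purpose of the interval condition.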
Specification