Phoneme based speech recognition
Abstract
A flexible vocabulary speech recognition system is provided for recognizing speech transmitted via the public switched telephone network. The flexible vocabulary recognition (FVR) system is phoneme-based: the phonemes are modelled as hidden Markov models, and the vocabulary is represented as concatenated phoneme models. The phoneme models are trained using Viterbi training enhanced by substituting the covariance matrices of given phonemes with those of other phonemes, and by applying energy-level thresholds and voiced/unvoiced/silence labelling constraints during Viterbi training. Specific vocabulary members, such as digits, are represented by allophone models. A* searching of the lexical network is facilitated by a reduced network that provides estimate scores used to evaluate the recognition path through the lexical network. Joint recognition and rejection of out-of-vocabulary words are provided by using both cepstrum and LSP parameter vectors.
28 Claims
1. A method of preparing phoneme models for recognition of speech received via telephone lines comprising the steps of:
a) analyzing a training word to generate a frame sequence of acoustic parameter vectors representative thereof and obtaining a phoneme model sequence of the training word;
b) providing a first set of model parameters representative of a sequence of state-transition models corresponding to each phoneme of the phoneme model sequence in the training word, the parameters including a mean vector and weighting factor for each transition and a covariance matrix for each model;
c) computing a set of observation probabilities for each phoneme of the phoneme model sequence in the training word and the first set of model parameters;
d) aligning the frame sequence of acoustic parameter vectors with the sequence of state-transition models to provide a mapping therebetween representative of a path through the sequence of state-transition models having a highest likelihood;
e) accumulating statistics for a plurality of utterances of said training word using the mapping of step d);
f) generating a second set of model parameters representative of the sequence of state-transition models corresponding to each phoneme of the phoneme model sequence in the training word;
g) repeating step d) for the second set of model parameters;
h) comparing the likelihood of the first and second sets of model parameters;
i) repeating steps b) through h), replacing the first set of model parameters by the second set of model parameters when the second set of parameters provides at least a predetermined improvement in likelihood; and
j) substituting the covariance matrix of a first model with the covariance matrix of a second model to provide a smooth covariance matrix thereby improving recognition accuracy for the first model.
Dependent claims: 2, 3, 4, 5
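Step d) of claim 1 describes a highest-likelihood alignment of frames to a sequence of state-transition models, which is commonly computed with the Viterbi algorithm. The following is a minimal sketch, not the patented implementation. It assumes a strict left-to-right topology in which each state either repeats or advances to the next state, and it omits transition weights for brevity. The function name `viterbi_align` and the `log_obs` layout are illustrative assumptions.

```python
NEG_INF = float("-inf")

def viterbi_align(log_obs):
    """Align T frames to S left-to-right states (illustrative sketch).

    log_obs[t][s] is the log observation probability of frame t in state s.
    Assumption: each state allows only a self-loop or a step to the next
    state, and all transitions are equally weighted (weights omitted).
    Requires T >= S so that every state can be visited.
    Returns (best_log_likelihood, list of state indices, one per frame).
    """
    T, S = len(log_obs), len(log_obs[0])
    score = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_obs[0][0]  # the path must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else NEG_INF
            if move > stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            score[t][s] = best + log_obs[t][s]
    # The path must end in the last state; trace the back-pointers.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[T - 1][S - 1], path

# Two states, four frames: early frames favour state 0, late frames state 1.
log_obs = [[-1.0, -5.0], [-1.0, -5.0], [-5.0, -1.0], [-5.0, -1.0]]
likelihood, mapping = viterbi_align(log_obs)
```

The returned mapping is what steps e) and f) would consume: statistics are accumulated per state from the frames assigned to it, and the likelihood is what step h) compares between parameter sets.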
6. A method of preparing phoneme models for recognition of speech received via telephone lines comprising the steps of:
a) analyzing a training word to generate a frame sequence of acoustic parameter vectors representative thereof, having a parameter indicative of energy level for each frame of the frame sequence, and obtaining a phoneme model sequence of the training word;
b) providing a first set of model parameters representative of a sequence of state-transition models corresponding to each phoneme of the phoneme model sequence in the training word;
c) computing a set of observation probabilities by, for each frame of the frame sequence and each model of the sequence of state-transition models, comparing the energy level of the frame with an energy threshold for the model, the threshold being predetermined relative to noise on the telephone lines, and if the energy level is below the energy threshold, setting the observation probability for the frame to zero, otherwise computing the observation probability for the frame;
d) aligning the frame sequence of acoustic parameter vectors with the sequence of state-transition models to provide a mapping therebetween representative of a path through the sequence of state-transition models having a highest likelihood;
e) accumulating statistics for a plurality of utterances of said training word using the mapping of step d);
f) generating a second set of model parameters representative of the sequence of state-transition models corresponding to the training word;
g) repeating step d) for the second set of model parameters;
h) comparing the likelihood of the first and second sets of model parameters; and
i) repeating steps b) through h), replacing the first set of model parameters by the second set of model parameters when the second set of parameters provides at least a predetermined improvement in likelihood.
Dependent claims: 7, 8, 9, 10, 11
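The energy constraint in step c) of claim 6 can be pictured as a gate applied before any observation probability is computed. The sketch below is an illustration under assumptions, not the patent's implementation: it works in the log domain, so a zero probability becomes negative infinity, and the names `energy_gated_log_obs`, `energies`, `thresholds`, and the `log_prob` callback are hypothetical.

```python
NEG_INF = float("-inf")  # log of a zero probability

def energy_gated_log_obs(energies, thresholds, log_prob):
    """Build a frames-by-models table of log observation probabilities.

    energies[t]   : energy level of frame t
    thresholds[m] : energy threshold for model m (assumed to have been set
                    beforehand relative to the line noise level)
    log_prob(t, m): hypothetical callback returning the ungated log
                    observation probability of frame t under model m
    """
    table = []
    for t, energy in enumerate(energies):
        row = []
        for m, threshold in enumerate(thresholds):
            if energy < threshold:
                row.append(NEG_INF)  # frame too quiet for this model
            else:
                row.append(log_prob(t, m))
        table.append(row)
    return table

# A quiet frame is blocked from the second model but not from the first.
table = energy_gated_log_obs(
    energies=[0.1, 5.0],
    thresholds=[0.0, 1.0],
    log_prob=lambda t, m: -1.0,
)
```

Because blocked cells carry negative infinity, any alignment path passing through them has zero likelihood and is never selected in step d).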
12. A method of preparing phoneme models for recognition of speech received via telephone lines comprising the steps of:
a) analyzing a training word to generate a frame sequence of acoustic parameter vectors representative thereof and to label each frame of the frame sequence as voiced, unvoiced or silence;
b) providing a first set of model parameters representative of a sequence of state-transition models corresponding to the training word, including the step of labelling each model as voiced, unvoiced or silence in dependence upon a phoneme represented by the model and a relative position of the model in the sequence;
c) computing a set of observation probabilities by, for each frame of the frame sequence and each model of the model sequence of state-transition models, comparing a voiced-unvoiced-silence (VUS) label of the frame with a VUS label of the model and if the labels do not match, setting the observation probability for the frame to zero, otherwise computing the observation probability for the frame for the training word and the first set of model parameters;
d) aligning the frame sequence of acoustic parameter vectors with the sequence of state-transition models to provide a mapping therebetween representative of a path through the sequence of state-transition models having a highest likelihood;
e) accumulating statistics for a plurality of utterances of said training word using the mapping of step d);
f) generating a second set of model parameters representative of the sequence of state-transition models corresponding to the training word;
g) repeating step d) for the second set of model parameters;
h) comparing the likelihood of the first and second sets of model parameters; and
i) repeating steps b) through h), replacing the first set of model parameters by the second set of model parameters when the second set of parameters provides at least a predetermined improvement in likelihood.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
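The label comparison in step c) of claim 12 amounts to masking out frame/model pairs whose voiced-unvoiced-silence labels disagree. The following sketch assumes single-character labels ("V", "U", "S") and strict equality as the matching rule, which is the simplest reading of the claim. The function name and the `log_prob` callback are hypothetical.

```python
NEG_INF = float("-inf")  # log of a zero probability

def vus_masked_log_obs(frame_labels, model_labels, log_prob):
    """Return log observation probabilities with VUS mismatches zeroed.

    frame_labels[t]: "V", "U" or "S" label assigned to frame t
    model_labels[m]: "V", "U" or "S" label assigned to model m
    log_prob(t, m) : hypothetical callback giving the unmasked value
    """
    return [
        [log_prob(t, m) if f_label == m_label else NEG_INF
         for m, m_label in enumerate(model_labels)]
        for t, f_label in enumerate(frame_labels)
    ]

# A silence frame can only be explained by the silence model.
masked = vus_masked_log_obs(
    frame_labels=["S", "V"],
    model_labels=["S", "V", "U"],
    log_prob=lambda t, m: -2.0,
)
```

The effect is to restrict the alignment of step d) to paths in which every frame is assigned to a model carrying the same label.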
24. A method of speech recognition for speech received via telephone lines comprising the steps of:
a) analyzing an unknown utterance to generate a frame sequence of acoustic parameter vectors representative thereof;
b) providing a first network representing a recognition vocabulary, wherein each branch of the first network is a model representing a phoneme and each complete path through the first network is a sequence of models representing a word in the recognition vocabulary;
c) providing a second network derived from the first network, in which all sequences of three consecutive phonemes present in the first network are present;
d) computing transitional probabilities for each node of the second network given the frame sequence of acoustic parameter vectors;
e) searching the second network to determine optimal cumulative probabilities for each node of the second network for all frames of the frame sequence;
f) storing the cumulative probabilities as estimate scores for estimating partial paths in the first network;
g) computing point scores for all phonemes in dependence upon the frame sequence of acoustic parameter vectors;
h) determining a complete path through the first network by evaluating successive one phoneme extensions of partial paths using the estimate scores for the nodes of the second network to find the partial path to extend;
wherein the step of determining includes the step of initiating a search through the first network by:
establishing a stack for recording all paths from the end of the first network;
looking along all branches of the network two phonemes;
obtaining estimate scores for each node of the second network corresponding to each two phoneme branch of the first network;
entering each estimate score into the stack that arranges the scores in descending order;
wherein the step of determining includes the steps of expanding a top entry in the stack by:
obtaining point scores for the first phoneme of the two phoneme branch closest to the end of the first network;
for every exit time in the stack entry and for all possible entrance times for the phoneme, determining total actual probability by adding exit scores from the stack entry to point scores for the first phoneme;
computing a new stack entry by adding estimate scores (Pest) for a next two phoneme node of the second network to the total actual probability (Pact) for all possible entrance times, selecting n best total probabilities (Pact + Pest), where n is an integer, and storing the total actual probabilities Pact and frame times for each, together with the best total probability (Pact + Pest) and a phoneme sequence as the new stack entry;
wherein the step of expanding a top entry includes the steps of:
  a) storing a least number, q, of frames used in the estimate score for the top entry of the stack;
  b) prior to expanding any top entry of the stack, determining the number of frames, r, used in its estimate score; and
  c) discarding, from the stack, the top entry when r is greater than the greater of (q+75) and (q+s/2), where s is the length of the unknown utterance in frames.
Dependent claims: 25, 26
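The discard rule that closes claim 24 is simple arithmetic on frame counts, and a small worked example may make the two bounds easier to compare. The sketch below encodes only that rule; the function name `should_discard` is an assumption, and the symbols r, q and s follow the claim.

```python
def should_discard(r, q, s):
    """Apply the pruning test from the final step of claim 24.

    r: number of frames used in the estimate score of the entry under test
    q: least number of frames used in an estimate score for a top entry
    s: length of the unknown utterance, in frames
    Returns True when r exceeds the greater of (q + 75) and (q + s / 2).
    """
    return r > max(q + 75, q + s / 2)

# Short utterance (s = 100): s / 2 = 50, so the fixed bound q + 75 applies.
short_keep = should_discard(r=85, q=10, s=100)   # 85 > 85 is False
short_drop = should_discard(r=86, q=10, s=100)   # 86 > 85 is True

# Long utterance (s = 200): s / 2 = 100, so the bound becomes q + 100.
long_keep = should_discard(r=105, q=10, s=200)   # 105 > 110 is False
long_drop = should_discard(r=111, q=10, s=200)   # 111 > 110 is True
```

For utterances shorter than 150 frames the fixed margin of 75 frames governs; for longer utterances the margin grows to half the utterance length.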
27. Apparatus for speech recognition, comprising:
a) means for analyzing an unknown utterance to generate a frame sequence of acoustic parameter vectors representative thereof;
b) means for providing a first network representing a recognition vocabulary, wherein each branch of the first network is a model representing a phoneme and each complete path through the first network is a sequence of models representing a word in the recognition vocabulary;
c) means for providing a second network derived from the first network, in which all sequences of three consecutive phonemes present in the first network are present;
d) means for computing transitional probabilities for each node of the second network given the frame sequence of acoustic parameter vectors;
e) means for searching the second network to determine optimal cumulative probabilities for each node of the second network for all frames of the frame sequence;
f) means for storing the cumulative probabilities as estimate scores for estimating partial paths in the first network;
g) means for computing point scores for all phonemes in the second network in dependence upon the frame sequence of acoustic parameter vectors; and
h) means for determining a complete path through the first network by evaluating successive one phoneme extensions of partial paths using the estimate scores for the nodes of the second network to find the partial path to extend.
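Element c) requires the second network to contain every sequence of three consecutive phonemes found in the first network. The claim does not say how that network is built, so the sketch below shows only the prerequisite step of enumerating those three-phoneme sequences from a vocabulary. The lexicon format, the phoneme symbols, and the function name `phoneme_trigrams` are all illustrative assumptions.

```python
def phoneme_trigrams(lexicon):
    """Collect every run of three consecutive phonemes in a vocabulary.

    lexicon maps each word to its phoneme sequence (a list of strings).
    Words with fewer than three phonemes contribute no trigram here; how
    such words are handled in the reduced network is left open.
    """
    trigrams = set()
    for phonemes in lexicon.values():
        for i in range(len(phonemes) - 2):
            trigrams.add(tuple(phonemes[i:i + 3]))
    return trigrams

# Hypothetical two-word vocabulary with made-up transcriptions.
lexicon = {
    "cat": ["k", "ae", "t"],
    "cats": ["k", "ae", "t", "s"],
}
shared = phoneme_trigrams(lexicon)
```

Because trigrams shared between words are stored once, the set is typically far smaller than the full vocabulary network, which is consistent with the abstract's description of the second network as a reduced one.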
28. Apparatus for providing information via a telephone network, comprising:
means for accepting a call from an operator via the telephone network;
means for prompting the operator to request information;
means for recognizing speech from the operator to identify a member of a recognition vocabulary, the recognition vocabulary being represented by a first network, wherein each branch represents a phoneme and each complete path through the first network is a sequence of models representing a member of the recognition vocabulary and wherein a second network is derived from the first network, in which all sequences of three consecutive phonemes present in the first network are present;
means for accessing a computerized information source to request information from the computerized information source in dependence upon the member of the recognition vocabulary identified; and
means for synthesizing speech to provide the accessed information to the operator in verbal form.
Specification