Testing and tuning of automatic speech recognition systems using synthetic inputs generated from its acoustic models
First Claim
1. A speech recognition testing system comprising:
a speech recognizer that provides an output based upon a sequence of feature vectors;
a pronunciation tool that provides a pronunciation of a provided text having at least one word, the pronunciation including a plurality of phonemes, the pronunciation tool comprising a pronunciation store that stores pronunciations for words and a text-to-speech synthesizer that generates phonemes from text, the pronunciation tool first accessing the pronunciation store to obtain the pronunciation for words identified in the text and if the pronunciation store does not include the pronunciation, then using the text-to-speech synthesizer to obtain the pronunciation for the text;
a model unit generator that generates a model for each of the plurality of phonemes from the provided pronunciation and selects a sequence of Hidden Markov Model states for a Hidden Markov Model (HMM) representative of each of the generated models, the selected sequence of HMM states being a sequence that the speech recognizer is to choose as a best sequence during recognition of speech that is recognized to generate the text, wherein generating a model includes selecting a plurality of candidate HMMs for at least one of the generated models;
a feature vector data store storing feature vectors;
a vector generator that generates the sequence of feature vectors to be provided to the speech recognizer from the provided pronunciation of the provided text, wherein at least one of the feature vectors is generated by selecting, for each state in the sequence of HMM states, from the feature vector data store, a feature vector that has a closest probability distribution match with a given mixture in a Markov state in one of the generated models, such that the selected feature vectors produce a best score for the text when the selected feature vectors are provided to the speech recognizer during recognition of the text;
formatting the selected feature vectors in a format used by the speech recognizer; and
testing the speech recognizer using the formatted, selected feature vectors.
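The lookup order recited for the pronunciation tool (pronunciation store first, text-to-speech synthesizer only as a fallback) can be sketched as follows. This is a minimal illustration, not the patented implementation: the `LEXICON` contents, the `naive_letter_to_sound` stand-in for the synthesizer's front end, and the choice to fall back word by word are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical toy pronunciation store; a real one would hold many thousands of words.
LEXICON: Dict[str, List[str]] = {
    "hello": ["hh", "ah", "l", "ow"],
    "world": ["w", "er", "l", "d"],
}


def naive_letter_to_sound(word: str) -> List[str]:
    """Stand-in for a text-to-speech front end: one pseudo-phoneme per letter."""
    return [ch for ch in word.lower() if ch.isalpha()]


def pronounce(
    text: str,
    lexicon: Dict[str, List[str]] = LEXICON,
    fallback: Callable[[str], List[str]] = naive_letter_to_sound,
) -> List[str]:
    """Return a flat phoneme sequence for the text.

    Each word is looked up in the store first; the fallback is used only
    when the store has no entry for that word.
    """
    phonemes: List[str] = []
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'")  # drop surrounding punctuation
        if not word:
            continue
        if word in lexicon:
            phonemes.extend(lexicon[word])
        else:
            phonemes.extend(fallback(word))
    return phonemes
```

Passing the store and the fallback as parameters keeps the lookup order testable in isolation: a test can supply an empty store to force every word through the fallback path.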
Abstract
A system and method for testing and tuning a speech recognition system by providing pronunciations to the speech recognizer. First, a text document is provided to the system and converted into a sequence of phonemes representing the words in the text. The phonemes are then converted to model units, such as Hidden Markov Models. From the models, a probability is obtained for each model or state, and feature vectors are determined. For each model, the feature vector matching the most probable vector for each state is selected. These ideal feature vectors are provided to the speech recognizer and processed. The recognizer's output is compared with the original text, and the system can be modified based on that output.
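One way to read "the most probable vector for each state" is sketched below. The sketch assumes each HMM state is a mixture of diagonal-covariance Gaussians, which the abstract does not specify. Under that assumption, a component's most probable vector is its mean, and the state's ideal vector is taken as the mean of the component with the highest weighted peak density. The names `Gaussian`, `ideal_vector`, `ideal_sequence`, and `frames_per_state` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Gaussian:
    """One mixture component with a diagonal covariance (assumed model form)."""
    weight: float
    mean: List[float]
    var: List[float]  # per-dimension variances


def log_peak(g: Gaussian) -> float:
    """Log of (weight * density of the component evaluated at its own mean)."""
    return math.log(g.weight) - 0.5 * sum(math.log(2 * math.pi * v) for v in g.var)


def ideal_vector(state: List[Gaussian]) -> List[float]:
    """Mean of the component whose weighted peak density is highest."""
    best = max(state, key=log_peak)
    return list(best.mean)


def ideal_sequence(states: List[List[Gaussian]], frames_per_state: int = 1) -> List[List[float]]:
    """Emit the ideal vector of each state, repeated frames_per_state times."""
    return [ideal_vector(s) for s in states for _ in range(frames_per_state)]
```

Ranking by weighted peak density rather than by weight alone means a narrow, lower-weight component can win over a broad, higher-weight one. Whether that matches the scoring used by a particular recognizer is an open assumption of this sketch.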
11 Claims
5. A method of testing a speech recognition system, comprising:
receiving a text containing at least one word;
generating a pronunciation for the text with a pronunciation tool, including a plurality of phonemes, by first accessing a pronunciation data store to obtain phonemes indicating pronunciation of the at least one word, and if the pronunciation data store does not contain phonemes for the at least one word, then providing the received text to a text-to-speech synthesizer to obtain the phonemes indicating the pronunciation of the at least one word;
generating a model for each of the phonemes of the pronunciation and selecting a Hidden Markov Model sequence of states for a Hidden Markov Model (HMM) representative of each of the generated models, the selected sequence of HMM states being a sequence that the speech recognizer is to choose as a best sequence during recognition of speech that includes the at least one word, wherein generating a model includes selecting a plurality of candidate HMMs for at least one of the generated models;
generating a sequence of feature vectors for the pronunciation from the model, wherein at least one of the feature vectors is generated by selecting, for each state in the sequence of HMM states, from a feature vector data store, a feature vector that has a closest probability distribution match with a given mixture in a Markov state in one of the generated models, such that the selected feature vectors produce a best score for the text when the selected feature vectors are provided to the speech recognizer during recognition of the at least one word;
providing the sequence of vectors to the speech recognition system; and
outputting text from the speech recognition system, in response to the provided sequence of vectors, for testing evaluation.
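The step of selecting, for each state, a stored feature vector that best matches a given mixture can be illustrated as below. This is only a sketch under stated assumptions: the claim does not define how "closest probability distribution match" is measured, so the example scores each stored vector by its log-likelihood under one diagonal-covariance Gaussian component. The function names and the `(mean, var)` representation of a state's chosen mixture are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def diag_gaussian_loglik(x: Vector, mean: Vector, var: Vector) -> float:
    """Log-density of x under a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def select_from_store(store: Sequence[Vector], mean: Vector, var: Vector) -> int:
    """Index of the stored vector that scores highest under the component."""
    return max(range(len(store)), key=lambda i: diag_gaussian_loglik(store[i], mean, var))


def vectors_for_states(
    store: Sequence[Vector],
    state_mixtures: Sequence[Tuple[Vector, Vector]],
) -> List[Vector]:
    """Pick one stored vector per state, in state order.

    state_mixtures holds one (mean, var) pair per HMM state in the sequence.
    """
    return [store[select_from_store(store, mean, var)] for mean, var in state_mixtures]
```

The resulting list would then be converted to whatever frame format the recognizer under test expects. That conversion is recognizer-specific and is not shown here.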
Specification