Speaker normalization processor apparatus for generating frequency warping function, and speech recognition apparatus with said speaker normalization processor apparatus
Abstract
In a speaker normalization processor apparatus, a vocal-tract configuration estimator estimates feature quantities of a vocal-tract configuration showing an anatomical configuration of a vocal tract of each normalization-target speaker, by looking up a correspondence between vocal-tract configuration parameters and Formant frequencies previously determined based on a vocal tract model of a standard speaker, based on speech waveform data of each normalization-target speaker. A frequency warping function generator estimates a vocal-tract area function of each normalization-target speaker by changing feature quantities of a vocal-tract configuration of the standard speaker based on the feature quantities of the vocal-tract configuration of each normalization-target speaker estimated by the vocal-tract configuration estimator and the feature quantities of the vocal-tract configuration of the standard speaker, estimating Formant frequencies of speech uttered by each normalization-target speaker based on the estimated vocal-tract area function of each normalization-target speaker, and generating a frequency warping function showing a correspondence between input speech frequencies and frequencies after frequency warping.
10 Claims
1. A speaker normalization processor apparatus comprising:
a first storage unit for storing speech waveform data of a plurality of normalization-target speakers and text data corresponding to the speech waveform data;
a second storage unit for storing Formant frequencies of a standard speaker determined based on a vocal-tract area function of the standard speaker;
estimation means for estimating feature quantities of a vocal-tract configuration showing an anatomical configuration of a vocal tract of each normalization-target speaker, by looking up a correspondence between vocal-tract configuration parameters and Formant frequencies previously determined based on a vocal tract model of the standard speaker, based on the speech waveform data of each normalization-target speaker stored in said first storage unit;
function generating means for estimating a vocal-tract area function of each normalization-target speaker by changing feature quantities of a vocal-tract configuration of the standard speaker based on the feature quantities of the vocal-tract configuration of each normalization-target speaker estimated by said estimation means and the feature quantities of the vocal-tract configuration of the standard speaker, estimating Formant frequencies of speech uttered by each normalization-target speaker based on the estimated vocal-tract area function of each normalization-target speaker, and generating a frequency warping function, which shows a correspondence between input speech frequencies and frequencies after frequency warping, and which is used for performing the frequency warping by converting an input speech frequency so that Formant frequencies of speech of each normalization-target speaker after the frequency warping respectively coincide with the corresponding Formant frequencies of the standard speaker stored in said second storage unit.
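The frequency warping function of claim 1 converts an input frequency so that the speaker's Formant frequencies land on the standard speaker's Formant frequencies. A minimal illustrative sketch of such a function, not the patented implementation: it assumes a piecewise-linear warp anchored at the Formant pairs, and the Formant values shown are hypothetical.

```python
import numpy as np

def make_warping_function(speaker_formants, standard_formants, nyquist=8000.0):
    """Piecewise-linear frequency warp: the speaker's k-th Formant maps
    exactly onto the standard speaker's k-th Formant; frequencies in
    between are linearly interpolated."""
    # Anchor the warp at 0 Hz and the Nyquist frequency so it is
    # defined over the whole input band.
    xs = np.concatenate(([0.0], np.asarray(speaker_formants, float), [nyquist]))
    ys = np.concatenate(([0.0], np.asarray(standard_formants, float), [nyquist]))
    return lambda f: np.interp(f, xs, ys)

# Hypothetical Formant values (Hz), for illustration only.
warp = make_warping_function([730.0, 1090.0, 2440.0],   # normalization-target speaker F1..F3
                             [660.0, 1120.0, 2350.0])   # standard speaker F1..F3
```

After warping, the speaker's F1 of 730 Hz coincides with the standard speaker's 660 Hz, as the claim requires of each Formant pair.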
2. A speaker normalization processor apparatus comprising:
a first storage unit for storing speech waveform data of a plurality of training speakers and text data corresponding to the speech waveform data;
a second storage unit for storing Formant frequencies of a standard speaker determined based on a vocal-tract area function of the standard speaker;
first estimation means for estimating feature quantities of a vocal-tract configuration showing an anatomical configuration of a vocal tract of each training speaker, by looking up a correspondence between vocal-tract configuration parameters and Formant frequencies previously determined based on a vocal tract model of the standard speaker, based on the speech waveform data of each training speaker stored in said first storage unit;
first function generating means for estimating a vocal-tract area function of each training speaker by changing feature quantities of a vocal-tract configuration of the standard speaker based on the feature quantities of the vocal-tract configuration of each training speaker estimated by said first estimation means and the feature quantities of the vocal-tract configuration of the standard speaker, estimating Formant frequencies of speech uttered by each training speaker based on the estimated vocal-tract area function of each training speaker, and generating a frequency warping function, which shows a correspondence between input speech frequencies and frequencies after frequency warping, and which is used for performing the frequency warping by converting an input speech frequency so that Formant frequencies of speech of each training speaker after the frequency warping respectively coincide with the corresponding Formant frequencies of the standard speaker stored in said second storage unit;
first feature extraction means for speaker-normalizing speech waveform data of each training speaker stored in said first storage unit, by executing a frequency warping process on the speech waveform data using the frequency warping function of each training speaker generated by said first function generating means, and then extracting predetermined acoustic feature parameters of each training speaker from the speaker-normalized speech waveform data; and
training means for generating a normalized hidden Markov model by training a predetermined initial hidden Markov model using a predetermined training method based on the acoustic feature parameters of each training speaker extracted by said first feature extraction means and the text data stored in said first storage unit.
3. The speaker normalization processor apparatus as claimed in claim 2,
wherein the feature quantities of the vocal-tract configuration include a first length on an oral cavity side and a second length on a pharyngeal cavity side of a vocal tract of a speaker.
4. The speaker normalization processor apparatus as claimed in claim 2,
wherein the acoustic feature parameters include mel-frequency cepstrum coefficients.
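Claim 4 specifies mel-frequency cepstrum coefficients as the acoustic feature parameters extracted after frequency warping (claim 2's first feature extraction means). The following is an illustrative sketch only, not the patented extractor: frame length, filterbank size, and the per-frame processing are assumptions, and the warp is applied to the magnitude spectrum by interpolation.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, float) / 2595.0) - 1.0)

def warped_mfcc(frame, warp, sr=16000, n_mels=24, n_ceps=13):
    """Sketch: frequency-warp one frame's magnitude spectrum, then
    compute mel-frequency cepstrum coefficients from it."""
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(n_fft)))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # Speaker normalization: the normalized spectrum at frequency g is
    # the original spectrum at the f with warp(f) = g (piecewise
    # inverse via interpolation, assuming warp is monotonic).
    spec = np.interp(freqs, warp(freqs), spec)
    # Triangular mel filterbank spanning 0 Hz .. Nyquist.
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fbank = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, mid, hi = mel_pts[i], mel_pts[i + 1], mel_pts[i + 2]
        fbank[i] = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                      (hi - freqs) / (hi - mid)), 0.0, None)
    log_mel = np.log(fbank @ spec + 1e-10)
    # DCT-II decorrelates the log filterbank energies -> cepstrum.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_mels)
    return dct @ log_mel
```

With the identity warp `lambda f: f` this reduces to ordinary MFCC extraction; passing a speaker's warping function yields the speaker-normalized parameters used for HMM training.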
5. A speech recognition apparatus comprising:
a first storage unit for storing speech waveform data of a plurality of training speakers and text data corresponding to the speech waveform data;
a second storage unit for storing Formant frequencies of a standard speaker determined based on a vocal-tract area function of the standard speaker;
first estimation means for estimating feature quantities of a vocal-tract configuration showing an anatomical configuration of a vocal tract of each training speaker, by looking up a correspondence between vocal-tract configuration parameters and Formant frequencies previously determined based on a vocal tract model of the standard speaker, based on the speech waveform data of each training speaker stored in said first storage unit;
first function generating means for estimating a vocal-tract area function of each training speaker by changing feature quantities of a vocal-tract configuration of the standard speaker based on the feature quantities of the vocal-tract configuration of each training speaker estimated by said first estimation means and the feature quantities of the vocal-tract configuration of the standard speaker, estimating Formant frequencies of speech uttered by each training speaker based on the estimated vocal-tract area function of each training speaker, and generating a frequency warping function, which shows a correspondence between input speech frequencies and frequencies after frequency warping, and which is used for performing the frequency warping by converting an input speech frequency so that Formant frequencies of speech of each training speaker after the frequency warping respectively coincide with the corresponding Formant frequencies of the standard speaker stored in said second storage unit;
first feature extraction means for speaker-normalizing speech waveform data of each training speaker stored in said first storage unit, by executing a frequency warping process on the speech waveform data using the frequency warping function of each training speaker generated by said first function generating means, and then extracting predetermined acoustic feature parameters of each training speaker from the speaker-normalized speech waveform data;
training means for generating a normalized hidden Markov model by training a predetermined initial hidden Markov model using a predetermined training method based on the acoustic feature parameters of each training speaker extracted by said first feature extraction means and the text data stored in said first storage unit;
second estimation means for estimating feature quantities of a vocal-tract configuration showing an anatomical configuration of a vocal tract of a speech-recognition speaker, by looking up a correspondence between vocal-tract configuration parameters and Formant frequencies previously determined based on a vocal tract model of the standard speaker, based on input speech waveform data for adaptation of a speech-recognition speaker;
second function generating means for estimating a vocal-tract area function of each speech-recognition speaker by changing the feature quantities of the vocal-tract configuration of the standard speaker based on the feature quantities of the vocal-tract configuration of the speech-recognition speaker estimated by said second estimation means and the feature quantities of the vocal-tract configuration of the standard speaker, estimating Formant frequencies of speech uttered by each speech-recognition speaker based on the estimated vocal-tract area function of each speech-recognition speaker, and generating a frequency warping function of the speech-recognition speaker, which shows a correspondence between input speech frequencies and frequencies after frequency warping, and which is used for performing the frequency warping by converting an input speech frequency so that Formant frequencies of speech of each speech-recognition speaker after the frequency warping respectively coincide with the corresponding Formant frequencies of the standard speaker stored in said second storage unit;
a third storage unit for storing the frequency warping function of a speech-recognition speaker generated by said second function generating means;
second feature extraction means for speaker-normalizing speech waveform data of speech uttered by a speech-recognition speaker to be recognized by executing a frequency warping process on the speech waveform data using the frequency warping function of the speech-recognition speaker stored in said third storage unit, and then extracting predetermined acoustic feature parameters of the speech-recognition speaker from the speaker-normalized speech waveform data; and
speech recognition means for recognizing the input speech uttered by the speech-recognition speaker by using a hidden Markov model generated by said training means based on the acoustic feature parameters extracted by said second feature extraction means, and then outputting a result of the speech recognition.
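The estimation means in claims 1, 2, and 5 obtain a speaker's vocal-tract configuration by looking up a correspondence, precomputed from the standard speaker's vocal tract model, between configuration parameters and Formant frequencies. One simple way to realize such a lookup is a nearest-neighbor search over a table; the sketch below assumes that approach, and both the table entries and the parameter meanings (oral-cavity and pharyngeal-cavity lengths in cm) are hypothetical.

```python
import numpy as np

# Hypothetical lookup table: each row pairs vocal-tract configuration
# parameters (oral-cavity length, pharyngeal-cavity length; cm) with
# the Formant frequencies (Hz) the vocal tract model predicts for them.
TABLE_PARAMS = np.array([[8.0, 8.5], [8.5, 9.0], [9.0, 9.5], [9.5, 10.0]])
TABLE_FORMANTS = np.array([[760.0, 1150.0],
                           [720.0, 1100.0],
                           [690.0, 1060.0],
                           [660.0, 1020.0]])

def estimate_vocal_tract_config(measured_formants):
    """Estimate a speaker's vocal-tract configuration as the table
    entry whose predicted Formants are closest (Euclidean distance)
    to the Formants measured from the speaker's waveform data."""
    d = np.linalg.norm(TABLE_FORMANTS - np.asarray(measured_formants, float),
                       axis=1)
    return TABLE_PARAMS[np.argmin(d)]
```

The estimated parameters then drive the function generating means: the standard speaker's vocal-tract area function is deformed toward them before the speaker's Formants, and hence the warping function, are derived.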
6. A speaker normalization processor apparatus comprising:
a first storage unit for storing speech waveform data of a plurality of normalization-target speakers and text data corresponding to the speech waveform data;
a second storage unit for storing Formant frequencies of a standard speaker determined based on a vocal-tract area function of the standard speaker;
estimation means for estimating feature quantities of a vocal-tract configuration showing an anatomical configuration of a vocal tract of each normalization-target speaker, for each of predetermined similar phoneme contexts that are similar in acoustic features to each other, by looking up a correspondence between vocal-tract configuration parameters and Formant frequencies previously determined based on a vocal tract model of the standard speaker, based on the speech waveform data of each normalization-target speaker stored in said first storage unit;
function generating means for estimating, for each of the similar phoneme contexts, a vocal-tract area function of each normalization-target speaker by changing feature quantities of a vocal-tract configuration of the standard speaker based on the feature quantities of the vocal-tract configuration of each normalization-target speaker estimated for each of the similar phoneme contexts by said estimation means and the feature quantities of the vocal-tract configuration of the standard speaker, estimating Formant frequencies of speech uttered by each normalization-target speaker based on the vocal-tract area function of each normalization-target speaker estimated for each of the similar phoneme contexts, and generating for each of the similar phoneme contexts a frequency warping function, which shows a correspondence between input speech frequencies and frequencies after frequency warping, and which is used for performing the frequency warping by converting an input speech frequency so that Formant frequencies of speech of each normalization-target speaker after the frequency warping respectively coincide with the corresponding Formant frequencies of the standard speaker stored in said second storage unit.
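Claim 6 differs from claim 1 in generating a separate warping function for each similar phoneme context rather than one per speaker. A minimal illustrative sketch of that per-context structure, with hypothetical context labels and Formant values (not taken from the patent):

```python
import numpy as np

def make_warp(speaker_formants, standard_formants, nyquist=8000.0):
    """Piecewise-linear warp anchored at the given Formant pairs."""
    xs = np.concatenate(([0.0], speaker_formants, [nyquist]))
    ys = np.concatenate(([0.0], standard_formants, [nyquist]))
    return lambda f: np.interp(f, xs, ys)

# One frequency warping function per similar phoneme context.
# Context labels and Formant values (Hz) are hypothetical examples.
context_warps = {
    "a": make_warp([730.0, 1090.0], [660.0, 1120.0]),
    "i": make_warp([270.0, 2290.0], [300.0, 2200.0]),
}
```

Because vowels with different articulations shift a speaker's Formants differently, a per-context warp can align each context's Formants with the standard speaker's more closely than a single speaker-level warp.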
7. A speech recognition apparatus comprising:
a first storage unit for storing speech waveform data of a plurality of training speakers and text data corresponding to the speech waveform data;
a second storage unit for storing, for each of predetermined similar phoneme contexts that are similar in acoustic features to one another, Formant frequencies of a standard speaker determined based on a vocal-tract area function of the standard speaker;
first estimation means for estimating, for each of the similar phoneme contexts, feature quantities of a vocal-tract configuration showing an anatomical configuration of a vocal tract of each training speaker, by looking up a correspondence between vocal-tract configuration parameters and Formant frequencies previously determined based on a vocal tract model of the standard speaker, based on the speech waveform data of each training speaker stored in said first storage unit;
first function generating means for estimating, for each of the similar phoneme contexts, a vocal-tract area function of each training speaker by changing feature quantities of a vocal-tract configuration of the standard speaker based on the feature quantities of the vocal-tract configuration of each training speaker estimated for each of the similar phoneme contexts by said first estimation means and the feature quantities of the vocal-tract configuration of the standard speaker, estimating for each of the similar phoneme contexts, Formant frequencies of speech uttered by each training speaker based on the vocal-tract area function of each training speaker estimated for each of the similar phoneme contexts, and generating for each of the similar phoneme contexts a frequency warping function, which shows a correspondence between input speech frequencies and frequencies after frequency warping, and which is used for performing the frequency warping by converting an input speech frequency so that Formant frequencies of speech of each training speaker after the frequency warping respectively coincide with the corresponding Formant frequencies of the standard speaker stored in said second storage unit;
first feature extraction means for speaker-normalizing the speech waveform data of each training speaker stored in said first storage unit by executing a frequency warping process on the speech waveform data using the frequency warping function of each training speaker generated for each of the similar phoneme contexts by said first function generating means, and then extracting, for each of the similar phoneme contexts, predetermined acoustic feature parameters of each training speaker from the speaker-normalized speech waveform data; and
training means for generating a normalized hidden Markov model by training a predetermined initial hidden Markov model using a predetermined training method based on the acoustic feature parameters of each training speaker extracted for each of the similar phoneme contexts by said first feature extraction means and the text data stored in said first storage unit.
8. The speech recognition apparatus as claimed in claim 7,
wherein the feature quantities of the vocal-tract configuration include parameters of vocal-tract cross sections ranging from an oral cavity side to a pharyngeal cavity side of a vocal tract of a speaker.
9. The speaker normalization processor apparatus as claimed in claim 7,
wherein the similar phoneme context includes at least one of a vowel, a phoneme, and a hidden Markov model state.
10. A speech recognition apparatus comprising:
a first storage unit for storing speech waveform data of a plurality of training speakers and text data corresponding to the speech waveform data;
a second storage unit for storing, for each of predetermined similar phoneme contexts that are similar in acoustic features to one another, Formant frequencies of a standard speaker determined based on a vocal-tract area function of the standard speaker;
first estimation means for estimating, for each of the similar phoneme contexts, feature quantities of a vocal-tract configuration showing an anatomical configuration of a vocal tract of each training speaker, by looking up a correspondence between vocal-tract configuration parameters and Formant frequencies previously determined based on a vocal tract model of the standard speaker, based on the speech waveform data of each training speaker stored in said first storage unit;
first function generating means for estimating, for each of the similar phoneme contexts, a vocal-tract area function of each training speaker by changing feature quantities of a vocal-tract configuration of the standard speaker based on the feature quantities of the vocal-tract configuration of each training speaker estimated for each of the similar phoneme contexts by said first estimation means and the feature quantities of the vocal-tract configuration of the standard speaker, estimating for each of the similar phoneme contexts, Formant frequencies of speech uttered by each training speaker based on the vocal-tract area function of each training speaker estimated for each of the similar phoneme contexts, and generating for each of the similar phoneme contexts a frequency warping function, which shows a correspondence between input speech frequencies and frequencies after frequency warping, and which is used for performing the frequency warping by converting an input speech frequency so that Formant frequencies of speech of each training speaker after the frequency warping respectively coincide with the corresponding Formant frequencies of the standard speaker stored in said second storage unit;
first feature extraction means for speaker-normalizing the speech waveform data of each training speaker stored in said first storage unit by executing a frequency warping process on the speech waveform data using the frequency warping function of each training speaker generated for each of the similar phoneme contexts by said first function generating means, and then extracting, for each of the similar phoneme contexts, predetermined acoustic feature parameters of each training speaker from the speaker-normalized speech waveform data;
training means for generating a normalized hidden Markov model by training a predetermined initial hidden Markov model using a predetermined training method based on the acoustic feature parameters of each training speaker extracted for each of the similar phoneme contexts by said first feature extraction means and the text data stored in said first storage unit;
second estimation means for estimating, for each of the similar phoneme contexts, feature quantities of a vocal-tract configuration showing an anatomical configuration of a vocal tract of a speech-recognition speaker, by looking up a correspondence between vocal-tract configuration parameters and Formant frequencies previously determined based on a vocal tract model of the standard speaker, based on input speech waveform data for adaptation of a speech-recognition speaker;
second function generating means for estimating, for each of the similar phoneme contexts, a vocal-tract area function of each speech-recognition speaker by converting the feature quantities of the vocal-tract configuration of the standard speaker based on the feature quantities of the vocal-tract configuration of the speech-recognition speaker estimated for each of the similar phoneme contexts by said second estimation means and the feature quantities of the vocal-tract configuration of the standard speaker, estimating for each of the similar phoneme contexts, Formant frequencies of speech uttered by each speech-recognition speaker based on the vocal-tract area function of each speech-recognition speaker estimated for each of the similar phoneme contexts, generating for each of the similar phoneme contexts a frequency warping function of the speech-recognition speaker, which shows a correspondence between input speech frequencies and frequencies after frequency warping, and which is used for performing the frequency warping by converting an input speech frequency so that Formant frequencies of speech of each speech-recognition speaker after the frequency warping respectively coincide with the corresponding Formant frequencies of the standard speaker stored in said second storage unit, and further generating information as to correspondence between the similar phoneme contexts and the frequency warping functions;
a third storage unit for storing the frequency warping function of a speech-recognition speaker generated for each of the similar phoneme contexts by said second function generating means;
a fourth storage unit for storing the information as to the correspondence between the similar phoneme contexts and the frequency warping functions of the speech-recognition speaker generated by said second function generating means;
second feature extraction means for speaker-normalizing the speech waveform data of speech uttered by a speech-recognition speaker to be recognized by executing a frequency warping process on the speech waveform data using the frequency warping function of the speech-recognition speaker stored for each of the similar phoneme contexts in said third storage unit, and then extracting for each of the similar phoneme contexts predetermined acoustic feature parameters of the speech-recognition speaker from the speaker-normalized speech waveform data; and
speech recognition means for recognizing the input speech uttered by the speech-recognition speaker by looking up to the information as to the correspondence between the similar phoneme contexts and the frequency warping functions of the speech-recognition speaker stored in said fourth storage unit, and by using a hidden Markov model generated by said training means based on the acoustic feature parameters extracted for each of the similar phoneme contexts by said second feature extraction means, and then outputting a result of the speech recognition.
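In claim 10, the third storage unit holds the per-context warping functions and the fourth storage unit holds the correspondence from similar phoneme contexts to those functions, which the recognition means consults at decode time. The sketch below is an illustrative data layout only, not the claimed apparatus; the dictionary names, the warp scale factors, and the identity-warp fallback for unseen contexts are all assumptions.

```python
# Hypothetical stores mirroring the third and fourth storage units:
# warping functions keyed by an id, and a correspondence from similar
# phoneme contexts to those ids. Scale factors are illustrative.
warp_store = {0: lambda f: 0.95 * f, 1: lambda f: 1.05 * f}   # "third storage unit"
context_to_warp = {"a": 0, "i": 1, "u": 1}                    # "fourth storage unit"

def warp_for_context(context, default=lambda f: f):
    """Pick the warping function for a hypothesized phoneme context;
    fall back to the identity warp for contexts without an entry."""
    wid = context_to_warp.get(context)
    return warp_store[wid] if wid is not None else default
```

During recognition, each hypothesized context selects its own warp before feature extraction, so the normalized parameters scored against the hidden Markov model stay consistent with the context-dependent warps used in training.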
Specification