Self-learning speaker adaptation based on spectral variation source decomposition
First Claim
1. A self-learning speaker adaptation method for automatic speech recognition comprising:
providing training speech from a plurality of training speakers;
transforming the training speech into a spectral domain such that each training speech utterance is represented by a sequence of training speech spectra;
building a set of Gaussian density phone models from the spectra of all training speech;
estimating a spectral bias indicative of speaker acoustic characteristics for each speech utterance using said set of Gaussian density phone models;
normalizing the training speech spectra based on speaker acoustic characteristics using said spectral bias;
building a plurality of Gaussian mixture density phone models having model parameters including covariance matrices, mean vectors and mixture weights, using the normalized training speech spectra, for use in recognizing speech;
transforming a first utterance of speech into a spectral domain;
estimating a spectral bias indicative of speaker acoustic characteristics for the first utterance of speech using said set of Gaussian density phone models;
normalizing the first utterance of speech spectra using said spectral bias;
recognizing the normalized first utterance of speech spectra to produce a recognized word string;
segmenting the first utterance of speech spectra using said recognized word string to produce segmented adaptation data;
modifying the model parameters based on said segmented adaptation data to produce a set of adapted Gaussian mixture density phone models; and
repeating the transforming, estimating, normalizing, recognizing, segmenting and modifying steps for subsequent utterances, using for each subsequent utterance the adapted Gaussian mixture density phone models produced from the previous utterance, whereby the Gaussian mixture density phone models are automatically adapted to that speaker in self-learning fashion.
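The spectral-bias estimation and normalization steps recited above can be sketched as follows. This is a minimal illustrative sketch only, assuming log-spectral frames and diagonal-covariance Gaussian phone models with a maximum-likelihood bias estimate; the function names and the per-frame Gaussian alignment are hypothetical and are not specified by the claim:

```python
import numpy as np

def estimate_spectral_bias(frames, means, variances):
    """ML estimate of a per-utterance spectral bias b, assuming each
    log-spectral frame x_t was generated by N(mu_t + b, var_t) with
    diagonal covariance. The closed-form solution is a precision-weighted
    average of the per-frame residuals (x_t - mu_t)."""
    w = 1.0 / variances                              # per-frame, per-dim precisions
    return (w * (frames - means)).sum(axis=0) / w.sum(axis=0)

def normalize(frames, bias):
    """Remove the estimated speaker bias from every frame."""
    return frames - bias
```

With equal variances this reduces to subtracting the average residual, i.e. a cepstral/spectral mean-bias removal; unequal variances weight reliable dimensions more heavily.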
Abstract
A self-learning speaker adaptation method for automatic speech recognition is provided. The method includes building a plurality of Gaussian mixture density phone models for use in recognizing speech. The Gaussian mixture density phone models are used to recognize a first utterance of speech from a given speaker. After the first utterance of speech has been recognized, the recognized first utterance of speech is used to adapt the Gaussian mixture density phone models for use in recognizing a subsequent utterance of speech from that same speaker, whereby the Gaussian mixture density phone models are automatically adapted to that speaker in self-learning fashion to thereby produce a plurality of adapted Gaussian mixture density phone models.
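The recognize-then-adapt cycle described in the abstract can be sketched as a simple control loop. In this hypothetical sketch the claimed operations (transforming, bias estimation, recognition, segmentation, model adaptation) are passed in as callables, since the patent leaves their internal implementations open:

```python
def self_learning_adaptation(utterances, models, steps):
    """Unsupervised adaptation loop: each utterance is recognized with the
    current models, and the recognition result itself supplies the
    adaptation data for the next utterance (no transcriptions needed).
    `steps` maps step names to hypothetical callables standing in for
    the claimed operations."""
    results = []
    for utterance in utterances:
        spectra = steps["transform"](utterance)        # to spectral domain
        bias = steps["estimate_bias"](spectra)         # speaker bias
        normalized = spectra - bias                    # normalization
        words = steps["recognize"](normalized, models) # hypothesized word string
        segments = steps["segment"](normalized, words) # align frames to phones
        models = steps["adapt"](models, segments)      # update GMM parameters
        results.append(words)
    return results, models
```

The key design point is that the adapted models produced from utterance n are the ones used to recognize utterance n+1, so recognition accuracy and model fit improve together without any supervised enrollment.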
11 Claims
Specification