Self-learning speaker adaptation based on spectral variation source decomposition
First Claim
1. A self-learning speaker adaptation method for automatic speech recognition comprising:
providing training speech from a plurality of training speakers;
transforming the training speech into a spectral domain such that each training speech utterance is represented by a sequence of training speech spectra;
building a set of Gaussian density phone models from the spectra of all training speech;
estimating a spectral bias indicative of speaker acoustic characteristics for each speech utterance using said set of Gaussian density phone models;
normalizing the training speech spectra based on speaker acoustic characteristics using said spectral bias;
building a plurality of Gaussian mixture density phone models having model parameters including covariance matrices, mean vectors and mixture weights, using the normalized training speech spectra, for use in recognizing speech;
transforming a first utterance of speech into a spectral domain;
estimating a spectral bias indicative of speaker acoustic characteristics for the first utterance of speech using said set of Gaussian density phone models;
normalizing the first utterance of speech spectra using said spectral bias;
recognizing the normalized first utterance of speech spectra to produce a recognized word string;
segmenting the first utterance of speech spectra using said recognized word string to produce segmented adaptation data;
modifying the model parameters based on said segmented adaptation data to produce a set of adapted Gaussian mixture density phone models; and
repeating the transforming, estimating, normalizing, recognizing, segmenting and modifying steps for subsequent utterances, using for each subsequent utterance the adapted Gaussian mixture density phone models produced from the previous utterance, whereby the Gaussian mixture density phone models are automatically adapted to that speaker in self-learning fashion.
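The spectral-bias estimation and normalization steps recited above can be sketched as follows. This is a minimal illustrative sketch only, assuming log-spectral frames and diagonal-covariance Gaussian phone models with a maximum-likelihood bias estimate; the function names and the per-frame Gaussian alignment are hypothetical and are not specified by the claim:

```python
import numpy as np

def estimate_spectral_bias(frames, means, variances):
    """ML estimate of a per-utterance spectral bias b, assuming each
    log-spectral frame x_t was generated by N(mu_t + b, var_t) with
    diagonal covariance. The closed-form solution is a precision-weighted
    average of the per-frame residuals (x_t - mu_t)."""
    w = 1.0 / variances                              # per-frame, per-dim precisions
    return (w * (frames - means)).sum(axis=0) / w.sum(axis=0)

def normalize(frames, bias):
    """Remove the estimated speaker bias from every frame."""
    return frames - bias
```

With equal variances this reduces to subtracting the average residual, i.e. a cepstral/spectral mean-bias removal; unequal variances weight reliable dimensions more heavily.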
Abstract
A self-learning speaker adaptation method for automatic speech recognition is provided. The method includes building a plurality of Gaussian mixture density phone models for use in recognizing speech. The Gaussian mixture density phone models are used to recognize a first utterance of speech from a given speaker. After the first utterance of speech has been recognized, the recognized first utterance of speech is used to adapt the Gaussian mixture density phone models for use in recognizing a subsequent utterance of speech from that same speaker, whereby the Gaussian mixture density phone models are automatically adapted to that speaker in self-learning fashion to thereby produce a plurality of adapted Gaussian mixture density phone models.
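The recognize-then-adapt cycle described in the abstract can be sketched as a simple control loop. In this hypothetical sketch the claimed operations (transforming, bias estimation, recognition, segmentation, model adaptation) are passed in as callables, since the patent leaves their internal implementations open:

```python
def self_learning_adaptation(utterances, models, steps):
    """Unsupervised adaptation loop: each utterance is recognized with the
    current models, and the recognition result itself supplies the
    adaptation data for the next utterance (no transcriptions needed).
    `steps` maps step names to hypothetical callables standing in for
    the claimed operations."""
    results = []
    for utterance in utterances:
        spectra = steps["transform"](utterance)        # to spectral domain
        bias = steps["estimate_bias"](spectra)         # speaker bias
        normalized = spectra - bias                    # normalization
        words = steps["recognize"](normalized, models) # hypothesized word string
        segments = steps["segment"](normalized, words) # align frames to phones
        models = steps["adapt"](models, segments)      # update GMM parameters
        results.append(words)
    return results, models
```

The key design point is that the adapted models produced from utterance n are the ones used to recognize utterance n+1, so recognition accuracy and model fit improve together without any supervised enrollment.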
11 Claims
Specification