Self-learning speaker adaptation based on spectral bias source decomposition, using very short calibration speech

US 5,794,192 A
Filed: 09/12/1996
Issued: 08/11/1998
Est. Priority Date: 04/29/1993
Status: Expired due to Fees

1 Assignment

0 Petitions

Abstract

A speaker adaptation technique based on the separation of speech spectra variation sources is developed for improving speaker-independent continuous speech recognition. The variation sources include speaker acoustic characteristics and the contextual dependency of allophones. Statistical methods are formulated to normalize speech spectra based on speaker acoustic characteristics and then adapt mixture Gaussian density phone models based on speaker phonologic characteristics. Adaptation experiments using short calibration speech (5 sec./speaker) have shown substantial performance improvement over the baseline recognition system.
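The acoustic normalization described in the abstract can be illustrated with a small sketch. It is not taken from the patent specification, which is not reproduced on this page. It assumes that the spectral bias is a single additive vector per speaker in a cepstrum-like feature domain, that the phone models are diagonal-covariance Gaussians with equal priors, and that the bias is re-estimated by a few EM-style passes. The function names `estimate_spectral_bias` and `remove_bias` and the parameter `n_iter` are hypothetical.

```python
import math


def estimate_spectral_bias(frames, means, variances, n_iter=5):
    """Estimate one additive bias vector for a speaker (illustrative assumption).

    frames:    list of feature vectors from the calibration speech
    means:     list of phone-model mean vectors (one Gaussian per phone)
    variances: list of diagonal variance vectors, aligned with `means`
    """
    dim = len(frames[0])
    bias = [0.0] * dim
    for _ in range(n_iter):
        num = [0.0] * dim
        den = [0.0] * dim
        for x in frames:
            # Log-likelihood of the bias-corrected frame under each Gaussian.
            logp = []
            for mu, var in zip(means, variances):
                ll = 0.0
                for d in range(dim):
                    diff = x[d] - bias[d] - mu[d]
                    ll -= 0.5 * (math.log(2.0 * math.pi * var[d]) + diff * diff / var[d])
                logp.append(ll)
            # Posterior of each Gaussian, computed stably (equal priors assumed).
            top = max(logp)
            unnorm = [math.exp(lp - top) for lp in logp]
            total = sum(unnorm)
            for u, mu, var in zip(unnorm, means, variances):
                post = u / total
                for d in range(dim):
                    # Precision-weighted residual between frame and model mean.
                    num[d] += post * (x[d] - mu[d]) / var[d]
                    den[d] += post / var[d]
        bias = [n / d for n, d in zip(num, den)]
    return bias


def remove_bias(frames, bias):
    """Normalize frames by subtracting the estimated bias vector."""
    return [[x_d - b_d for x_d, b_d in zip(x, bias)] for x in frames]
```

In this sketch the same `bias` vector estimated from a speaker's calibration speech would also be subtracted from that speaker's test speech, mirroring steps i, j and n of claim 1.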

38 Citations

16 Claims

1. A speech recognition method comprising the steps of:
- a. providing training speech that includes a passage of calibration speech for each training speaker;
  
  b. representing the training speech in a spectral domain such that each training speech utterance is represented by a sequence of training speech spectra;
  
  c. building a first set of Gaussian density phone models from the spectra of all calibration speech;
  
  d. estimating a spectral bias indicative of speaker acoustic characteristics for each calibration speech using said first set of Gaussian density phone models;
  
  e. normalizing the training speech spectra based on speaker acoustic characteristics using said spectral bias;
  
  f. building a second set of Gaussian mixture density phone models having parameters of mean vectors, covariance matrices and mixture weights from said normalized training speech spectra;
  
  g. taking a passage of calibration speech from each speaker;
  
  h. representing the calibration speech in a spectral domain such that each calibration speech utterance is represented by a sequence of speech spectra;
  
  i. estimating a spectral bias indicative of speaker acoustic characteristics for each calibration speech using said second set of Gaussian mixture density phone models built in step f;
  
  j. normalizing the calibration speech spectra based on speaker acoustic characteristics using said spectral bias;
  
  k. adapting the phone model parameters based on speaker phonologic characteristics using the normalized calibration speech, where context modulation vectors are estimated between Gaussian densities in each mixture, and the context modulation vectors are used to shift the spectra of the calibration speech;
  
  l. providing test speech for speech recognition;
  
  m. representing the test speech in a spectral domain such that the test speech is represented by a sequence of test speech spectra;
  
  n. normalizing the test speech spectra based on speaker acoustic characteristics using said spectral bias;
  
  o. using the normalized test speech spectra in conjunction with the adapted Gaussian mixture density phone models to recognize the test speech.
- Dependent claims:
  - 2. The method of claim 1 wherein the step of providing training speech is performed by providing sample speech from a plurality of persons that includes calibration speech consisting of the same predefined set of words.
  - 3. The method of claim 2 wherein the predefined set of words is uttered in continuous speech fashion.
  - 4. The method of claim 1 wherein said step of representing the training speech in a spectral domain comprises extracting PLP cepstrum coefficients indicative of phonetic features of the speech.
  - 5. The method of claim 1 wherein said step of representing the training speech in a spectral domain comprises extracting first-order temporal regression coefficients to represent dynamic features of the speech.
  - 6. The method of claim 1 wherein said normalizing steps are performed by estimating the spectral deviation vector and subsequently removing said vector from the speech spectra.
  - 7. The method of claim 1 wherein the step of normalizing the training speech spectra is performed by estimating the parameters of unimodal Gaussian density phone models.
  - 8. The method of claim 7 further comprising using said phone models to estimate the spectral deviation vector for each of the speakers and subsequently removing said vector from the speech spectra for each of the speakers.
  - 9. The method of claim 1 wherein the step of normalizing the training speech spectra is performed by:
    - (1) generating a set of unimodal Gaussian density phone models from the calibration speech; and
      
      (2) using said set of unimodal Gaussian density phone models to estimate the spectral deviation vector for each of the speakers and subsequently removing said spectral deviation vector from the speech spectra for each of the speakers.
  - 10. The method of claim 1 wherein the step of adapting the phone model parameters is performed by modifying the Gaussian mixture density parameters based on context-modulated acoustically normalized calibration speech from a specific speaker.
  - 11. The method of claim 10 wherein the context-modulated calibration speech is generated by subtracting context modulation vectors from the calibration speech of said specific speaker.
  - 12. The method of claim 11 wherein said context modulation vectors are estimated based on training data from a plurality of training speakers and said Gaussian mixture density phone model.
  - 13. The method of claim 11 wherein said subtracting comprises subtracting a context modulation vector from a segment of a phone unit in the calibration speech of said specific speaker for each Gaussian density in the Gaussian mixture density for the state of the phone unit.
  - 14. The method of claim 13 wherein the phone unit segment is obtained by an automatic segmentation of the calibration speech of the specific speaker.
  - 15. The method of claim 13 wherein the phone unit segment is obtained by Viterbi segmentation of the calibration speech of the specific speaker.
  - 16. The method of claim 10 wherein said step of modifying the Gaussian mixture density parameters is performed by Bayesian estimation.
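
Claims 10 and 16 describe modifying the Gaussian mixture density parameters by Bayesian estimation. The sketch below shows one widely used form of such an update, a maximum a posteriori (MAP) re-estimate of the mixture means in which the prior mean is interpolated with the posterior-weighted average of the adaptation frames. It is an illustration under stated assumptions and not the patent's own formulas, which are not reproduced on this page. The frames are assumed to be already acoustically normalized and context-modulated, covariances are assumed diagonal, only the means are updated, and the prior weight `tau` and the function names are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def map_adapt_means(frames, weights, means, variances, tau=10.0):
    """Return MAP-adapted means; weights and variances are left unchanged.

    tau is a hypothetical prior weight: larger values keep the adapted
    means closer to the speaker-independent means.
    """
    dim = len(means[0])
    occ = [0.0] * len(means)             # soft frame counts per density
    acc = [[0.0] * dim for _ in means]   # posterior-weighted frame sums
    for x in frames:
        logp = [
            math.log(w) + log_gauss(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        # Posterior of each density given the frame, computed stably.
        top = max(logp)
        unnorm = [math.exp(lp - top) for lp in logp]
        total = sum(unnorm)
        for k, u in enumerate(unnorm):
            post = u / total
            occ[k] += post
            for d in range(dim):
                acc[k][d] += post * x[d]
    # Interpolate between the prior mean and the data, per density.
    return [
        [(tau * m_d + a_d) / (tau + n_k) for m_d, a_d in zip(m, a)]
        for m, a, n_k in zip(means, acc, occ)
    ]
```

With very short calibration speech, most densities receive few frames, so an update of this kind leaves them close to their prior means and moves only the densities that the calibration data actually covers.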

Current Assignee: Matsushita Electric Corporation Of America (Panasonic Holdings Corporation)
Original Assignee: Panasonic Technologies, Inc. (Panasonic Holdings Corporation)
Inventors: Zhao, Yunxin
Primary Examiner(s): Hudspeth, David R.
Assistant Examiner(s): Smits, Talivaldis Ivars
Application Number: US08/712,802
Time in Patent Office: 698 Days
Field of Search: 704/240, 704/242, 704/244, 704/254
US Class Current: 704/244
CPC Class Codes:
G10L 15/063   Training
G10L 15/065   Adaptation
G10L 15/07   to the speaker
G10L 15/10   using distance or distortio...
G10L 15/20   Speech recognition techniqu...
G10L 21/02   Speech enhancement, e.g. no...
