Hierarchial subband linear predictive cepstral features for HMM-based speech recognition

US 6,292,776 B1
Filed: 03/12/1999
Issued: 09/18/2001
Est. Priority Date: 03/12/1999
Status: Expired due to Term

Abstract

A method and apparatus for first training and then recognizing speech. The method and apparatus use subband cepstral features to improve the recognition string accuracy rates for speech inputs.

20 Claims

1. A training method for a speech recognizer comprising the steps of:
- receiving a band limited voice input utterance that is time varying;
  
  transforming said utterance using a fast fourier transform process to a frequency domain spectrum;
  
  forwarding said frequency domain spectrum to a plurality of mel filter banks, at least one of said plurality of mel filter banks having a plurality of sub-bands filtering said frequency spectrum;
  
  transforming an output of each of said plurality of mel-filter banks using an inverse discrete fourier transform process to obtain a processed speech output that is time varying from each of said mel-filter banks and an additional time varying output for each sub-band above one for each mel-filter bank;
  
  analyzing each output of each of time varying outputs of each inverse discrete fourier transform process using a respective linear prediction cepstral analysis to produce an individual feature vector output corresponding to each inverse discrete fourier transform output;
  
  appending said individual feature vectors forming a grand feature vector;
  
  conditioning said grand feature vector and removing any bias from said grand feature vector using a bias remover;
  
  performing MSE/GPD training on said grand feature vector after the bias is removed;
  
  building HMMs from said MSE/GPD training; and
  
  extracting a bias removal codebook of size four from the mean vectors of said HMMs for use with said bias removal in said signal conditioning of the grand feature vector.
- - 2. The method of claim 1, wherein said transforming step includes pre-emphasizing, blocking speech into frames, frame windowing, and Fourier transformations.
  - 3. The method of claim 1 wherein said mel-filter banks having center frequencies of the filters spaced equally on a linear scale from 100 to 1000 Hz and equally on a logarithmic scale above 1000 Hz.
  - 4. The method of claim 3, wherein above 1000 Hz, each center frequency is 1.1 times the center frequency of the previous filter.
  - 5. The method of claim 4, wherein each filter's magnitude frequency response has a triangular shape in the frequency domain that is equal to unity at the center frequency and linearly decreasing to zero at the center frequencies of any adjacent filter.
  - 6. The method of claim 5 wherein the frequency domain spectrum for each frame is passed through a set of M triangular mel-filter banks, where M is set to 24 for a preferred embodiment.
  - 7. The method of claim 1, wherein inverse discrete Fourier transforms are applied to smooth said frequency spectrum and to yield a plurality of autocorrelation coefficients.
  - 8. The method of claim 7, wherein said plurality of autocorrelation coefficients equals 10 for level 1 and 8 for level 2.
  - 9. The method of claim 1, wherein a final dimension of the cepstral vector is set to 12 cepstral features.
  - 10. The method of claim 9, wherein of said 12 cepstral features 6 features are from a lower subband and 6 features are from an upper sub-band.
  - 11. The method of claim 9 wherein of said 12 cepstral features 6 features are from level 1, 3 features from level 2 lower sub-band and 3 features from level 2 upper sub-band.
  - 12. The method of claim 1, wherein said cepstral vector has at least one feature from level 1 subband, at least one feature from a level 2 subband and at least one feature from a level 3 subband.
  - 13. The method of claim 1, wherein each input feature vector is extended beyond the 12 HSLPC features and the energy feature with the first and second order derivatives thereof resulting in a 39-dimensional feature vector.
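
As an illustration of the filter layout recited in claims 3 through 6, the following Python sketch generates 24 center frequencies and evaluates one triangular response. The function names and the 100 Hz linear step are assumptions made for this example; the claims themselves state only the 100 to 1000 Hz linear range, the factor of 1.1 above 1000 Hz, and M = 24.

```python
def mel_center_frequencies(num_filters=24, linear_step_hz=100.0, ratio=1.1):
    """Return filter center frequencies in Hz.

    Linear spacing from 100 Hz up to 1000 Hz, then each center is
    `ratio` times the previous one. The 100 Hz step is an assumption;
    the claims do not state the linear step size.
    """
    centers = []
    freq = 100.0
    while freq <= 1000.0 and len(centers) < num_filters:
        centers.append(freq)
        freq += linear_step_hz
    while len(centers) < num_filters:
        centers.append(centers[-1] * ratio)
    return centers


def triangular_response(freq_hz, lower_hz, center_hz, upper_hz):
    """Triangular magnitude response: 1.0 at the center frequency,
    falling linearly to 0.0 at the adjacent filters' center frequencies."""
    if lower_hz < freq_hz <= center_hz:
        return (freq_hz - lower_hz) / (center_hz - lower_hz)
    if center_hz < freq_hz < upper_hz:
        return (upper_hz - freq_hz) / (upper_hz - center_hz)
    return 0.0


if __name__ == "__main__":
    centers = mel_center_frequencies()
    print(len(centers), centers[9], round(centers[-1], 1))
```

Under these assumptions the first ten centers fall at 100, 200, ..., 1000 Hz and the remaining fourteen grow by a factor of 1.1 each, ending near 3797 Hz. The lower edge of the first filter and the upper edge of the last filter are not defined by the claims and are left to the caller.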

14. A speech recognizer comprising:
- means for receiving a band limited voice input utterance that is time varying;
  
  means for transforming said utterance using a fast fourier transform process to a frequency domain spectrum;
  
  means for forwarding said frequency domain spectrum to a plurality of mel filter banks, at least one of said plurality of mel filter banks having a plurality of sub-bands filtering said frequency spectrum;
  
  means for transforming an output of each of said plurality of mel-filter banks using an inverse discrete fourier transform process to obtain a processed speech output that is time varying from each of said mel-filter banks and an additional time varying output for each sub-band above one for each mel-filter bank;
  
  means for analyzing each output of each of time varying outputs of each inverse discrete fourier transform process using a respective linear prediction cepstral analysis to produce an individual feature vector output corresponding to each inverse discrete fourier transform output;
  
  means for appending said individual feature vectors forming a grand feature vector;
  
  means for conditioning said grand feature vector and removing any bias from said grand feature vector using a bias remover; and
  
  means for decoding said grand feature vector after the bias is removed.
- - 15. The speech recognizer of claim 14 wherein said decoding is performed on said grand feature vector using HMMs and bias removal codebooks.
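
The linear prediction cepstral analysis recited in claims 1, 14, and 16 is commonly carried out by solving for predictor coefficients from autocorrelation values with the Levinson-Durbin recursion and then converting those coefficients to cepstral coefficients. The Python sketch below shows that textbook procedure; it is not taken from the patent specification, and the function names and argument layout are assumptions made for illustration.

```python
def levinson_durbin(autocorr, order):
    """Solve for predictor coefficients a_1..a_order such that
    s[n] is approximated by sum_k a_k * s[n-k].

    `autocorr` holds r[0]..r[order]. Returns (coefficients, residual_error).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = autocorr[0]
    for i in range(1, order + 1):
        acc = autocorr[i] - sum(a[j] * autocorr[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this order
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


def lpc_to_cepstrum(coeffs, num_ceps):
    """Convert predictor coefficients a_1..a_p to cepstral coefficients
    c_1..c_num_ceps of the all-pole model 1 / (1 - sum_k a_k z^-k)."""
    p = len(coeffs)
    c = [0.0] * (num_ceps + 1)  # c[0] is unused
    for n in range(1, num_ceps + 1):
        acc = coeffs[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * coeffs[n - k - 1]
        c[n] = acc
    return c[1:]


if __name__ == "__main__":
    # Autocorrelation of a first-order process with pole at 0.5.
    coeffs, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
    print(coeffs, err, lpc_to_cepstrum(coeffs, num_ceps=3))
```

In the claimed method this analysis would run once per inverse discrete Fourier transform output, so the prediction order and the number of cepstral features kept can differ between levels and sub-bands, as claims 8 through 11 describe.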

16. A speech recognizer method comprising steps of:
- receiving a band limited voice input utterance that is time varying;
  
  transforming said utterance using a fast fourier transform process to a frequency domain spectrum;
  
  forwarding said frequency domain spectrum to a plurality of mel filter banks, at least one of said plurality of mel filter banks having a plurality of sub-bands filtering said frequency spectrum;
  
  transforming an output of each of said plurality of mel-filter banks using an inverse discrete fourier transform process to obtain a processed speech output that is time varying from each of said mel-filter banks and an additional time varying output for each sub-band above one for each mel-filter bank;
  
  analyzing each output of each of time varying outputs of each inverse discrete fourier transform process using a respective linear prediction cepstral analysis to produce an individual feature vector output corresponding to each inverse discrete fourier transform output;
  
  appending said individual feature vectors forming a grand feature vector;
  
  conditioning said grand feature vector and removing any bias from said grand feature vector using a bias remover; and
  
  decoding said grand feature vector after the bias is removed.
- - 17. The speech recognizer method of claim 16 wherein said decoding step uses HMMs and bias removal codebooks.
  - 18. The speech recognizer method of claim 16 wherein said bias remover uses cepstral mean subtraction bias removal.
  - 19. The speech recognizer method of claim 16, wherein said bias remover uses hierarchical signal bias removal.
  - 20. The speech recognizer method of claim 16, wherein said bias remover uses cepstral mean subtraction bias removal for some features of the grand feature vector and hierarchical signal bias removal for the remaining features of the grand feature vector.
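
The appending step and the cepstral mean subtraction option of claim 18 can be illustrated with a short Python sketch. It assumes each frame's per-band features arrive as plain lists and that the mean is taken over all frames of one utterance; both are assumptions for this example, and the function names are hypothetical. The hierarchical signal bias removal of claim 19 relies on a codebook and is not sketched here.

```python
def append_feature_vectors(per_band_vectors):
    """Concatenate the individual feature vectors of one frame
    into a single grand feature vector."""
    grand = []
    for vec in per_band_vectors:
        grand.extend(vec)
    return grand


def cepstral_mean_subtraction(frames):
    """Subtract the per-dimension mean, computed over all frames of an
    utterance, from every frame. Returns new lists; input is unchanged."""
    if not frames:
        return []
    dim = len(frames[0])
    count = len(frames)
    mean = [sum(frame[d] for frame in frames) / count for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in frames]


if __name__ == "__main__":
    # Two frames, each built from a level 1 vector and two level 2 vectors
    # (toy values; real vectors would hold cepstral features).
    frame_a = append_feature_vectors([[1.0, 2.0], [3.0], [4.0]])
    frame_b = append_feature_vectors([[3.0, 6.0], [5.0], [8.0]])
    print(cepstral_mean_subtraction([frame_a, frame_b]))
```

With the split described in claim 11, the three input vectors per frame would hold 6, 3, and 3 cepstral features, giving a 12-dimensional grand feature vector before bias removal.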

Current Assignee: WSOU Investments, LLC (WSOU Holdings, LLC)
Original Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Inventors: Chengalvarayan, Rathinavelu
Primary Examiner(s): Tsang, Fan
Assistant Examiner(s): Opsasnick, Michael N.
Application Number: US09/266,958
Time in Patent Office: 921 Days
Field of Search: 704/219, 704/231, 704/239, 704/243, 704/249, 704/250
US Class Current: 704/219
CPC Class Codes:
G10L 15/02   Feature extraction for spee...
G10L 15/142   Hidden Markov Models [HMMs]
G10L 25/18   the extracted parameters be...
G10L 25/24   the extracted parameters be...
