Speech recognition

US 8,275,619 B2
Filed: 09/02/2009
Issued: 09/25/2012
Est. Priority Date: 09/03/2008
Status: Active Grant

Abstract

The present invention relates to a method for speech recognition of a speech signal. The method comprises providing at least one codebook comprising codebook entries, in particular multivariate Gaussians of feature vectors, that are frequency weighted such that higher weights are assigned to entries corresponding to frequencies below a predetermined level than to entries corresponding to frequencies above the predetermined level; and processing the speech signal for speech recognition, which comprises extracting at least one feature vector from the speech signal and matching the feature vector with the entries of the codebook.
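
The frequency weighting described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the mel-scale formula (a commonly used one), the number of bands, the 1000 Hz cutoff standing in for the "predetermined level", and the two weight values.

```python
import math

def hz_to_mel(f_hz):
    """Common mel-scale formula (assumed; the source does not specify one)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_centers(n_bands, f_min_hz, f_max_hz):
    """Center frequencies (Hz) of n_bands spaced equally on the mel scale."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]

def band_weights(centers_hz, cutoff_hz, low_weight=1.0, high_weight=0.5):
    """Step weighting: bands below the cutoff get the higher weight.

    The weight values are hypothetical placeholders.
    """
    return [low_weight if c < cutoff_hz else high_weight for c in centers_hz]

# Hypothetical configuration: 8 bands up to 4 kHz, cutoff at 1 kHz.
centers = mel_band_centers(n_bands=8, f_min_hz=0.0, f_max_hz=4000.0)
weights = band_weights(centers, cutoff_hz=1000.0)

for c, w in zip(centers, weights):
    print(f"{c:8.1f} Hz -> weight {w}")
```

A step function is only the simplest weighting that satisfies "higher below the level than above it"; a smooth roll-off would satisfy the same description.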

13 Claims

1. A processor implemented method for speech recognition of a speech signal comprising:
- within a processor;
  
  providing at least one codebook comprising codebook entries, in particular, multivariate Gaussians of feature vectors, that are frequency weighted; and
  
  processing the speech signal for speech recognition comprising;
  
  extracting at least one feature vector from the speech signal and matching the feature vector with the entries of the codebook;
  
  providing at least one additional codebook comprising codebook entries, in particular, multivariate Gaussians of feature vectors, without frequency weights;
  
  determining whether the speech signal corresponds to an utterance of a native speaker or to an utterance of a non-native speaker; and
  
  if it is determined that the speech signal corresponds to the utterance of a native speaker, using the at least one additional codebook comprising codebook entries without frequency weights for the speech recognition;
  
  or, if it is determined that the speech signal corresponds to the utterance of a non-native speaker, using the at least one codebook comprising codebook entries that are frequency weighted for the speech recognition.
- - 2. The method according to claim 1, wherein the entries of the at least one codebook are obtained by:
    - detecting training speech signals corresponding to utterances of one or more native speakers;
      
      transforming the training speech signals into a frequency domain;
      
      applying a MEL filterbank to the transformed training speech signals in a log-Mel domain and saving the log value of each bank to derive first representations of the training speech signals;
      
      applying a Discrete Cosine Transform to the first representations of the training speech signals to obtain second transformed training speech signals;
      
      applying a Linear Discriminant Analysis to the second transformed training speech signals to obtain third transformed training speech signals;
      
      extracting feature vectors from the third transformed training speech signals;
      
      determining covariance matrices for the extracted feature vectors in the Linear Discriminant Analysis domain;
      
      transforming the covariance matrices into the log-MEL domain;
      
      applying weights to the covariances of the covariance matrices in the log-MEL domain to obtain modified covariance matrices in the log-MEL domain; and
      
      transforming the modified covariance matrices from the log-MEL domain into the Linear Discriminant Analysis domain; and
      
      determining Gaussians from the modified covariance matrices to obtain the entries of the at least one codebook.
  - 3. The method according to claim 1, wherein the entries of the at least one codebook are obtained by:
    - detecting training speech signals corresponding to utterances of one or more native speakers;
      
      transforming the training speech signals into a frequency domain;
      
      applying a MEL filterbank to the transformed training speech signals in a log-Mel domain and saving the log value of each bank to derive first representations of the training speech signals;
      
      applying a Discrete Cosine Transform to the first representations of the training speech signals to obtain second transformed training speech signals;
      
      extracting feature vectors from the second transformed training speech signals;
      
      determining covariance matrices for the extracted feature vectors in the Discrete Cosine Transform domain;
      
      transforming the covariance matrices into the log-MEL domain;
      
      applying weights to the covariances of the covariance matrices in the log-MEL domain to obtain modified covariance matrices in the log-MEL domain; and
      
      transforming the modified covariance matrices from the log-MEL domain into the Linear Discriminant Analysis domain;
      
      determining Gaussians from the modified covariance matrices to obtain the entries of the at least one codebook.
  - 4. The method according to claim 1, further comprising:
    - transforming the speech signal into a frequency domain, in particular, considering only the envelope of the speech signal;
      
      applying a MEL filterbank to the transformed speech signal and saving the log value of each bank to derive a first representation of the speech signal;
      
      applying a Discrete Cosine Transform to the first representation of the speech signal to obtain a second transformed speech signal;
      
      applying a Linear Discriminant Analysis to the second transformed speech signal to obtain a third transformed speech signal;
      
      extracting at least one feature vector from the third transformed speech signal; and
      
      recognizing the speech signal by determining a best matching Gaussian of the codebook for the extracted at least one feature vector.
  - 5. The method according to claim 1, further comprising:
    - generating likelihoods of sequences of speech segments by a Hidden Markov Model representing phonemes of speech; and
      
      recognizing at least one spoken word by means of the at least one codebook and the generated likelihoods.
  - 6. The method according to claim 1, wherein determining whether the speech signal corresponds to the utterance of a native speaker or to the utterance of a non-native speaker comprises:
    - either:
      
      a. automatically classifying the speech signal as a signal corresponding to an utterance by a native or a non-native speaker, or
      
      b. detecting an input by a user indicating that the speech signal corresponds to an utterance by a native or a non-native speaker.
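
The covariance steps in claims 2 and 3 (transform into the log-MEL domain, apply weights, transform back) can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented procedure: it uses only a square orthonormal DCT as the linear transform and leaves out the Linear Discriminant Analysis step, and it assumes that "applying weights to the covariances" means scaling element (i, j) by `weights[i] * weights[j]`. The function names are hypothetical.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix C (n x n), so that the inverse of C is C^T."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n)
                     for i in range(n)])
    return rows

def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def reweight_covariance(cov_dct, weights):
    """Weight a DCT-domain covariance matrix in the log-MEL domain.

    If x_dct = C x_mel, then cov_dct = C cov_mel C^T, hence
    cov_mel = C^T cov_dct C for an orthonormal C.
    """
    n = len(cov_dct)
    c = dct_matrix(n)
    ct = transpose(c)
    # Step 1: transform the covariance into the log-MEL domain.
    cov_mel = matmul(matmul(ct, cov_dct), c)
    # Step 2 (assumed rule): scale entry (i, j) by weights[i] * weights[j],
    # which keeps the matrix symmetric.
    cov_mel_w = [[weights[i] * weights[j] * cov_mel[i][j] for j in range(n)]
                 for i in range(n)]
    # Step 3: transform the modified covariance back to the feature domain.
    return matmul(matmul(c, cov_mel_w), ct)

# Hypothetical 4-dimensional example with lower weights on the upper bands.
example_cov = [[2.0, 0.3, 0.0, 0.0],
               [0.3, 1.0, 0.1, 0.0],
               [0.0, 0.1, 0.5, 0.0],
               [0.0, 0.0, 0.0, 0.2]]
modified_cov = reweight_covariance(example_cov, [1.0, 1.0, 0.5, 0.5])
```

With a dimension-reducing transform such as LDA the matrix is no longer square, so the back-transformation would need a pseudo-inverse or a similar approximation; the sketch avoids that by staying with a square DCT.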

7. A computer program product comprising a non-transitory computer readable medium having computer-executable instructions thereon for speech recognition of a speech signal comprising, the computer-executable instructions comprising:
- computer code for providing at least one codebook comprising codebook entries, in particular, multivariate Gaussians of feature vectors, that are frequency weighted; and
  
  computer code for processing the speech signal for speech recognition comprising;
  
  computer code for extracting at least one feature vector from the speech signal and computer code for matching the feature vector with the entries of the codebook;
  
  computer code for providing at least one additional codebook comprising codebook entries, in particular, multivariate Gaussians of feature vectors, without frequency weights;
  
  computer code for determining whether the speech signal corresponds to an utterance of a native speaker or to an utterance of a non-native speaker; and
  
  computer code for using the at least one additional codebook comprising codebook entries without frequency weights for the speech recognition if it is determined that the speech signal corresponds to the utterance of a native speaker;
  
  computer code for using the at least one codebook comprising codebook entries that are frequency weighted for the speech recognition if it is determined that the speech signal corresponds to the utterance of a non-native speaker.
- - 8. The computer program product according to claim 7, wherein the entries of the at least one codebook are obtained by:
    - detecting training speech signals corresponding to utterances of one or more native speakers;
      
      transforming the training speech signals into a frequency domain;
      
      applying a MEL filterbank to the transformed training speech signals and saving the log value of each bank to derive first representations of the training speech signals;
      
      applying a Discrete Cosine Transform to the first representations of the training speech signals to obtain second transformed training speech signals;
      
      applying a Linear Discriminant Analysis to the second transformed training speech signals to obtain third transformed training speech signals;
      
      extracting feature vectors from the third transformed training speech signals;
      
      determining covariance matrices for the extracted feature vectors in the Linear Discriminant Analysis domain;
      
      transforming the covariance matrices into the log-MEL domain;
      
      applying weights to the covariances of the covariance matrices in the log-MEL domain to obtain modified covariance matrices in the log-MEL domain;
      
      transforming the modified covariance matrices from the log-MEL domain into the Linear Discriminant Analysis domain; and
      
      determining Gaussians from the modified covariance matrices to obtain the entries of the at least one codebook.
  - 9. The computer program product according to claim 7, wherein the entries of the at least one codebook are obtained by:
    - detecting training speech signals corresponding to utterances of one or more native speakers;
      
      transforming the training speech signals into a frequency domain;
      
      applying a MEL filterbank to the transformed training speech signals and saving the log value of each bank to derive first representations of the training speech signals;
      
      applying a Discrete Cosine Transform to the first representations of the training speech signals to obtain second transformed training speech signals;
      
      extracting feature vectors from the second transformed training speech signals;
      
      determining covariance matrices for the extracted feature vectors in the Discrete Cosine Transform domain;
      
      transforming the covariance matrices into the log-MEL domain;
      
      applying weights to the covariances of the covariance matrices in the log-MEL domain to obtain modified covariance matrices in the log-MEL domain;
      
      transforming the modified covariance matrices from the log-MEL domain into the Linear Discriminant Analysis domain; and
      
      determining Gaussians from the modified covariance matrices to obtain the entries of the at least one codebook.
  - 10. The computer program product according to claim 7, further comprising:
    - computer code for transforming the speech signal into the frequency domain;
      
      computer code for applying a MEL filterbank to the transformed speech signal and saving the log value of each bank to derive a first representation of the speech signal;
      
      computer code for applying a Discrete Cosine Transform to the first representation of the speech signal to obtain a second transformed speech signal;
      
      computer code for applying a Linear Discriminant Analysis to the second transformed speech signal to obtain a third transformed speech signal;
      
      computer code for extracting at least one feature vector from the third transformed speech signal; and
      
      computer code for recognizing the speech signal by determining a best matching Gaussian of the codebook for the extracted at least one feature vector.
  - 11. The computer program product according to claim 7, further comprising:
    - computer code for generating likelihoods of sequences of speech segments by a Hidden Markov Model representing phonemes of speech; and
      
      computer code for recognizing at least one spoken word by means of the at least one codebook and the generated likelihoods.
  - 12. The computer program product according to claim 11, wherein the computer code for determining whether the speech signal corresponds to the utterance of a native speaker or to the utterance of a non-native speaker comprises:
    - computer code for automatically classifying the speech signal as a signal corresponding to an utterance by a native or a non-native speaker.
  - 13. The computer program product according to claim 11 wherein the computer code for determining whether the speech signal corresponds to the utterance of a native speaker or to the utterance of a non-native speaker comprises:
    - computer code for detecting an input by a user indicating that the speech signal corresponds to an utterance by a native or a non-native speaker.
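
The codebook selection in claims 1 and 7, combined with the best-matching-Gaussian step of claims 4 and 10, might look roughly like the sketch below. It is illustrative only: the `Gaussian` class, the function names, the diagonal covariances, and all numeric values are assumptions, and the native/non-native decision is passed in as a plain flag rather than computed by a classifier.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Gaussian:
    """One codebook entry; diagonal covariance is a simplifying assumption."""
    label: str
    mean: List[float]
    var: List[float]

    def log_likelihood(self, x: Sequence[float]) -> float:
        ll = 0.0
        for xi, m, v in zip(x, self.mean, self.var):
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        return ll

def select_codebook(is_native: bool,
                    unweighted: List[Gaussian],
                    weighted: List[Gaussian]) -> List[Gaussian]:
    """Native speaker: codebook without frequency weights.
    Non-native speaker: frequency-weighted codebook."""
    return unweighted if is_native else weighted

def best_match(codebook: List[Gaussian], feature: Sequence[float]) -> Gaussian:
    """Return the entry with the highest log-likelihood for the feature."""
    return max(codebook, key=lambda g: g.log_likelihood(feature))

# Toy codebooks with made-up numbers; the "weighted" one differs only in
# its variances, standing in for the modified covariance matrices.
unweighted_cb = [Gaussian("A", [0.0, 0.0], [1.0, 1.0]),
                 Gaussian("B", [1.0, 1.0], [1.0, 1.0])]
weighted_cb = [Gaussian("A", [0.0, 0.0], [1.0, 0.25]),
               Gaussian("B", [1.0, 1.0], [1.0, 0.25])]

feature_vector = [0.9, 0.2]
for is_native in (True, False):
    codebook = select_codebook(is_native, unweighted_cb, weighted_cb)
    match = best_match(codebook, feature_vector)
    print("native" if is_native else "non-native", "->", match.label)
```

The flag could come from either route named in claims 6, 12 and 13: an automatic classifier or an explicit user input.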

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Herbig, Tobias; Raab, Martin; Brueckner, Raymond; Gruhn, Rainer
Primary Examiner(s): He, Jialong
Application Number: US12/552,517
Publication Number: US 20100057462A1
Time in Patent Office: 1,119 Days
Field of Search: 704200-278
US Class Current: 704/251
CPC Class Codes:
G10L 15/02   Feature extraction for spee...
G10L 15/06   Creation of reference templ...
G10L 15/142   Hidden Markov Models [HMMs]
G10L 2015/025   Phonemes, fenemes or fenone...
