Frame-level combination of deep neural network and Gaussian mixture models
Abstract
A method and system for frame-level merging of HMM state predictions determined by different techniques is disclosed. An audio input signal may be transformed into first and second sequences of feature vectors, the sequences corresponding to each other and to a temporal sequence of frames of the audio input signal on a frame-by-frame basis. The first sequence may be processed by a neural network (NN) to determine NN-based state predictions, and the second sequence may be processed by a Gaussian mixture model (GMM) to determine GMM-based state predictions. The NN-based and GMM-based state predictions may be merged as weighted sums for each of a plurality of HMM states on a frame-by-frame basis to determine merged state predictions. The merged state predictions may then be applied to the HMMs to determine speech content of the audio input signal.
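The frame-by-frame weighted merge described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name, the dict-per-frame data layout, and the single scalar interpolation weight are all choices made here for clarity.

```python
def merge_emissions(nn_probs, gmm_probs, weight):
    """Frame-level interpolation of two sets of HMM-state emission probabilities.

    nn_probs, gmm_probs: lists (one entry per temporal frame) of dicts mapping
    an HMM state to that model's emission probability for the frame.
    weight: interpolation weight in [0, 1] applied to the NN-based probabilities;
    the GMM-based probabilities receive (1 - weight).
    """
    merged = []
    for nn_frame, gmm_frame in zip(nn_probs, gmm_probs):
        merged.append({
            state: weight * nn_frame[state] + (1.0 - weight) * gmm_frame[state]
            for state in nn_frame
        })
    return merged
```

With per-state weights instead of a single scalar, the same loop would simply look up a different weight for each state, which is the more general form the claims allow.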
23 Claims
1. A method comprising:
transforming an audio input signal, using one or more processors of a system, into a first sequence of feature vectors and a second sequence of feature vectors, both the first and second sequences of feature vectors corresponding in common to a sequence of temporal frames of the audio input signal, wherein each respective feature vector of the first sequence and a corresponding respective feature vector of the second sequence bear quantitative measures of acoustic properties of a corresponding, respective temporal frame of the sequence of temporal frames of the audio input signal;

processing the first sequence of feature vectors with a neural network (NN) implemented by the one or more processors of the system to generate an NN-based set of emission probabilities for a plurality of hidden Markov models (HMMs) implemented by the one or more processors of the system;

processing the second sequence of feature vectors with a Gaussian mixture model (GMM) implemented by the one or more processors of the system to generate a GMM-based set of emission probabilities for the plurality of HMMs;

by computing, for each temporal frame, weighted sums of the NN-based emission probabilities and the GMM-based emission probabilities, merging the NN-based set of emission probabilities with the GMM-based set of emission probabilities to generate a merged set of emission probabilities for the plurality of HMMs; and

applying the merged set of emission probabilities to the plurality of HMMs to determine speech content corresponding to the sequence of temporal frames of the audio input signal, wherein the weighted sums are computed according to weights computationally determined by at least one processor during a training process that minimizes a computationally-determined difference between computationally-predicted speech in training temporal frames and predetermined speech in the training temporal frames.

View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
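Applying a merged set of emission probabilities to the HMMs, as the final clause of claim 1 recites, typically means running a decoder such as the Viterbi algorithm over the per-frame probabilities. The sketch below is a standard textbook Viterbi, not code from the patent; the state names, start and transition tables, and probabilities are toy values assumed for illustration.

```python
def viterbi(states, start_p, trans_p, emissions):
    """Most-likely HMM state sequence given per-frame emission probabilities.

    states: list of HMM state names.
    start_p: dict state -> initial probability.
    trans_p: dict state -> dict state -> transition probability.
    emissions: list (one per frame) of dicts state -> emission probability,
    e.g. the merged NN/GMM probabilities of claim 1.
    """
    # Probability of the best path ending in each state at frame 0.
    V = [{s: start_p[s] * emissions[0][s] for s in states}]
    back = [{}]
    for t in range(1, len(emissions)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor state for reaching s at frame t.
            best_prev = max(states, key=lambda p: V[t - 1][p] * trans_p[p][s])
            V[t][s] = V[t - 1][best_prev] * trans_p[best_prev][s] * emissions[t][s]
            back[t][s] = best_prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(emissions) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

A production decoder would work in log probabilities and prune hypotheses, but the recursion is the same.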
11. A method comprising:
transforming an audio input signal, using one or more processors of a system, into a first sequence of feature vectors and a second sequence of feature vectors, both the first and second sequences of feature vectors corresponding in common to a sequence of temporal frames of the audio input signal, wherein each respective feature vector of the first sequence and a corresponding respective feature vector of the second sequence bear quantitative measures of acoustic properties of a corresponding, respective temporal frame of the sequence of temporal frames of the audio input signal;

processing each respective feature vector of the first sequence with a neural network (NN) implemented by the one or more processors of the system to determine, for each respective state of a multiplicity of states of hidden Markov models (HMMs) implemented by the one or more processors of the system, a respective NN-based conditional probability of emitting the respective feature vector of the first sequence given the respective state;

processing each respective feature vector of the second sequence with a Gaussian mixture model (GMM) implemented by the one or more processors of the system to determine, for each respective state of the multiplicity of states, a respective GMM-based conditional probability of emitting the respective feature vector of the second sequence given the respective state;

for each pair of a respective feature vector of the first sequence and a corresponding respective feature vector of the second sequence, determining, for each respective state of the multiplicity of states, a respective weighted sum of the respective NN-based conditional probability and the respective GMM-based conditional probability, each respective weighted sum being one of a set of weighted-sum emission probabilities for the multiplicity of states; and

computationally determining, by at least one processor, weights of the set of weighted-sum emission probabilities in order to reduce a difference computed by the at least one processor between (i) predicted speech content of the sequence of temporal frames computationally determined by the at least one processor by applying the set of weighted-sum emission probabilities to the multiplicity of states and (ii) pre-determined speech content of the sequence of temporal frames.
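The training clause of claim 11 — determining weights that reduce the difference between predicted and pre-determined speech content — can be illustrated with a simple scan over candidate interpolation weights against known per-frame state labels. This is a toy stand-in for the gradient-based training a real system would use, and every name in it is hypothetical.

```python
def learn_weight(nn_probs, gmm_probs, targets, steps=100):
    """Pick the NN/GMM interpolation weight that minimizes squared error
    between merged emission probabilities and one-hot frame targets.

    nn_probs, gmm_probs: lists of per-frame dicts mapping state -> probability.
    targets: list of the known (pre-determined) correct state per frame.
    """
    def loss(w):
        total = 0.0
        for nn_f, gmm_f, tgt in zip(nn_probs, gmm_probs, targets):
            for state in nn_f:
                p = w * nn_f[state] + (1.0 - w) * gmm_f[state]
                # One-hot target: 1 for the labeled state, 0 otherwise.
                total += (p - (1.0 if state == tgt else 0.0)) ** 2
        return total

    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=loss)
```

In practice the weights would be trained jointly with (or after) the acoustic models on held-out data, using a differentiable loss rather than a grid scan, but the objective — minimize the computed difference against known transcriptions — is the same.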
12. A system comprising:
one or more processors; memory; and machine-readable instructions stored in the memory, that upon execution by the one or more processors cause the system to carry out operations comprising:

transforming an audio input signal into a first sequence of feature vectors and a second sequence of feature vectors, wherein both the first and second sequences of feature vectors correspond in common to a sequence of temporal frames of the audio input signal, and wherein each respective feature vector of the first sequence and a corresponding respective feature vector of the second sequence bear quantitative measures of acoustic properties of a corresponding, respective temporal frame of the sequence of temporal frames of the audio input signal,

processing the first sequence of feature vectors with a neural network (NN) implemented by the system to generate an NN-based set of emission probabilities for a plurality of hidden Markov models (HMMs) implemented by the system,

processing the second sequence of feature vectors with a Gaussian mixture model (GMM) implemented by the system to generate a GMM-based set of emission probabilities for the plurality of HMMs,

by computing, for each temporal frame, weighted sums of the NN-based emission probabilities and the GMM-based emission probabilities, merging the NN-based set of emission probabilities with the GMM-based set of emission probabilities to generate a merged set of emission probabilities for the plurality of HMMs, and

applying the merged set of emission probabilities to the plurality of HMMs to determine speech content corresponding to the sequence of temporal frames of the audio input signal, wherein the weighted sums are computed according to weights computationally determined by at least one processor during a training process that minimizes a computationally-determined difference between computationally-predicted speech in training temporal frames and predetermined speech in the training temporal frames.

View Dependent Claims (13, 14, 15, 16)
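The first step every independent claim shares — transforming the audio input into per-frame feature vectors — can be sketched as overlapping windows over the sample stream. The frame length, hop size, and the toy features computed here (log energy and zero-crossing rate) are illustrative assumptions; a deployed recognizer would compute, e.g., MFCC or filterbank features per frame.

```python
import math

def frame_features(samples, frame_len=400, hop=160):
    """Split an audio sample stream into overlapping temporal frames and
    compute a toy two-dimensional feature vector per frame.

    At a 16 kHz sample rate, frame_len=400 and hop=160 correspond to the
    common 25 ms window with a 10 ms shift. Each output entry corresponds
    to one temporal frame, giving the frame-by-frame correspondence between
    feature vectors and frames that the claims rely on.
    """
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        log_energy = math.log(energy + 1e-10)  # floor avoids log(0) on silence
        # Fraction of adjacent sample pairs that change sign.
        zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / len(frame)
        features.append((log_energy, zcr))
    return features
```

Running the same audio through two different feature pipelines (one tuned for the NN, one for the GMM) yields the claims' first and second sequences of feature vectors for the same sequence of frames.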
17. A tangible, non-transitory computer-readable storage medium, having stored thereon program instructions that, upon execution by one or more processors of a system, cause the system to perform operations comprising:
transforming an audio input signal into a first sequence of feature vectors and a second sequence of feature vectors, wherein both the first and second sequences of feature vectors correspond in common to a sequence of temporal frames of the audio input signal, and wherein each respective feature vector of the first sequence and a corresponding respective feature vector of the second sequence bear quantitative measures of acoustic properties of a corresponding, respective temporal frame of the sequence of temporal frames of the audio input signal;

processing the first sequence of feature vectors with a neural network (NN) implemented by the system to generate an NN-based set of emission probabilities for a plurality of hidden Markov models (HMMs) implemented by the system;

processing the second sequence of feature vectors with a Gaussian mixture model (GMM) implemented by the system to generate a GMM-based set of emission probabilities for the plurality of HMMs;

by computing, for each temporal frame, weighted sums of the NN-based emission probabilities and the GMM-based emission probabilities, merging the NN-based set of emission probabilities with the GMM-based set of emission probabilities to generate a merged set of emission probabilities for the plurality of HMMs; and

applying the merged set of emission probabilities to the plurality of HMMs to determine speech content corresponding to the sequence of temporal frames of the audio input signal, wherein the weighted sums are computed according to weights computationally determined by at least one processor during a training process that minimizes a computationally-determined difference between computationally-predicted speech in training temporal frames and predetermined speech in the training temporal frames.

View Dependent Claims (18, 19, 20, 21, 22)
23. A tangible, non-transitory computer-readable storage medium, having stored thereon program instructions that, upon execution by one or more processors of a system, cause the system to perform operations comprising:
transforming an audio input signal into a first sequence of feature vectors and a second sequence of feature vectors, wherein both the first and second sequences of feature vectors correspond in common to a sequence of temporal frames of the audio input signal, and wherein each respective feature vector of the first sequence and a corresponding respective feature vector of the second sequence bear quantitative measures of acoustic properties of a corresponding, respective temporal frame of the sequence of temporal frames of the audio input signal;

processing each respective feature vector of the first sequence with a neural network (NN) implemented by the system to determine, for each respective state of a multiplicity of states of hidden Markov models (HMMs) implemented by the system, a respective NN-based conditional probability of emitting the respective feature vector of the first sequence given the respective state;

processing each respective feature vector of the second sequence with a Gaussian mixture model (GMM) implemented by the system to determine, for each respective state of the multiplicity of states, a respective GMM-based conditional probability of emitting the respective feature vector of the second sequence given the respective state;

for each pair of a respective feature vector of the first sequence and a corresponding respective feature vector of the second sequence, determining, for each respective state of the multiplicity of states, a respective weighted sum of the respective NN-based conditional probability and the respective GMM-based conditional probability, each respective weighted sum being one of a set of weighted-sum emission probabilities for the multiplicity of states; and

computationally determining, by at least one processor, weights of the set of weighted-sum emission probabilities in order to reduce a difference computed by the at least one processor between (i) predicted speech content of the sequence of temporal frames computationally determined by the at least one processor by applying the set of weighted-sum emission probabilities to the multiplicity of states and (ii) pre-determined speech content of the sequence of temporal frames.