Frame-level combination of deep neural network and Gaussian mixture models

  • US 9,240,184 B1
  • Filed: 02/12/2013
  • Issued: 01/19/2016
  • Est. Priority Date: 11/15/2012
  • Status: Active Grant
First Claim

1. A method comprising:

    transforming an audio input signal, using one or more processors of a system, into a first sequence of feature vectors and a second sequence of feature vectors, both the first and second sequences of feature vectors corresponding in common to a sequence of temporal frames of the audio input signal, wherein each respective feature vector of the first sequence and a corresponding respective feature vector of the second sequence bear quantitative measures of acoustic properties of a corresponding, respective temporal frame of the sequence of temporal frames of the audio input signal;

    processing the first sequence of feature vectors with a neural network (NN) implemented by the one or more processors of the system to generate a NN-based set of emission probabilities for a plurality of hidden Markov models (HMMs) implemented by the one or more processors of the system;

    processing the second sequence of feature vectors with a Gaussian mixture model (GMM) implemented by the one or more processors of the system to generate a GMM-based set of emission probabilities for the plurality of HMMs;

    by computing, for each temporal frame, weighted sums of the NN-based emission probabilities and the GMM-based emission probabilities, merging the NN-based set of emission probabilities with the GMM-based set of emission probabilities to generate a merged set of emission probabilities for the plurality of HMMs; and

    applying the merged set of emission probabilities to the plurality of HMMs to determine speech content corresponding to the sequence of temporal frames of the audio input signal,

    wherein the weighted sums are computed according to weights computationally-determined by at least one processor during a training process that minimizes a computationally-determined difference between computationally-predicted speech in training temporal frames and predetermined speech in the training temporal frames.
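
The merging step recited above can be illustrated with a minimal sketch. It is not the patented implementation. It assumes a single scalar weight shared by all frames and HMM states, and a convex combination (the GMM weight is one minus the NN weight); the claim itself requires only "weighted sums". The names `merge_emissions`, `frame_error_rate`, and `fit_weight` are hypothetical. The grid search over frame-level classification error is a simple stand-in for the training process of the claim, which is not specified in detail here.

```python
from typing import List, Sequence

# One frame = one emission probability per HMM state (illustrative layout).
Frame = Sequence[float]


def merge_emissions(nn_probs: Sequence[Frame],
                    gmm_probs: Sequence[Frame],
                    nn_weight: float) -> List[List[float]]:
    """Per-frame weighted sum of NN-based and GMM-based emission probabilities.

    Assumption: one scalar weight, with the GMM weight set to 1 - nn_weight.
    """
    if len(nn_probs) != len(gmm_probs):
        raise ValueError("both sequences must cover the same temporal frames")
    merged = []
    for nn_frame, gmm_frame in zip(nn_probs, gmm_probs):
        if len(nn_frame) != len(gmm_frame):
            raise ValueError("both models must score the same HMM states")
        merged.append([nn_weight * p_nn + (1.0 - nn_weight) * p_gmm
                       for p_nn, p_gmm in zip(nn_frame, gmm_frame)])
    return merged


def frame_error_rate(merged: Sequence[Frame],
                     reference_states: Sequence[int]) -> float:
    """Fraction of frames whose highest-scoring state differs from the label."""
    errors = 0
    for frame, ref in zip(merged, reference_states):
        predicted = max(range(len(frame)), key=lambda s: frame[s])
        if predicted != ref:
            errors += 1
    return errors / len(reference_states)


def fit_weight(nn_probs: Sequence[Frame],
               gmm_probs: Sequence[Frame],
               reference_states: Sequence[int],
               steps: int = 10) -> float:
    """Pick the NN weight on a coarse grid that minimizes frame error.

    A stand-in for the claimed training process: it minimizes a difference
    between predicted and predetermined (labeled) states on training frames.
    """
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda w: frame_error_rate(
                   merge_emissions(nn_probs, gmm_probs, w), reference_states))
```

In a full recognizer the merged scores would then be passed to HMM decoding (for example, a Viterbi search) in place of the scores of either model alone; that step is omitted from this sketch.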
