Frame-level combination of deep neural network and Gaussian mixture models
Abstract
A method and system for frame-level merging of HMM state predictions determined by different techniques is disclosed. An audio input signal may be transformed into first and second sequences of feature vectors, the sequences corresponding to each other and to a temporal sequence of frames of the audio input signal on a frame-by-frame basis. The first sequence may be processed by a neural network (NN) to determine NN-based state predictions, and the second sequence may be processed by a Gaussian mixture model (GMM) to determine GMM-based state predictions. The NN-based and GMM-based state predictions may be merged as weighted sums for each of a plurality of HMM states on a frame-by-frame basis to determine merged state predictions. The merged state predictions may then be applied to the HMMs to determine speech content of the audio input signal.
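The frame-by-frame weighted merge described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name, the dict-per-frame data layout, and the single scalar interpolation weight are all choices made here for clarity.

```python
def merge_emissions(nn_probs, gmm_probs, weight):
    """Frame-level interpolation of two sets of HMM-state emission probabilities.

    nn_probs, gmm_probs: lists (one entry per temporal frame) of dicts mapping
    an HMM state to that model's emission probability for the frame.
    weight: interpolation weight in [0, 1] applied to the NN-based probabilities;
    the GMM-based probabilities receive (1 - weight).
    """
    merged = []
    for nn_frame, gmm_frame in zip(nn_probs, gmm_probs):
        merged.append({
            state: weight * nn_frame[state] + (1.0 - weight) * gmm_frame[state]
            for state in nn_frame
        })
    return merged
```

With per-state weights instead of a single scalar, the same loop would simply look up a different weight for each state, which is the more general form the claims allow.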
23 Claims
1. A method comprising:
transforming an audio input signal, using one or more processors of a system, into a first sequence of feature vectors and a second sequence of feature vectors, both the first and second sequences of feature vectors corresponding in common to a sequence of temporal frames of the audio input signal, wherein each respective feature vector of the first sequence and a corresponding respective feature vector of the second sequence bear quantitative measures of acoustic properties of a corresponding, respective temporal frame of the sequence of temporal frames of the audio input signal;

processing the first sequence of feature vectors with a neural network (NN) implemented by the one or more processors of the system to generate an NN-based set of emission probabilities for a plurality of hidden Markov models (HMMs) implemented by the one or more processors of the system;

processing the second sequence of feature vectors with a Gaussian mixture model (GMM) implemented by the one or more processors of the system to generate a GMM-based set of emission probabilities for the plurality of HMMs;

by computing, for each temporal frame, weighted sums of the NN-based emission probabilities and the GMM-based emission probabilities, merging the NN-based set of emission probabilities with the GMM-based set of emission probabilities to generate a merged set of emission probabilities for the plurality of HMMs; and

applying the merged set of emission probabilities to the plurality of HMMs to determine speech content corresponding to the sequence of temporal frames of the audio input signal, wherein the weighted sums are computed according to weights computationally determined by at least one processor during a training process that minimizes a computationally-determined difference between computationally-predicted speech in training temporal frames and predetermined speech in the training temporal frames.

View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
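Applying a merged set of emission probabilities to the HMMs, as the final clause of claim 1 recites, typically means running a decoder such as the Viterbi algorithm over the per-frame probabilities. The sketch below is a standard textbook Viterbi, not code from the patent; the state names, start and transition tables, and probabilities are toy values assumed for illustration.

```python
def viterbi(states, start_p, trans_p, emissions):
    """Most-likely HMM state sequence given per-frame emission probabilities.

    states: list of HMM state names.
    start_p: dict state -> initial probability.
    trans_p: dict state -> dict state -> transition probability.
    emissions: list (one per frame) of dicts state -> emission probability,
    e.g. the merged NN/GMM probabilities of claim 1.
    """
    # Probability of the best path ending in each state at frame 0.
    V = [{s: start_p[s] * emissions[0][s] for s in states}]
    back = [{}]
    for t in range(1, len(emissions)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor state for reaching s at frame t.
            best_prev = max(states, key=lambda p: V[t - 1][p] * trans_p[p][s])
            V[t][s] = V[t - 1][best_prev] * trans_p[best_prev][s] * emissions[t][s]
            back[t][s] = best_prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(emissions) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

A production decoder would work in log probabilities and prune hypotheses, but the recursion is the same.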
11. A method comprising:
transforming an audio input signal, using one or more processors of a system, into a first sequence of feature vectors and a second sequence of feature vectors, both the first and second sequences of feature vectors corresponding in common to a sequence of temporal frames of the audio input signal, wherein each respective feature vector of the first sequence and a corresponding respective feature vector of the second sequence bear quantitative measures of acoustic properties of a corresponding, respective temporal frame of the sequence of temporal frames of the audio input signal;

processing each respective feature vector of the first sequence with a neural network (NN) implemented by the one or more processors of the system to determine, for each respective state of a multiplicity of states of hidden Markov models (HMMs) implemented by the one or more processors of the system, a respective NN-based conditional probability of emitting the respective feature vector of the first sequence given the respective state;

processing each respective feature vector of the second sequence with a Gaussian mixture model (GMM) implemented by the one or more processors of the system to determine, for each respective state of the multiplicity of states, a respective GMM-based conditional probability of emitting the respective feature vector of the second sequence given the respective state;

for each pair of a respective feature vector of the first sequence and a corresponding respective feature vector of the second sequence, determining, for each respective state of the multiplicity of states, a respective weighted sum of the respective NN-based conditional probability and the respective GMM-based conditional probability, each respective weighted sum being one of a set of weighted-sum emission probabilities for the multiplicity of states; and

computationally determining, by at least one processor, weights of the set of weighted-sum emission probabilities in order to reduce a difference computed by the at least one processor between (i) predicted speech content of the sequence of temporal frames computationally determined by the at least one processor by applying the set of weighted-sum emission probabilities to the multiplicity of states and (ii) pre-determined speech content of the sequence of temporal frames.
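The training clause of claim 11 — determining weights that reduce the difference between predicted and pre-determined speech content — can be illustrated with a simple scan over candidate interpolation weights against known per-frame state labels. This is a toy stand-in for the gradient-based training a real system would use, and every name in it is hypothetical.

```python
def learn_weight(nn_probs, gmm_probs, targets, steps=100):
    """Pick the NN/GMM interpolation weight that minimizes squared error
    between merged emission probabilities and one-hot frame targets.

    nn_probs, gmm_probs: lists of per-frame dicts mapping state -> probability.
    targets: list of the known (pre-determined) correct state per frame.
    """
    def loss(w):
        total = 0.0
        for nn_f, gmm_f, tgt in zip(nn_probs, gmm_probs, targets):
            for state in nn_f:
                p = w * nn_f[state] + (1.0 - w) * gmm_f[state]
                # One-hot target: 1 for the labeled state, 0 otherwise.
                total += (p - (1.0 if state == tgt else 0.0)) ** 2
        return total

    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=loss)
```

In practice the weights would be trained jointly with (or after) the acoustic models on held-out data, using a differentiable loss rather than a grid scan, but the objective — minimize the computed difference against known transcriptions — is the same.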
12. A system comprising:
one or more processors; memory; and machine-readable instructions stored in the memory, that upon execution by the one or more processors cause the system to carry out operations comprising:

transforming an audio input signal into a first sequence of feature vectors and a second sequence of feature vectors, wherein both the first and second sequences of feature vectors correspond in common to a sequence of temporal frames of the audio input signal, and wherein each respective feature vector of the first sequence and a corresponding respective feature vector of the second sequence bear quantitative measures of acoustic properties of a corresponding, respective temporal frame of the sequence of temporal frames of the audio input signal,

processing the first sequence of feature vectors with a neural network (NN) implemented by the system to generate an NN-based set of emission probabilities for a plurality of hidden Markov models (HMMs) implemented by the system,

processing the second sequence of feature vectors with a Gaussian mixture model (GMM) implemented by the system to generate a GMM-based set of emission probabilities for the plurality of HMMs,

by computing, for each temporal frame, weighted sums of the NN-based emission probabilities and the GMM-based emission probabilities, merging the NN-based set of emission probabilities with the GMM-based set of emission probabilities to generate a merged set of emission probabilities for the plurality of HMMs, and

applying the merged set of emission probabilities to the plurality of HMMs to determine speech content corresponding to the sequence of temporal frames of the audio input signal, wherein the weighted sums are computed according to weights computationally determined by at least one processor during a training process that minimizes a computationally-determined difference between computationally-predicted speech in training temporal frames and predetermined speech in the training temporal frames.

View Dependent Claims (13, 14, 15, 16)
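The first step every independent claim shares — transforming the audio input into per-frame feature vectors — can be sketched as overlapping windows over the sample stream. The frame length, hop size, and the toy features computed here (log energy and zero-crossing rate) are illustrative assumptions; a deployed recognizer would compute, e.g., MFCC or filterbank features per frame.

```python
import math

def frame_features(samples, frame_len=400, hop=160):
    """Split an audio sample stream into overlapping temporal frames and
    compute a toy two-dimensional feature vector per frame.

    At a 16 kHz sample rate, frame_len=400 and hop=160 correspond to the
    common 25 ms window with a 10 ms shift. Each output entry corresponds
    to one temporal frame, giving the frame-by-frame correspondence between
    feature vectors and frames that the claims rely on.
    """
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        log_energy = math.log(energy + 1e-10)  # floor avoids log(0) on silence
        # Fraction of adjacent sample pairs that change sign.
        zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / len(frame)
        features.append((log_energy, zcr))
    return features
```

Running the same audio through two different feature pipelines (one tuned for the NN, one for the GMM) yields the claims' first and second sequences of feature vectors for the same sequence of frames.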
17. A tangible, non-transitory computer-readable storage medium, having stored thereon program instructions that, upon execution by one or more processors of a system, cause the system to perform operations comprising:
transforming an audio input signal into a first sequence of feature vectors and a second sequence of feature vectors, wherein both the first and second sequences of feature vectors correspond in common to a sequence of temporal frames of the audio input signal, and wherein each respective feature vector of the first sequence and a corresponding respective feature vector of the second sequence bear quantitative measures of acoustic properties of a corresponding, respective temporal frame of the sequence of temporal frames of the audio input signal;

processing the first sequence of feature vectors with a neural network (NN) implemented by the system to generate an NN-based set of emission probabilities for a plurality of hidden Markov models (HMMs) implemented by the system;

processing the second sequence of feature vectors with a Gaussian mixture model (GMM) implemented by the system to generate a GMM-based set of emission probabilities for the plurality of HMMs;

by computing, for each temporal frame, weighted sums of the NN-based emission probabilities and the GMM-based emission probabilities, merging the NN-based set of emission probabilities with the GMM-based set of emission probabilities to generate a merged set of emission probabilities for the plurality of HMMs; and

applying the merged set of emission probabilities to the plurality of HMMs to determine speech content corresponding to the sequence of temporal frames of the audio input signal, wherein the weighted sums are computed according to weights computationally determined by at least one processor during a training process that minimizes a computationally-determined difference between computationally-predicted speech in training temporal frames and predetermined speech in the training temporal frames.

View Dependent Claims (18, 19, 20, 21, 22)
23. A tangible, non-transitory computer-readable storage medium, having stored thereon program instructions that, upon execution by one or more processors of a system, cause the system to perform operations comprising:
transforming an audio input signal into a first sequence of feature vectors and a second sequence of feature vectors, wherein both the first and second sequences of feature vectors correspond in common to a sequence of temporal frames of the audio input signal, and wherein each respective feature vector of the first sequence and a corresponding respective feature vector of the second sequence bear quantitative measures of acoustic properties of a corresponding, respective temporal frame of the sequence of temporal frames of the audio input signal;

processing each respective feature vector of the first sequence with a neural network (NN) implemented by the system to determine, for each respective state of a multiplicity of states of hidden Markov models (HMMs) implemented by the system, a respective NN-based conditional probability of emitting the respective feature vector of the first sequence given the respective state;

processing each respective feature vector of the second sequence with a Gaussian mixture model (GMM) implemented by the system to determine, for each respective state of the multiplicity of states, a respective GMM-based conditional probability of emitting the respective feature vector of the second sequence given the respective state;

for each pair of a respective feature vector of the first sequence and a corresponding respective feature vector of the second sequence, determining, for each respective state of the multiplicity of states, a respective weighted sum of the respective NN-based conditional probability and the respective GMM-based conditional probability, each respective weighted sum being one of a set of weighted-sum emission probabilities for the multiplicity of states; and

computationally determining, by at least one processor, weights of the set of weighted-sum emission probabilities in order to reduce a difference computed by the at least one processor between (i) predicted speech content of the sequence of temporal frames computationally determined by the at least one processor by applying the set of weighted-sum emission probabilities to the multiplicity of states and (ii) pre-determined speech content of the sequence of temporal frames.