Multi-frame prediction for hybrid neural network/hidden Markov models
First Claim
1. A method comprising:
- transforming an audio input signal, using one or more processors of a system, into a first time sequence of feature vectors, each respective feature vector of the first time sequence bearing quantitative measures of acoustic properties of a corresponding, respective temporal frame of a first sequence of temporal frames of the audio input signal;
- providing, at a first time step, the first time sequence of feature vectors as input to a neural network (NN) implemented by the one or more processors of the system;
- by concurrently processing the feature vectors in the first time sequence with the NN during the first time step, concurrently determining emission probabilities corresponding to all the temporal frames of the first sequence of temporal frames, wherein concurrently determining the emission probabilities corresponding to all the temporal frames of the first sequence of temporal frames comprises:
  - during a common time interval, determining for each feature vector in the first time sequence a respective set of emission probabilities for a first plurality of hidden Markov models (HMMs); and
- concurrently applying the emission probabilities determined at the first time step for the feature vectors in the first time sequence to the first plurality of HMMs to determine speech content corresponding to the first sequence of temporal frames of the audio input signal.
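The multi-frame step recited above can be illustrated with a minimal sketch. It is not the patented implementation: the single linear layer, the per-frame softmax normalization, and every name (`multi_frame_emissions`, `NUM_FRAMES`, `FEATURE_DIM`, `NUM_STATES`, the random weights) are assumptions made for illustration only. The point it shows is that one forward pass over the whole sequence of feature vectors yields one set of emission probabilities per temporal frame.

```python
import math
import random


def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_frame_emissions(frames, weights, biases, num_states):
    """Run one forward pass over all frames at once.

    The feature vectors are concatenated into a single input, and the output
    layer is split into one block of `num_states` scores per frame, so every
    frame's emission distribution comes out of the same pass.
    """
    x = [value for frame in frames for value in frame]
    logits = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]
    return [
        softmax(logits[k * num_states:(k + 1) * num_states])
        for k in range(len(frames))
    ]


# Hypothetical sizes and randomly initialized (untrained) parameters.
NUM_FRAMES, FEATURE_DIM, NUM_STATES = 5, 3, 4
rng = random.Random(0)
frames = [
    [rng.uniform(-1.0, 1.0) for _ in range(FEATURE_DIM)]
    for _ in range(NUM_FRAMES)
]
weights = [
    [rng.gauss(0.0, 0.1) for _ in range(NUM_FRAMES * FEATURE_DIM)]
    for _ in range(NUM_FRAMES * NUM_STATES)
]
biases = [0.0] * (NUM_FRAMES * NUM_STATES)

emissions = multi_frame_emissions(frames, weights, biases, NUM_STATES)
for index, distribution in enumerate(emissions):
    print(index, [round(p, 3) for p in distribution])
```

A practical network would have hidden layers and trained weights; the sketch keeps only the input/output shape that distinguishes multi-frame prediction from frame-by-frame prediction.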
Abstract
A method and system for multi-frame prediction in a hybrid neural network/hidden Markov model automatic speech recognition (ASR) system are disclosed. An audio input signal may be transformed into a time sequence of feature vectors, each corresponding to a respective temporal frame of a sequence of periodic temporal frames of the audio input signal. The time sequence of feature vectors may be concurrently input to a neural network, which may process them concurrently. In particular, the neural network may concurrently determine for the time sequence of feature vectors a set of emission probabilities for a plurality of hidden Markov models of the ASR system, where the set of emission probabilities is associated with the temporal frames. The set of emission probabilities may then be concurrently applied to the hidden Markov models for determining speech content of the audio input signal.
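As a rough illustration of the final step in the abstract, the sketch below applies a block of per-frame emission probabilities to an HMM in a single decoding call. The abstract does not name a decoding algorithm; Viterbi decoding is assumed here as one common choice. The single three-state left-to-right HMM, the function name `viterbi`, and all probability values are hypothetical, whereas the source refers to a plurality of HMMs.

```python
import math


def safe_log(p):
    """Logarithm that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(emissions, transitions, initial):
    """Return the most likely state path for a block of frames.

    emissions[t][s]   : emission probability of state s at frame t
    transitions[p][s] : probability of moving from state p to state s
    initial[s]        : probability of starting in state s
    """
    num_states = len(initial)
    scores = [
        safe_log(initial[s]) + safe_log(emissions[0][s])
        for s in range(num_states)
    ]
    backpointers = []
    for frame in emissions[1:]:
        new_scores, pointers = [], []
        for s in range(num_states):
            # Best predecessor of state s, using scores from the previous frame.
            best_prev = max(
                range(num_states),
                key=lambda p: scores[p] + safe_log(transitions[p][s]),
            )
            new_scores.append(
                scores[best_prev]
                + safe_log(transitions[best_prev][s])
                + safe_log(frame[s])
            )
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(num_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical left-to-right HMM and emissions for four frames, standing in
# for the output of one multi-frame neural-network pass.
initial = [1.0, 0.0, 0.0]
transitions = [
    [0.6, 0.4, 0.0],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],
]
block_emissions = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7],
]
decoded_path = viterbi(block_emissions, transitions, initial)
print(decoded_path)
```

Because all four frames' emission probabilities are available together, the decoder consumes the whole block in one call rather than waiting for one neural-network evaluation per frame.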
20 Claims
1. A method comprising:
- transforming an audio input signal, using one or more processors of a system, into a first time sequence of feature vectors, each respective feature vector of the first time sequence bearing quantitative measures of acoustic properties of a corresponding, respective temporal frame of a first sequence of temporal frames of the audio input signal;
- providing, at a first time step, the first time sequence of feature vectors as input to a neural network (NN) implemented by the one or more processors of the system;
- by concurrently processing the feature vectors in the first time sequence with the NN during the first time step, concurrently determining emission probabilities corresponding to all the temporal frames of the first sequence of temporal frames, wherein concurrently determining the emission probabilities corresponding to all the temporal frames of the first sequence of temporal frames comprises:
  - during a common time interval, determining for each feature vector in the first time sequence a respective set of emission probabilities for a first plurality of hidden Markov models (HMMs); and
- concurrently applying the emission probabilities determined at the first time step for the feature vectors in the first time sequence to the first plurality of HMMs to determine speech content corresponding to the first sequence of temporal frames of the audio input signal.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system comprising:
- one or more processors;
- memory; and
- machine-readable instructions stored in the memory, that upon execution by the one or more processors cause the system to carry out operations comprising:
  - transforming an audio input signal into a first time sequence of feature vectors, wherein each respective feature vector of the first time sequence bears quantitative measures of acoustic properties of a corresponding, respective temporal frame of a first sequence of temporal frames of the audio input signal,
  - providing, at a first time step, the first time sequence of feature vectors as input to a neural network (NN) implemented by the system,
  - by concurrently processing the feature vectors in the first time sequence with the NN during the first time step, concurrently determining emission probabilities corresponding to all the temporal frames of the first sequence of temporal frames, wherein concurrently determining the emission probabilities corresponding to all the temporal frames of the first sequence of temporal frames comprises:
    - during a common time interval, determining for each feature vector in the first time sequence a respective set of emission probabilities for a first plurality of hidden Markov models (HMMs), and
  - concurrently applying the emission probabilities determined at the first time step for the feature vectors in the first time sequence to the first plurality of HMMs to determine speech content corresponding to the first sequence of temporal frames of the audio input signal.
Dependent claims: 12, 13, 14
15. An article of manufacture including a non-transitory, computer-readable storage medium, having stored thereon program instructions that, upon execution by one or more processors of a system, cause the system to perform operations comprising:
- transforming an audio input signal into a first time sequence of feature vectors, wherein each respective feature vector of the first time sequence bears quantitative measures of acoustic properties of a corresponding, respective temporal frame of a first sequence of temporal frames of the audio input signal;
- providing, at a first time step, the first time sequence of feature vectors as input to a neural network (NN) implemented by the system;
- by concurrently processing the feature vectors in the first time sequence with the NN during the first time step, concurrently determining emission probabilities corresponding to all the temporal frames of the first sequence of temporal frames, wherein concurrently determining the emission probabilities corresponding to all the temporal frames of the first sequence of temporal frames comprises:
  - during a common time interval, determining for each feature vector in the first time sequence a respective set of emission probabilities for a first plurality of hidden Markov models (HMMs); and
- concurrently applying the emission probabilities determined at the first time step for the feature vectors in the first time sequence to the first plurality of HMMs to determine speech content corresponding to the first sequence of temporal frames of the audio input signal.
Dependent claims: 16, 17, 18, 19, 20
Specification