Multi-frame prediction for hybrid neural network/hidden Markov models

US 8,442,821 B1
Filed: 07/27/2012
Issued: 05/14/2013
Est. Priority Date: 07/27/2012
Status: Active Grant

Abstract

A method and system for multi-frame prediction in a hybrid neural network/hidden Markov model automatic speech recognition (ASR) system is disclosed. An audio input signal may be transformed into a time sequence of feature vectors, each corresponding to a respective temporal frame of a sequence of periodic temporal frames of the audio input signal. The time sequence of feature vectors may be concurrently input to a neural network, which may process them concurrently. In particular, the neural network may concurrently determine for the time sequence of feature vectors a set of emission probabilities for a plurality of hidden Markov models of the ASR system, where the set of emission probabilities are associated with the temporal frames. The set of emission probabilities may then be concurrently applied to the hidden Markov models for determining speech content of the audio input signal.
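
As a rough illustration of the idea in the abstract, the sketch below feeds a whole window of feature vectors to a small network in one forward pass and reads out one probability distribution over HMM states per frame. It is a minimal sketch under stated assumptions, not the patented implementation: the single ReLU hidden layer, the per-frame softmax, the random weights, and all names (`multi_frame_forward`, `w_hidden`, `w_out`, and the sizes `K`, `D`, `H`, `S`) are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_frame_forward(features, w_hidden, w_out, num_frames, num_states):
    """Map a window of feature vectors to one state distribution per frame.

    features: array of shape (num_frames, feat_dim)
    returns:  array of shape (num_frames, num_states); each row sums to 1
    """
    x = features.reshape(-1)               # concatenate the window into one input
    h = np.maximum(0.0, w_hidden @ x)      # one hidden layer with ReLU (assumed)
    logits = (w_out @ h).reshape(num_frames, num_states)
    return softmax(logits, axis=1)         # one softmax per frame

# Hypothetical sizes: K frames per step, D features, H hidden units, S states.
rng = np.random.default_rng(0)
K, D, H, S = 4, 13, 32, 6
w_hidden = rng.normal(scale=0.1, size=(H, K * D))
w_out = rng.normal(scale=0.1, size=(K * S, H))
window = rng.normal(size=(K, D))
probs = multi_frame_forward(window, w_hidden, w_out, K, S)
```

The point of the sketch is only the shape of the computation: one network evaluation yields `K` sets of probabilities rather than one.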

20 Claims

1. A method comprising:
- transforming an audio input signal, using one or more processors of a system, into a first time sequence of feature vectors, each respective feature vector of the first time sequence bearing quantitative measures of acoustic properties of a corresponding, respective temporal frame of a first sequence of temporal frames of the audio input signal;
  
  providing, at a first time step, the first time sequence of feature vectors as input to a neural network (NN) implemented by the one or more processors of the system;
  
  by concurrently processing the feature vectors in the first time sequence with the NN during the first time step, concurrently determining emission probabilities corresponding to all the temporal frames of the first sequence of temporal frames, wherein concurrently determining the emission probabilities corresponding to all the temporal frames of the first sequence of temporal frames comprises:
  
  during a common time interval, determining for each feature vector in the first time sequence a respective set of emission probabilities for a first plurality of hidden Markov models (HMMs); and
  
  concurrently applying the emission probabilities determined at the first time step for the feature vectors in the first time sequence to the first plurality of HMMs to determine speech content corresponding to the first sequence of temporal frames of the audio input signal.
- - 2. The method of claim 1, further comprising:
    - providing, at a second time step, a second time sequence of feature vectors as input to the NN, the second time sequence corresponding to a second sequence of temporal frames of the audio input signal, wherein the second time step follows the first time step by a multiple number of temporal frame periods;
      
      processing the feature vectors in the second time sequence concurrently by the NN, wherein processing the feature vectors in the second time sequence concurrently by the NN comprises determining, at the second time step, for each feature vector in the second time sequence a respective set of emission probabilities for a second plurality of HMMs; and
      
      applying the emission probabilities determined at the second time step for the feature vectors in the second time sequence to the second plurality of HMMs to determine speech content corresponding to the second sequence of temporal frames of the audio input signal.
  - 3. The method of claim 2, wherein the first plurality of HMMs and the second plurality of HMMs share at least one HMM in common.
  - 4. The method of claim 1, wherein the NN and the first plurality of HMMs are implemented by at least one common processor from among the one or more processors of the system.
  - 5. The method of claim 1, wherein the quantitative measures of acoustic properties are at least one of Mel Filter Cepstral coefficients, Perceptual Linear Predictive coefficients, Relative Spectral coefficients, and Filterbank log-energy coefficients.
  - 6. The method of claim 1, wherein providing the first time sequence of feature vectors as input to the NN comprises:
    - providing concurrently with the first time sequence of feature vectors at least one feature vector corresponding to a temporal frame that temporally precedes the first sequence of temporal frames; and
      
      providing concurrently with the first time sequence of feature vectors at least one feature vector corresponding to a temporal frame that temporally follows the first sequence of temporal frames.
  - 7. The method of claim 1, wherein the first plurality of HMMs collectively comprise a multiplicity of states, and wherein determining for each feature vector in the first time sequence a respective set of emission probabilities for the first plurality of HMMs comprises:
    - for each respective feature vector in the first time sequence, determining, at the first time step, for each respective state of the multiplicity of states a respective conditional probability of emitting the respective feature vector given the respective state.
  - 8. The method of claim 7, wherein each of the HMMs in the first plurality is associated with a respective elemental speech unit, and has one or more states corresponding to one or more temporal phases of the associated, respective elemental speech unit, wherein the multiplicity of states comprises a collection of the one or more states of each of the HMMs in the first plurality, and wherein determining speech content corresponding to the first sequence of temporal frames of the audio input signal comprises determining a probable sequence of elemental speech units based on a most likely sequence of states of the multiplicity.
  - 9. The method of claim 8, wherein each elemental speech unit in the probable sequence of elemental speech units is a phoneme, triphone, or quinphone.
  - 10. The method of claim 1, wherein determining speech content corresponding to the first sequence of temporal frames of the audio input signal comprises at least one of generating a text string of the speech content and identifying a computer-executable command based on the speech content.
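
Claims 7 to 9 describe applying per-state emission probabilities to the HMMs and reading off a most likely sequence of states, which is then mapped to elemental speech units. One common way to find such a state sequence is the Viterbi algorithm; the claims do not name it, so the sketch below is an assumed technique, and the three-state left-to-right model, the probabilities, and the unit labels in `state_unit` are all made up for illustration.

```python
import numpy as np
from itertools import groupby

def viterbi(log_emit, log_trans, log_init):
    """Most likely state path.

    log_emit:  (T, S) log emission probability of frame t under state s
    log_trans: (S, S) log transition probability, indexed [from, to]
    log_init:  (S,)   log initial state probability
    """
    T, S = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans   # cand[i, j]: best path in i, then i -> j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):           # follow back-pointers to the start
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical 3-state left-to-right model and 4 frames of emission probabilities.
emit = np.array([[0.8, 0.1, 0.1],
                 [0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.1, 0.8]])
trans = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0]])
init = np.array([1.0, 0.0, 0.0])

with np.errstate(divide="ignore"):          # log(0) = -inf is intended here
    path = viterbi(np.log(emit), np.log(trans), np.log(init))

state_unit = ["k", "ae", "t"]               # made-up state-to-unit labels
units = [u for u, _ in groupby(state_unit[s] for s in path)]
print(path, units)                          # [0, 0, 1, 2] ['k', 'ae', 't']
```

In the multi-frame setting, the rows of `log_emit` for several consecutive frames would all come from a single network evaluation instead of one evaluation per frame.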

11. A system comprising:
- one or more processors;
  
  memory; and
  
  machine-readable instructions stored in the memory, that upon execution by the one or more processors cause the system to carry out operations comprising:
  
  transforming an audio input signal into a first time sequence of feature vectors, wherein each respective feature vector of the first time sequence bears quantitative measures of acoustic properties of a corresponding, respective temporal frame of a first sequence of temporal frames of the audio input signal,
  
  providing, at a first time step, the first time sequence of feature vectors as input to a neural network (NN) implemented by the system,
  
  by concurrently processing the feature vectors in the first time sequence with the NN during the first time step, concurrently determining emission probabilities corresponding to all the temporal frames of the first sequence of temporal frames, wherein concurrently determining the emission probabilities corresponding to all the temporal frames of the first sequence of temporal frames comprises:
  
  during a common time interval, determining for each feature vector in the first time sequence a respective set of emission probabilities for a first plurality of hidden Markov models (HMMs), and
  
  concurrently applying the emission probabilities determined at the first time step for the feature vectors in the first time sequence to the first plurality of HMMs to determine speech content corresponding to the first sequence of temporal frames of the audio input signal.
- - 12. The system of claim 11, wherein the operations further comprise:
    - providing, at a second time step, a second time sequence of feature vectors as input to the NN, wherein the second time sequence corresponds to a second sequence of temporal frames of the audio input signal, and wherein the second time step follows the first time step by a multiple number of temporal frame periods,
      
      processing the feature vectors in the second time sequence concurrently by the NN, wherein processing the feature vectors in the second time sequence concurrently by the NN comprises determining, at the second time step, for each feature vector in the second time sequence a respective set of emission probabilities for a second plurality of HMMs, and
      
      applying the emission probabilities determined at the second time step for the feature vectors in the second time sequence to the second plurality of HMMs to determine speech content corresponding to the second sequence of temporal frames of the audio input signal.
  - 13. The system of claim 11, wherein the quantitative measures of acoustic properties are at least one of Mel Filter Cepstral coefficients, Perceptual Linear Predictive coefficients, Relative Spectral coefficients, and Filterbank log-energy coefficients.
  - 14. The system of claim 11, wherein each of the HMMs in the first plurality is associated with a respective elemental speech unit and has one or more states corresponding to one or more temporal phases of the associated, respective elemental speech unit, and wherein determining the speech content corresponding the first sequence of temporal frames of the audio input signal comprises determining a probable sequence of elemental speech units based on a most likely sequence of states from among the one or more states of each of the HMMs in the first plurality.
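
The transforming step, with the filterbank log-energy coefficients listed in claim 13, can be illustrated with a simplified feature extractor. This is a sketch under assumptions and not the claimed front end: the 25 ms frame length, 10 ms hop, Hamming window, and equal-width (not mel-spaced) frequency bands are illustrative choices, and `log_filterbank_features` is a hypothetical name.

```python
import numpy as np

def log_filterbank_features(signal, sample_rate, frame_ms=25.0, hop_ms=10.0,
                            num_bands=8):
    """Return one log band-energy feature vector per temporal frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    num_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    feats = np.empty((num_frames, num_bands))
    for i in range(num_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2       # power spectrum of the frame
        bands = np.array_split(power, num_bands)      # equal-width bands (assumed)
        feats[i] = np.log(np.array([b.sum() for b in bands]) + 1e-10)
    return feats

# One second of a 440 Hz tone sampled at 16 kHz as a stand-in audio input.
sr = 16000
t = np.arange(sr) / sr
feats = log_filterbank_features(np.sin(2 * np.pi * 440.0 * t), sr)
```

Each row of `feats` plays the role of a feature vector for one temporal frame; for the test tone, the lowest band carries the most energy in every frame.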

15. An article of manufacture including a non-transitory, computer-readable storage medium, having stored thereon program instructions that, upon execution by one or more processors of a system, cause the system to perform operations comprising:
- transforming an audio input signal into a first time sequence of feature vectors, wherein each respective feature vector of the first time sequence bears quantitative measures of acoustic properties of a corresponding, respective temporal frame of a first sequence of temporal frames of the audio input signal;
  
  providing, at a first time step, the first time sequence of feature vectors as input to a neural network (NN) implemented by the system;
  
  by concurrently processing the feature vectors in the first time sequence with the NN during the first time step, concurrently determining emission probabilities corresponding to all the temporal frames of the first sequence of temporal frames, wherein concurrently determining the emission probabilities corresponding to all the temporal frames of the first sequence of temporal frames comprises:
  
  during a common time interval, determining for each feature vector in the first time sequence a respective set of emission probabilities for a first plurality of hidden Markov models (HMMs); and
  
  concurrently applying the emission probabilities determined at the first time step for the feature vectors in the first time sequence to the first plurality of HMMs to determine speech content corresponding to the first sequence of temporal frames of the audio input signal.
- - 16. The article of manufacture of claim 15, wherein the operations further comprise:
    - providing, at a second time step, a second time sequence of feature vectors as input to the NN, wherein the second time sequence corresponds to a second sequence of temporal frames of the audio input signal, and wherein the second time step follows the first time step by a multiple number of temporal frame periods;
      
      processing the feature vectors in the second time sequence concurrently by the NN, wherein processing the feature vectors in the second time sequence concurrently by the NN comprises determining, at the second time step, for each feature vector in the second time sequence a respective set of emission probabilities for a second plurality of HMMs; and
      
      applying the emission probabilities determined at the second time step for the feature vectors in the second time sequence to the second plurality of HMMs to determine speech content of the second sequence of temporal frames of the audio input signal.
  - 17. The article of manufacture of claim 15, wherein providing the first time sequence of feature vectors at the input of the NN comprises:
    - providing concurrently with the first time sequence of feature vectors at least one feature vector corresponding to a temporal frame that temporally precedes the first sequence of temporal frames; and
      
      providing concurrently with the first time sequence of feature vectors at least one feature vector corresponding to a temporal frame that temporally follows the first sequence of temporal frames.
  - 18. The article of manufacture of claim 15, wherein each of the HMMs in the first plurality is associated with a respective elemental speech unit and has one or more states corresponding to one or more temporal phases of the associated, respective elemental speech unit, and wherein determining the speech content of the first sequence of temporal frames of the audio input signal comprises determining a probable sequence of elemental speech units based on a most likely sequence of states from among the one or more states of each of the HMMs in the first plurality.
  - 19. The article of manufacture of claim 18, wherein each elemental speech unit in the probable sequence of elemental speech units is a phoneme, triphone, or quinphone.
  - 20. The article of manufacture of claim 15, wherein determining speech content corresponding to the first sequence of temporal frames of the audio input signal comprises at least one of generating a text string of the speech content and identifying a computer-executable command based on the speech content.
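
Claims 16 and 17 together describe stepping through the audio several frame periods at a time while also supplying the network with frames that precede and follow the frames being predicted. A minimal sketch of that windowing is shown below. The function name `multi_frame_windows`, the edge padding at the signal boundaries, and dropping a trailing partial group of frames are assumptions made for illustration.

```python
import numpy as np

def multi_frame_windows(features, num_predict, left_ctx, right_ctx):
    """Yield (start, window) pairs, advancing num_predict frames per step.

    Each window holds left_ctx preceding frames, the num_predict frames to be
    predicted, and right_ctx following frames.
    """
    total = len(features)
    # Repeat the first and last frame so edge windows have full context (assumed).
    padded = np.pad(features, ((left_ctx, right_ctx), (0, 0)), mode="edge")
    span = left_ctx + num_predict + right_ctx
    for start in range(0, total - num_predict + 1, num_predict):
        yield start, padded[start : start + span]

# Ten one-dimensional "feature vectors" whose value is their frame index.
features = np.arange(10, dtype=float)[:, None]
steps = list(multi_frame_windows(features, num_predict=4, left_ctx=2, right_ctx=1))
starts = [s for s, _ in steps]
second_window = steps[1][1][:, 0].tolist()
print(starts)          # [0, 4]
print(second_window)   # [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
```

Here the second step starts four frame periods after the first, and its window spans frames 2 to 8: two preceding context frames, the four frames being predicted, and one following context frame.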

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Vanhoucke, Vincent
Primary Examiner(s): AZAD, ABUL K
Application Number: US13/560,706
Time in Patent Office: 291 Days
Field of Search: 704/232, 704/256-256.8
US Class Current: 704/232
CPC Class Codes:
G10L 15/14   using statistical models, e...
G10L 15/142   Hidden Markov Models [HMMs]
G10L 15/16   using artificial neural net...
