Processing multi-channel audio waveforms

US 9,697,826 B2
Filed: 07/08/2016
Issued: 07/04/2017
Est. Priority Date: 03/27/2015
Status: Active Grant

Abstract

Methods, including computer programs encoded on a computer storage medium, for enhancing the processing of audio waveforms for speech recognition using various neural network processing techniques. In one aspect, a method includes: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined.

186 Citations

20 Claims

1. A system comprising:
- one or more computers and one or more data storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving multiple channels of audio data corresponding to an utterance;
  
  convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model;
  
  combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data;
  
  inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and
  
  providing a transcription for the utterance that is determined based at least on output that the deep neural network provides in response to receiving the combined convolution outputs.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
- - 2. The system of claim 1, wherein the multiple channels of audio data are multiple channels of audio waveform data for the utterance, wherein the multiple channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.
  - 3. The system of claim 1, wherein the deep neural network is a deep neural network comprising a convolutional layer, one or more long-short term memory (LSTM) layers, and multiple hidden layers.
  - 4. The system of claim 1, wherein the convolutional layer of the deep neural network is configured to perform a frequency domain convolution.
  - 5. The system of claim 3, wherein the deep neural network is configured such that output of the convolutional layer is input to at least one of the one or more LSTM layers, and output of the one or more LSTM layers is input to at least one of the multiple hidden layers.
  - 6. The system of claim 1, wherein combining the convolution outputs comprises:
    - summing, for each of the multiple filters, the convolution outputs obtained for different channels using the filter to generate summed outputs corresponding to different time periods; and
      
      pooling, for each of the multiple filters, the summed outputs across the different time periods to generate a set of pooled values for the filter.
  - 7. The system of claim 6, wherein pooling the summed outputs across the different time periods comprises max pooling the summed outputs across the different time periods to identify maximum values among the summed outputs for the different time periods.
  - 8. The system of claim 6, wherein combining the convolution outputs comprises applying a rectified non-linearity to the sets of pooled values for each of the multiple filters to obtain rectified values;
    - wherein inputting the combined convolution outputs to the deep neural network comprises inputting the rectified values to the deep neural network.
  - 9. The system of claim 8, wherein the rectified non-linearity comprises a logarithm compression.
  - 10. The system of claim 1, wherein the filters are configured to perform both spatial and spectral filtering.
  - 11. The system of claim 1, wherein the training process that jointly trains the multiple filters and trains the deep neural network as an acoustic model comprises training the multiple filters and the deep neural network using a single module of an automated speech recognizer.
  - 12. The system of claim 1, wherein the training process that jointly trains the multiple filters and trains the deep neural network as an acoustic model is performed using training data that includes audio data from a plurality of different microphone spacing configurations.
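
The front end recited in claims 1 and 6 to 9 (time-domain convolution per channel, summation across channels, max pooling across time, then a rectified non-linearity with logarithm compression) can be sketched in NumPy as follows. This is an illustrative sketch, not the patented implementation. The array layout `(P, C, N)` for the filter taps, pooling over the whole frame, the function name, and the small offset `eps` inside the logarithm are all assumptions made for the example, and the filter values would in practice come from joint training rather than being set by hand.

```python
import numpy as np

def multichannel_filterbank_features(frame, filters, eps=1e-2):
    """Compute one feature per filter from a multi-channel waveform frame.

    frame:   (C, T) raw waveform samples, one row per channel.
    filters: (P, C, N) filter taps (assumed layout): P filters, each with
             one length-N impulse response per channel.
    eps:     assumed small offset that keeps the logarithm finite.
    """
    num_filters, num_channels, _ = filters.shape
    assert frame.shape[0] == num_channels, "channel count mismatch"
    features = np.empty(num_filters)
    for p in range(num_filters):
        # Time-domain convolution of filter p with each channel.
        per_channel = [
            np.convolve(frame[c], filters[p, c], mode="valid")
            for c in range(num_channels)
        ]
        # Sum the convolution outputs across channels (claim 6).
        summed = np.sum(per_channel, axis=0)
        # Max pool the summed outputs across time (claim 7).
        pooled = summed.max()
        # Rectify, then apply logarithm compression (claims 8 and 9).
        features[p] = np.log(max(pooled, 0.0) + eps)
    return features

# Tiny hand-checkable example: 2 channels, 1 filter, 2 taps per channel.
frame = np.array([[1.0, 2.0, 3.0],
                  [1.0, 1.0, 1.0]])
filters = np.array([[[1.0, 0.0],
                     [0.0, 1.0]]])
features = multichannel_filterbank_features(frame, filters)
```

The resulting vector has one entry per filter and is what would be passed on to the acoustic model as the combined convolution outputs.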

13. A computer-implemented method comprising:
- receiving multiple channels of audio data corresponding to an utterance;
  
  convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model;
  
  combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data;
  
  inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and
  
  providing a transcription for the utterance that is determined based at least on output that the deep neural network provides in response to receiving the combined convolution outputs.
- View Dependent Claims (14, 15, 16, 17)
- - 14. The method of claim 13, wherein the multiple channels of audio data are multiple channels of audio waveform data for the utterance, wherein the multiple channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.
  - 15. The method of claim 13, wherein the deep neural network is a deep neural network comprising a convolutional layer, one or more long-short term memory (LSTM) layers, and multiple hidden layers.
  - 16. The method of claim 13, wherein the convolutional layer of the deep neural network is configured to perform a frequency domain convolution.
  - 17. The method of claim 15, wherein the deep neural network is configured such that output of the convolutional layer is input to at least one of the one or more LSTM layers, and output of the one or more LSTM layers is input to at least one of the multiple hidden layers.
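
The layer ordering in claims 15 to 17 (a convolutional layer feeding LSTM layers, which feed multiple hidden layers) can be sketched as a forward pass. This is a minimal sketch under several assumptions: a single LSTM layer, two fully connected hidden layers, a convolution along the feature (frequency) axis as one reading of the claimed frequency domain convolution, a softmax output, and random untrained weights. All function names, parameter names, and layer sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def freq_conv(x, kernel):
    """Convolve one frame along its frequency axis, then rectify."""
    return np.maximum(np.convolve(x, kernel, mode="valid"), 0.0)

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c = f * c + i * g
    return o * np.tanh(c), c

def init_params(num_features, kernel_size, lstm_units, hidden_units,
                num_outputs, seed=0):
    """Random (untrained) weights with hypothetical sizes."""
    rng = np.random.default_rng(seed)
    conv_dim = num_features - kernel_size + 1
    return {
        "kernel": rng.normal(size=kernel_size),
        "W": 0.1 * rng.normal(size=(4 * lstm_units, conv_dim)),
        "U": 0.1 * rng.normal(size=(4 * lstm_units, lstm_units)),
        "b": np.zeros(4 * lstm_units),
        "dense": [
            (0.1 * rng.normal(size=(hidden_units, lstm_units)),
             np.zeros(hidden_units)),
            (0.1 * rng.normal(size=(hidden_units, hidden_units)),
             np.zeros(hidden_units)),
        ],
        "W_out": 0.1 * rng.normal(size=(num_outputs, hidden_units)),
        "b_out": np.zeros(num_outputs),
    }

def cldnn_forward(frames, params):
    """frames: (T, F) feature sequence. Returns (T, num_outputs) posteriors."""
    H = params["U"].shape[1]
    h, c = np.zeros(H), np.zeros(H)
    outputs = []
    for x in frames:
        # Convolutional layer output is the LSTM input.
        conv_out = freq_conv(x, params["kernel"])
        h, c = lstm_step(conv_out, h, c, params["W"], params["U"], params["b"])
        # LSTM output is the input to the fully connected hidden layers.
        hidden = h
        for W_d, b_d in params["dense"]:
            hidden = np.maximum(W_d @ hidden + b_d, 0.0)
        logits = params["W_out"] @ hidden + params["b_out"]
        e = np.exp(logits - logits.max())
        outputs.append(e / e.sum())
    return np.array(outputs)

params = init_params(num_features=8, kernel_size=3, lstm_units=4,
                     hidden_units=5, num_outputs=6)
frames = np.random.default_rng(1).normal(size=(7, 8))
posteriors = cldnn_forward(frames, params)
```

Each row of `posteriors` is a distribution over the assumed output classes for one frame. In a jointly trained system the input frames would be the combined convolution outputs of the learned time-domain filters rather than random values.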

18. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
- receiving multiple channels of audio data corresponding to an utterance;
  
  convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model;
  
  combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data;
  
  inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and
  
  providing a transcription for the utterance that is determined based at least on output that the deep neural network provides in response to receiving the combined convolution outputs.
- View Dependent Claims (19, 20)
- - 19. The non-transitory computer-readable medium of claim 18, wherein the multiple channels of audio data are multiple channels of audio waveform data for the utterance, wherein the multiple channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.
  - 20. The non-transitory computer-readable medium of claim 18, wherein the deep neural network is a deep neural network comprising a convolutional layer, one or more long-short term memory (LSTM) layers, and multiple hidden layers.
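
Claims 2, 14, and 19 describe channels recorded by microphones that are spaced apart, and claim 10 describes filters that perform spatial as well as spectral filtering. The sketch below illustrates why per-channel time-domain filters can act spatially. It uses classical delay-and-sum steering with hand-set impulse taps, which is a stand-in chosen for illustration and not the learned filters of the patent. The two-microphone setup, the 3-sample inter-microphone delay, and the white-noise source are assumptions.

```python
import numpy as np

def filter_and_sum(channels, taps):
    """Convolve each channel with its own taps in the time domain and sum.

    channels: (C, T) waveforms; taps: (C, N) per-channel impulse responses.
    """
    return np.sum(
        [np.convolve(x, h, mode="full") for x, h in zip(channels, taps)],
        axis=0,
    )

rng = np.random.default_rng(0)
source = rng.normal(size=400)   # stand-in for an utterance waveform
delay = 3                       # assumed inter-microphone delay in samples

# The second microphone hears the same signal `delay` samples later.
mic0 = source
mic1 = np.concatenate([np.zeros(delay), source[:-delay]])
channels = np.stack([mic0, mic1])

num_taps = 8
# Steered filter: delay mic0 by `delay` samples so both channels line up.
steered = np.zeros((2, num_taps))
steered[0, delay] = 1.0
steered[1, 0] = 1.0
# Unsteered filter: pass both channels through unchanged (misaligned sum).
unsteered = np.zeros((2, num_taps))
unsteered[0, 0] = 1.0
unsteered[1, 0] = 1.0

energy_steered = np.sum(filter_and_sum(channels, steered) ** 2)
energy_unsteered = np.sum(filter_and_sum(channels, unsteered) ** 2)
```

When the per-channel taps compensate for the arrival-time difference, the channels add coherently and the output energy is roughly twice that of the misaligned sum. A bank of such filters with different relative delays therefore responds differently to sources in different directions.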

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Weiss, Ron J., Senior, Andrew W., Sainath, Tara N., Wilson, Kevin William, Narayanan, Arun, Hoshen, Yedid, Bacchiani, Michiel A. U.
Primary Examiner(s): ALBERTALLI, BRIAN LOUIS
Application Number: US15/205,321
Publication Number: US 20160322055A1
Time in Patent Office: 361 Days

CPC Class Codes

G06N 3/044   Recurrent networks, e.g. Ho...

G06N 3/045   Combinations of networks

G10L 15/02   Feature extraction for spee...

G10L 15/063   Training

G10L 15/16   using artificial neural net...

G10L 2021/02166   Microphone arrays; Beamforming

H04R 3/005   for combining the signals o...
