Multichannel acoustic echo cancellation

US 9,659,555 B1
Filed: 02/09/2016
Issued: 05/23/2017
Est. Priority Date: 02/09/2016
Status: Active Grant

Abstract

An echo cancellation system performs audio beamforming to separate audio input into multiple directions (e.g., target signals) and generates multiple audio outputs using two acoustic echo cancellation (AEC) circuits. A first AEC removes a playback reference signal (generated from a signal sent to a loudspeaker) to isolate speech included in the target signals. A second AEC removes an adaptive reference signal (generated from microphone inputs corresponding to audio received from the loudspeaker) to isolate speech included in the target signals. A beam selector receives the multiple audio outputs and selects the first AEC or the second AEC based on the linearity of the system. When the system is linear (e.g., no distortion or variable delay between microphone input and playback signal), the beam selector selects an output from the first AEC based on signal-to-noise ratios (SNRs). When the system is nonlinear, the beam selector selects an output from the second AEC.
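
The selection logic described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `BeamOutput` type, the `select_output` function, and the boolean `system_is_linear` flag are hypothetical names introduced here, and how linearity is detected is left outside the sketch. Ranking the second AEC's outputs by SNR in the nonlinear case is also an assumption, since the abstract states only that an output from the second AEC is selected.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class BeamOutput:
    """One echo-cancelled output (hypothetical container, not from the patent)."""
    direction: int            # index of the beamformer direction
    snr_db: float             # estimated signal-to-noise ratio of this output
    samples: Sequence[float]  # the echo-cancelled audio itself


def select_output(
    playback_aec_outputs: List[BeamOutput],
    adaptive_aec_outputs: List[BeamOutput],
    system_is_linear: bool,
) -> BeamOutput:
    """Pick one output, following the linear/nonlinear split in the abstract."""
    # Linear system: trust the AEC that uses the playback reference signal.
    # Nonlinear system: fall back to the AEC that uses the adaptive reference.
    candidates = playback_aec_outputs if system_is_linear else adaptive_aec_outputs
    # Assumption: the highest-SNR candidate wins in both branches.
    return max(candidates, key=lambda output: output.snr_db)
```

With two directions per AEC, a caller would pass both lists and the linearity flag, then hand the `samples` of the returned output to speech recognition.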

20 Claims

1. A computer-implemented method for cancelling an echo from an audio signal to isolate received speech, the method comprising:
- sending first playback audio data to a first wireless speaker;
  
  receiving first input audio data from a first microphone of a microphone array, the first input audio data including a first representation of audible sound output by the first wireless speaker and speech input;
  
  receiving second input audio data from a second microphone of the microphone array, the second input audio data including a second representation of the audible sound output by the first wireless speaker and the speech input;
  
  determining a first portion of combined input audio data, the combined input audio data comprising at least the first input audio data and the second input audio data, the first portion of the combined input audio data comprising a first portion of the first input audio data corresponding to a first direction and a first portion of the second input audio data corresponding to the first direction;
  
  determining a second portion of the combined input audio data, the second portion of the combined input audio data comprising a second portion of the first input audio data corresponding to a second direction and a second portion of the second input audio data corresponding to the second direction;
  
  selecting at least the first portion of the combined input audio data as a first target signal on which to perform echo cancellation;
  
  generating a first reference signal using the first playback audio data;
  
  removing the first reference signal from the first target signal to generate a first output audio signal that includes the speech input;
  
  selecting at least the first portion of the combined input audio data as a second target signal on which to perform echo cancellation;
  
  generating a second reference signal using the second portion of the combined input audio data;
  
  removing the second reference signal from the second target signal to generate a second output audio signal that includes the speech input;
  
  performing speech recognition processing on one of the first output audio signal or the second output audio signal to determine a command; and
  
  executing the command.
- Dependent claims (2, 3, 4):
  - 2. The computer-implemented method of claim 1, further comprising:
    - determining a propagation delay of the combined input audio data relative to the first playback audio data;
      
      generating second playback audio data by delaying the first playback audio data by the propagation delay;
      
      determining a frequency offset between the second playback audio data and the combined input audio data;
      
      generating the first reference signal using the second playback audio data and one of:
      
      removing at least one sample of the second playback audio data per cycle to compensate for the frequency offset, and
      
      adding a duplicate copy of at least one sample of the second playback audio data to the second playback audio data to compensate for the frequency offset.
  - 3. The computer-implemented method of claim 1, further comprising:
    - determining a first signal to noise ratio associated with the first output audio signal;
      
      determining a second signal to noise ratio associated with the second output audio signal;
      
      determining that the first signal to noise ratio is larger than the second signal to noise ratio; and
      
      performing speech recognition processing on the first output audio signal to determine the command.
  - 4. The computer-implemented method of claim 1, further comprising:
    - determining, using a fixed beamforming technique, the first and the second portions of the combined input audio data;
      
      determining that a first amplitude associated with the first portion of the combined input audio data is below a threshold;
      
      determining that a second amplitude associated with the second portion of the combined input audio data is above the threshold;
      
      determining, using an adaptive beamforming technique, a third portion of the combined input audio data, the third portion of the combined input audio data comprising a third portion of the first input audio data corresponding to the second direction and a third portion of the second input audio data corresponding to the second direction; and
      
      generating the second reference signal using the third portion of the combined input audio data.
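
The delay and frequency-offset steps of claim 2 can be illustrated with a short sketch. It assumes the propagation delay and the offset have already been measured. The function names, the parts-per-million representation of the offset, and the reading of "per cycle" as a fixed run of samples are all assumptions made for illustration, not details taken from the claims.

```python
from typing import List, Sequence


def delay_playback(playback: Sequence[float], delay_samples: int) -> List[float]:
    """Delay the playback data by `delay_samples`, keeping the original length."""
    if delay_samples <= 0:
        return list(playback)
    padded = [0.0] * delay_samples + list(playback)
    return padded[: len(playback)]


def cycle_length_from_ppm(offset_ppm: float) -> int:
    """Assumed mapping: an offset of N parts per million drifts by one sample
    roughly every 1e6 / N samples."""
    if offset_ppm == 0:
        raise ValueError("no frequency offset to compensate")
    return max(1, round(1_000_000 / abs(offset_ppm)))


def compensate_offset(samples: Sequence[float], cycle: int, drop: bool) -> List[float]:
    """Once per `cycle` samples, either remove one sample or duplicate one."""
    if cycle < 1:
        raise ValueError("cycle must be at least 1")
    out: List[float] = []
    for index, sample in enumerate(samples, start=1):
        end_of_cycle = index % cycle == 0
        if end_of_cycle and drop:
            continue            # remove one sample in this cycle
        out.append(sample)
        if end_of_cycle and not drop:
            out.append(sample)  # add a duplicate copy of one sample
    return out
```

Under these assumptions, a reference signal would be built by calling `delay_playback` first and then `compensate_offset`, dropping samples when the playback clock runs fast relative to the microphones and duplicating them when it runs slow.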

5. A computer-implemented method, comprising:
- sending first playback audio data to a first wireless speaker;
  
  receiving first input audio data from a first microphone of a microphone array, the first input audio data including a first representation of sound output by the first wireless speaker and speech input;
  
  receiving second input audio data from a second microphone of the microphone array, the second input audio data including a second representation of the audible sound output by the first wireless speaker and the speech input;
  
  determining a first portion of combined input audio data, the combined input audio data comprising at least the first input audio data and the second input audio data, the first portion of the combined input audio data comprising a first portion of the first input audio data corresponding to a first direction and a first portion of the second input audio data corresponding to the first direction;
  
  determining a second portion of the combined input audio data, the second portion of the combined input audio data comprising a second portion of the first input audio data corresponding to a second direction and a second portion of the second input audio data corresponding to the second direction;
  
  selecting at least the first portion of the combined input audio data as a first target signal on which to perform echo cancellation;
  
  generating a first reference signal using the first playback audio data;
  
  removing the first reference signal from the first target signal to generate first output audio data that includes the speech input;
  
  selecting at least the first portion of the combined input audio data as a second target signal;
  
  generating a second reference signal using the second portion of the combined input audio data;
  
  removing the second reference signal from the second target signal to generate second output audio data that includes the speech input; and
  
  selecting one of the first output audio data or the second output audio data.
- Dependent claims (6, 7, 8, 9, 10, 11, 12):
  - 6. The computer-implemented method of claim 5, further comprising:
    - determining a first signal to noise ratio associated with the first output audio data;
      
      determining a second signal to noise ratio associated with the second output audio data;
      
      determining that the first signal to noise ratio is larger than the second signal to noise ratio;
      
      performing speech recognition processing on the first output audio data to determine a command; and
      
      executing the command.
  - 7. The computer-implemented method of claim 5, further comprising:
    - determining a propagation delay of the combined input audio data relative to the first playback audio data;
      
      generating second playback audio data by delaying the first playback audio data by the propagation delay;
      
      determining a frequency offset between the second playback audio data and the combined input audio data;
      
      generating the first reference signal using the second playback audio data and one of:
      
      removing at least one sample of the second playback audio data per cycle based on the frequency offset, and
      
      adding a duplicate copy of at least one sample of the second playback audio data to the second playback audio data based on the frequency offset;
      
      performing speech recognition processing on the first output audio data to determine a command; and
      
      executing the command.
  - 8. The computer-implemented method of claim 5, further comprising:
    - determining, using a fixed beamforming technique, the first and the second portions of the combined input audio data;
      
      determining that an amplitude associated with the second portion of the combined input audio data is above a threshold;
      
      determining that a highest amplitude associated with remaining portions of a plurality of portions of the combined input audio data is below the threshold;
      
      determining, using an adaptive beamforming technique, a third portion of the combined input audio data, the third portion of the combined input audio data comprising a third portion of the first input audio data corresponding to the second direction and a third portion of the second input audio data corresponding to the second direction; and
      
      generating the second reference signal using the third portion of the combined input audio data.
  - 9. The computer-implemented method of claim 5, further comprising:
    - determining that the speech input is associated with the first direction;
      
      selecting at least the first portion of the combined input audio data as the second target signal;
      
      determining that the second direction is opposite the first direction; and
      
      generating the second reference signal using the second portion of the combined input audio data.
  - 10. The computer-implemented method of claim 5, further comprising:
    - determining that the second portion of the combined input audio data corresponds to a highest amplitude of a plurality of portions of the combined input audio data;
      
      determining that an amplitude of the second portion of the combined input audio data is below a threshold;
      
      selecting the first portion of the combined input audio data as the second target signal;
      
      determining that the second direction is opposite the first direction;
      
      generating the second reference signal based on the second portion of the combined input audio data;
      
      removing the second reference signal from the second target signal to generate the second output audio data that includes the speech input;
      
      selecting the second portion of the combined input audio data as a third target signal;
      
      generating a third reference signal based on the first portion of the combined input audio data; and
      
      removing the third reference signal from the third target signal to generate third output audio data that includes the speech input.
  - 11. The computer-implemented method of claim 5, further comprising:
    - generating first reference data based on the first playback audio data, the first reference data having frequencies below a first cutoff frequency;
      
      generating second reference data based on the second portion of the combined input audio data, the second reference data having frequencies above the first cutoff frequency; and
      
      generating the second reference signal by combining the first reference data and the second reference data.
  - 12. The computer-implemented method of claim 5, further comprising:
    - generating third output audio data based on the first output audio data, the third output audio data having frequencies below a first cutoff frequency;
      
      generating fourth output audio data based on the second output audio data, the fourth output audio data having frequencies above the first cutoff frequency; and
      
      generating combined output audio data by combining the third output audio data and the fourth output audio data.
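
Claims 11 and 12 both combine a low-frequency band from one signal with a high-frequency band from another, split at a cutoff frequency. The sketch below shows one way this could look. The one-pole filter, the function names, and the sample-rate parameter are assumptions chosen for brevity; the claims do not specify a filter design, and a practical system would likely use a steeper crossover.

```python
import math
from typing import List, Sequence


def one_pole_lowpass(samples: Sequence[float], cutoff_hz: float,
                     sample_rate_hz: float) -> List[float]:
    """Gentle first-order low-pass filter (illustrative, not from the patent)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    state = 0.0
    out: List[float] = []
    for x in samples:
        state += alpha * (x - state)
        out.append(state)
    return out


def crossover_combine(low_source: Sequence[float], high_source: Sequence[float],
                      cutoff_hz: float, sample_rate_hz: float) -> List[float]:
    """Take content below the cutoff from `low_source` and above it from `high_source`."""
    low_band = one_pole_lowpass(low_source, cutoff_hz, sample_rate_hz)
    # Complementary high-pass: the signal minus its own low-passed version.
    high_lowpassed = one_pole_lowpass(high_source, cutoff_hz, sample_rate_hz)
    high_band = [x - lp for x, lp in zip(high_source, high_lowpassed)]
    return [lo + hi for lo, hi in zip(low_band, high_band)]
```

For claim 11, `low_source` would be data derived from the playback audio and `high_source` data derived from the second beam; for claim 12, the two sources would be the first and second output audio data. Because the two bands are complementary, passing the same signal as both sources returns that signal unchanged.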

13. A device, comprising:
- at least one processor;
  
  a memory device including instructions operable to be executed by the at least one processor to configure the device to:
  
  send first playback audio data to a first wireless speaker;
  
  receive first input audio data from a first microphone of a microphone array, the first input audio data including a first representation of sound output by the first wireless speaker and speech input;
  
  receive second input audio data from a second microphone of the microphone array, the second input audio data including a second representation of the audible sound output by the first wireless speaker and the speech input;
  
  determine a first portion of combined input audio data, the combined input audio data comprising at least the first input audio data and the second input audio data, the first portion of the combined input audio data comprising a first portion of the first input audio data corresponding to a first direction and a first portion of the second input audio data corresponding to the first direction;
  
  determine a second portion of the combined input audio data, the second portion of the combined input audio data comprising a second portion of the first input audio data corresponding to a second direction and a second portion of the second input audio data corresponding to the second direction;
  
  select at least the first portion of the combined input audio data as a first target signal on which to perform echo cancellation;
  
  generate a first reference signal using the first playback audio data;
  
  remove the first reference signal from the first target signal to generate first output audio data that includes the speech input;
  
  select at least the first portion of the combined input audio data as a second target signal;
  
  generate a second reference signal using the second portion of the combined input audio data;
  
  remove the second reference signal from the second target signal to generate second output audio data that includes the speech input; and
  
  select one of the first output audio data or the second output audio data.
- Dependent claims (14, 15, 16, 17, 18, 19, 20):
  - 14. The device of claim 13, wherein the instructions further configure the device to:
    - determine a first signal to noise ratio associated with the first output audio data;
      
      determine a second signal to noise ratio associated with the second output audio data;
      
      determine that the first signal to noise ratio is larger than the second signal to noise ratio;
      
      perform speech recognition processing on the first output audio data to determine a command; and
      
      execute the command.
  - 15. The device of claim 13, wherein the instructions further configure the device to:
    - determine a propagation delay of the combined input audio data relative to the first playback audio data;
      
      generate second playback audio data by delaying the first playback audio data by the propagation delay;
      
      determine a frequency offset between the second playback audio data and the combined input audio data;
      
      generate the first reference signal using the second playback audio data and one of:
      
      removing at least one sample of the second playback audio data per cycle based on the frequency offset, and
      
      adding a duplicate copy of at least one sample of the second playback audio data to the second playback audio data based on the frequency offset;
      
      perform speech recognition processing on the first output audio data to determine a command; and
      
      execute the command.
  - 16. The device of claim 13, wherein the instructions further configure the device to:
    - determine, using a fixed beamforming technique, the first and the second portions of the combined input audio data;
      
      determine that an amplitude associated with the second portion of the combined input audio data is above a threshold;
      
      determine that a highest amplitude associated with remaining portions of a plurality of portions of the combined input audio data is below the threshold;
      
      determine, using an adaptive beamforming technique, a third portion of the combined input audio data, the third portion of the combined input audio data comprising a third portion of the first input audio data corresponding to the second direction and a third portion of the second input audio data corresponding to the second direction; and
      
      generate the second reference signal using the third portion of the combined input audio data.
  - 17. The device of claim 13, wherein the instructions further configure the device to:
    - determine that the speech input is associated with the first direction;
      
      select at least the first portion of the combined input audio data as the second target signal;
      
      determine that the second direction is opposite the first direction; and
      
      generate the second reference signal using the second portion of the combined input audio data.
  - 18. The device of claim 13, wherein the instructions further configure the device to:
    - determine that the second portion of the combined input audio data corresponds to a highest amplitude of a plurality of portions of the combined input audio data;
      
      determine that an amplitude of the second portion of the combined input audio data is below a threshold;
      
      select the first portion of the combined input audio data as the second target signal;
      
      determine that the second direction is opposite the first direction;
      
      generate the second reference signal based on the second portion of the combined input audio data;
      
      remove the second reference signal from the second target signal to generate the second output audio data that includes the speech input;
      
      select the second portion of the combined input audio data as a third target signal;
      
      generate a third reference signal based on the first portion of the combined input audio data; and
      
      remove the third reference signal from the third target signal to generate third output audio data that includes the speech input.
  - 19. The device of claim 13, wherein the instructions further configure the device to:
    - generate first reference data based on the first playback audio data, the first reference data having frequencies below a first cutoff frequency;
      
      generate second reference data based on the second portion of the combined input audio data, the second reference data having frequencies above the first cutoff frequency; and
      
      generate the second reference signal by combining the first reference data and the second reference data.
  - 20. The device of claim 13, wherein the instructions further configure the device to:
    - generate third output audio data based on the first output audio data, the third output audio data having frequencies below a first cutoff frequency;
      
      generate fourth output audio data based on the second output audio data, the fourth output audio data having frequencies above the first cutoff frequency; and
      
      generate combined output audio data by combining the third output audio data and the fourth output audio data.
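
The independent claims recite "removing" a reference signal from a target signal without naming an algorithm. A normalized least-mean-squares (NLMS) adaptive filter is a common way to do this and is used here purely as a stand-in, so the function name, tap count, and step size below are assumptions rather than details from the patent. The same routine could serve either AEC: the reference would be derived from the playback audio data in one case and from the second beam in the other.

```python
from typing import List, Sequence


def nlms_cancel(target: Sequence[float], reference: Sequence[float],
                taps: int = 8, step: float = 0.5, eps: float = 1e-8) -> List[float]:
    """Subtract an adaptively filtered copy of `reference` from `target`.

    Returns the residual, which ideally keeps the speech and loses the echo.
    """
    weights = [0.0] * taps   # current estimate of the echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    output: List[float] = []
    for desired, ref_sample in zip(target, reference):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = desired - echo_estimate
        # Normalizing by reference energy keeps the step size scale-independent.
        energy = sum(h * h for h in history) + eps
        weights = [w + step * error * h / energy for w, h in zip(weights, history)]
        output.append(error)
    return output
```

When the target contains only echo that the filter can model, the residual decays toward zero as the weights converge. Any speech present in the target is uncorrelated with the reference and so largely remains in the output.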

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Hilmes, Philip Ryan; Ayrapetian, Robert
Primary Examiner(s): Bernardi, Brenda
Application Number: US15/019,129
Time in Patent Office: 469 Days
CPC Class Codes

G10K 11/002   Devices for damping, suppre...

G10K 2210/3028   Filtering, e.g. Kalman filt...

G10K 2210/3044   Phase shift, e.g. complex e...

G10K 2210/3046   Multiple acoustic inputs, m...

G10K 2210/505   Echo cancellation, e.g. mul...

G10K 2210/509   Hybrid, i.e. combining diff...

G10L 2015/223   Execution procedure of a sp...

G10L 2021/02082   the noise being echo, rever...

G10L 2021/02166   Microphone arrays; Beamforming

G10L 21/0272   Voice signal separating

H04M 9/082   using echo cancellers echo ...

H04R 2203/12   Beamforming aspects for ste...

H04R 2430/25   Array processing for suppre...

H04R 3/005   for combining the signals o...

H04R 3/02   for preventing acoustic rea...
