Acoustic echo cancellation based sub band domain active speaker detection for audio and video conferencing applications

US 10,771,621 B2
Filed: 04/02/2018
Issued: 09/08/2020
Est. Priority Date: 10/31/2017
Status: Active Grant

Abstract

Systems, methods, and devices are disclosed for detecting an active speaker in a two-way conference. Real time audio in one or more sub band domains is analyzed according to an echo canceller model. Based on the analyzed real time audio, one or more audio metrics are determined from the output of an acoustic echo cancellation linear filter. The one or more audio metrics are weighted based on a priority, and a speaker status is determined by analyzing the weighted one or more audio metrics according to an active speaker detection model. For an active speaker status, one or more of residual echo or noise is removed from the real time audio based on the one or more audio metrics.
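
As an illustration of the linear filtering stage described in the abstract, the following is a minimal sketch of a normalized least-mean-squares (NLMS) adaptive filter applied to a single sub band. NLMS is an assumption chosen for illustration; the text here does not name the adaptation algorithm. The function names, tap count, and step size are hypothetical, and real-valued samples are used for brevity even though sub band signals are typically complex.

```python
from typing import List, Sequence, Tuple


def nlms_echo_cancel(
    far_end: Sequence[float],
    mic: Sequence[float],
    taps: int = 4,      # hypothetical filter length
    step: float = 0.5,  # hypothetical NLMS step size (0 < step < 2)
    eps: float = 1e-8,  # regularizer to avoid division by zero
) -> Tuple[List[float], List[float]]:
    """Run an NLMS adaptive filter over one sub band.

    Returns (error, weights): `error` is the mic signal with the
    estimated echo subtracted, `weights` is the learned echo path.
    """
    w = [0.0] * taps        # estimated echo-path coefficients
    history = [0.0] * taps  # most recent far-end samples, newest first
    error: List[float] = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_est = sum(wi * hi for wi, hi in zip(w, history))
        e = d - echo_est  # linear echo canceller output for this sample
        norm = sum(h * h for h in history) + eps
        w = [wi + step * e * hi / norm for wi, hi in zip(w, history)]
        error.append(e)
    return error, w


def echo_return_loss_enhancement(
    mic: Sequence[float], error: Sequence[float], eps: float = 1e-12
) -> float:
    """Ratio of mic power to residual power (higher means more echo removed)."""
    p_mic = sum(d * d for d in mic)
    p_err = sum(e * e for e in error)
    return p_mic / (p_err + eps)
```

The residual `error` and a ratio such as the one computed by `echo_return_loss_enhancement` are examples of quantities that could serve as audio metrics derived from the linear filter output; which metrics the patented method actually uses is not specified beyond the list in the dependent claims.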

245 Citations

19 Claims

1. A method for detecting an active speaker in at least a two-way conference comprising:
- analyzing real time audio in one or more sub band domains according to an echo canceller model, wherein the echo canceller model includes at least in part processing the real time audio using an acoustic echo cancellation linear filter;
- determining, based on the analyzed real time audio, one or more audio metrics;
- weighting, via a trained machine learning model, the one or more audio metrics based on importance of the one or more audio metrics for active speaker determination in the one or more sub band domains;
- summing the one or more weighted audio metrics;
- comparing the one or more summed weighted audio metrics and a hysteresis threshold;
- in response to the one or more summed weighted audio metrics being greater than the hysteresis threshold, determining a speaker status as active; and
- in response to the speaker status being active, removing one or more of residual echo or noise from the real time audio based on the weighted one or more audio metrics.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1, wherein the one or more of residual echo or noise from the real time audio is removed nonlinearly based on the one or more audio metrics.
  - 3. The method of claim 1, wherein the one or more audio metrics determines the speed of the hysteresis model.
  - 4. The method of claim 1, wherein one or more weights used in weighting the one or more audio metrics is generated dynamically from a machine learned model.
  - 5. The method of claim 1, wherein the one or more audio metrics include cross correlations among two or more of echo, microphone data, linear acoustic echo canceller output per sub band, background noise state, echo state, voice activity, and pitch state.
  - 6. The method of claim 1, wherein the echo canceller model cancels echo linearly without distorting speech.
  - 7. The method of claim 1, wherein the real time audio is captured by a microphone proximate to a participant in the two-way conference.
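
The weighting, summing, and threshold comparison recited in claim 1 can be sketched as follows. This is an illustrative assumption, not the patented implementation: the weights here are hand-set placeholders standing in for the output of a trained machine learning model, the metric names are hypothetical, and hysteresis is modeled with two thresholds (one to enter the active state, a lower one to leave it), which is one common way to realize a hysteresis threshold.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class HysteresisDetector:
    """Active-speaker decision with separate enter and exit thresholds."""

    enter_threshold: float = 0.6  # score must exceed this to become active
    exit_threshold: float = 0.4   # score must drop below this to go inactive
    active: bool = False

    def update(self, metrics: Dict[str, float], weights: Dict[str, float]) -> bool:
        # Weighted sum of the metrics; a metric with no weight contributes 0.
        score = sum(weights.get(name, 0.0) * value for name, value in metrics.items())
        if not self.active and score > self.enter_threshold:
            self.active = True
        elif self.active and score < self.exit_threshold:
            self.active = False
        return self.active


# Placeholder weights; in the claimed method these come from a trained model.
weights = {"voice_activity": 0.6, "pitch_state": 0.4}
detector = HysteresisDetector()
is_active = detector.update({"voice_activity": 1.0, "pitch_state": 0.5}, weights)
```

The gap between the two thresholds keeps the status from flickering when the score hovers near a single cut-off. Claim 3 indicates that the metrics also influence how quickly the hysteresis responds; that behavior is not modeled in this sketch.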

8. A system for detecting an active speaker in at least a two-way conference comprising:
- at least one receiver for receiving real time audio;
- a conference server communicatively coupled to the at least one receiver, wherein the conference server is configured to:
  - analyze the real time audio in one or more sub band domains according to an echo canceller model, wherein the echo canceller model includes at least in part processing the real time audio using an acoustic echo cancellation linear filter;
  - determine, based on the analyzed real time audio, one or more audio metrics;
  - weight, via a trained machine learning model, the one or more audio metrics based on importance of the one or more audio metrics for active speaker determination in the one or more sub band domains;
  - sum the one or more weighted audio metrics;
  - compare the one or more summed weighted audio metrics and a hysteresis threshold;
  - in response to the one or more summed weighted audio metrics being greater than the hysteresis threshold, determine a speaker status as active; and
  - in response to the speaker status being active, remove one or more of residual echo or noise from the real time audio based on the weighted one or more audio metrics; and
- a loudspeaker in communication with the conference server configured to output the real time audio.
- Dependent claims (9, 10, 11, 12, 13):
  - 9. The system of claim 8, wherein the one or more of residual echo or noise from the real time audio is removed nonlinearly based on the one or more audio metrics.
  - 10. The system of claim 8, wherein the one or more audio metrics determines the speed of the hysteresis model.
  - 11. The system of claim 8, wherein one or more weights used in weighting the one or more audio metrics is generated dynamically from a machine learned model.
  - 12. The system of claim 8, wherein the one or more audio metrics include cross correlations among two or more of echo, microphone data, linear acoustic echo canceller output per sub band, background noise state, echo state, voice activity, and pitch state.
  - 13. The system of claim 8, wherein the echo canceller model cancels echo linearly without distorting speech.
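
Claim 12 lists cross correlations among signals such as the echo, the microphone data, and the linear echo canceller output as possible audio metrics. A minimal sketch of one such metric, a zero-lag normalized cross-correlation over a frame of one sub band, is shown below. The function name and the choice of zero lag are assumptions for illustration; the claim does not specify how the correlations are computed.

```python
import math
from typing import Sequence


def normalized_cross_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Zero-lag normalized cross-correlation in [-1, 1]; 0.0 if either frame is silent."""
    if len(a) != len(b):
        raise ValueError("frames must have equal length")
    dot = sum(x * y for x, y in zip(a, b))
    energy = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / energy if energy > 0.0 else 0.0
```

As a rough guide, a value near 1 between the far-end reference and the microphone frame suggests the microphone is picking up mostly echo, while a low value alongside high microphone energy is more consistent with near-end speech. How such values are weighted for the final decision is left to the trained model in the claims.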

14. A non-transitory computer-readable medium containing instructions that, when executed by a computing system, cause the computing system to:
- analyze real time audio in one or more sub band domains according to an echo canceller model, wherein the echo canceller model includes at least in part processing the real time audio using an acoustic echo cancellation linear filter;
- determine, based on the analyzed real time audio, one or more audio metrics;
- weight, via a trained machine learning model, the one or more audio metrics based on importance of the one or more audio metrics for active speaker determination in the one or more sub band domains;
- sum the one or more weighted audio metrics;
- compare the one or more summed weighted audio metrics and a hysteresis threshold;
- in response to the one or more summed weighted audio metrics being greater than the hysteresis threshold, determine a speaker status is active; and
- in response to the speaker status being active, remove one or more of residual echo or noise from the real time audio based on the weighted one or more audio metrics.
- Dependent claims (15, 16, 17, 18, 19):
  - 15. The non-transitory computer-readable medium of claim 14, wherein the one or more of residual echo or noise from the real time audio is removed nonlinearly based on the one or more audio metrics.
  - 16. The non-transitory computer-readable medium of claim 14, wherein the one or more audio metrics determines the speed of the hysteresis model.
  - 17. The non-transitory computer-readable medium of claim 14, wherein one or more weights used in weighting the one or more audio metrics is generated dynamically from a machine learned model.
  - 18. The non-transitory computer-readable medium of claim 14, wherein the one or more audio metrics include cross correlations among two or more of echo, microphone data, linear acoustic echo canceller output per sub band, background noise state, echo state, voice activity, and pitch state.
  - 19. The non-transitory computer-readable medium of claim 14, wherein the echo canceller model cancels echo linearly without distorting speech.
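
Claims 2, 9, and 15 recite removing residual echo or noise nonlinearly based on the audio metrics. One way this could look, shown purely as an assumption, is a per-sub-band gain in the style of spectral subtraction with a gain floor. The function name, the residual estimate input, and the floor value are hypothetical; the claims do not describe the specific nonlinear rule.

```python
from typing import List, Sequence


def suppress_residual(
    aec_output: Sequence[float],         # per-sub-band magnitudes after the linear filter
    residual_estimate: Sequence[float],  # hypothetical per-sub-band residual estimate
    floor: float = 0.1,                  # hypothetical minimum gain
) -> List[float]:
    """Scale each sub-band magnitude by a gain that shrinks as the estimated residual grows."""
    out: List[float] = []
    for mag, res in zip(aec_output, residual_estimate):
        if mag <= 0.0:
            out.append(0.0)  # nothing to scale in an empty band
            continue
        gain = max(floor, 1.0 - res / mag)  # clamp so a band is never fully muted
        out.append(gain * mag)
    return out
```

The operation is nonlinear because the gain depends on the signal itself. The floor limits how hard any band is attenuated, which is a common way to reduce audible artifacts in suppressors of this kind.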

Current Assignee: Cisco Technology, Inc. (Cisco Systems, Inc.)
Original Assignee: Cisco Technology, Inc. (Cisco Systems, Inc.)
Inventors: Liu, Fuling; Chen, Eric; Li, Wei; Hsu, Wei-Lien
Primary Examiner(s): Shah, Antim G
Application Number: US15/943,336
Publication Number: US 20190132452A1
Time in Patent Office: 890 Days
Field of Search: None
US Class Current
CPC Class Codes

- G06N 3/048: Activation functions
- G06N 3/084: Backpropagation, e.g. using...
- G10L 2021/02082: the noise being echo, rever...
- G10L 21/0208: Noise filtering
- G10L 25/48: specially adapted for parti...
- H04M 3/002: Applications of echo suppre...
- H04M 3/568: audio processing specific t...
- H04M 9/082: using echo cancellers echo ...
