Audio playback settings for voice interaction

US 9,942,678 B1
Filed: 09/27/2016
Issued: 04/10/2018
Est. Priority Date: 09/27/2016
Status: Active Grant

Abstract

Example techniques relate to voice interaction in an environment with a media playback system that is playing back audio content. In an example implementation, while playing back first audio in a given environment at a given loudness: a playback device (a) detects that an event is anticipated in the given environment, the event involving playback of second audio and (b) determines a loudness of background noise in the given environment, the background noise comprising ambient noise in the given environment. The playback device ducks the first audio in proportion to a difference between the given loudness of the first audio and the determined loudness of the background noise and plays back the ducked first audio concurrently with the second audio.
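The proportional ducking described in the abstract can be illustrated with a minimal sketch. The source does not give a proportionality constant, a cap, or units; the `ratio`, `max_duck_db`, and decibel-based inputs below are assumptions chosen only for illustration, and the function names are hypothetical.

```python
def duck_attenuation_db(playback_db: float, noise_db: float,
                        ratio: float = 0.5, max_duck_db: float = 30.0) -> float:
    """Return a non-negative attenuation (dB) proportional to the gap
    between the first audio's loudness and the background noise loudness.

    ratio and max_duck_db are illustrative assumptions, not values
    taken from the patent.
    """
    # If the audio is already at or below the noise floor, do not duck.
    gap_db = max(playback_db - noise_db, 0.0)
    return min(ratio * gap_db, max_duck_db)


def apply_duck(samples: list[float], attenuation_db: float) -> list[float]:
    """Scale linear PCM samples down by attenuation_db decibels."""
    scale = 10.0 ** (-attenuation_db / 20.0)
    return [s * scale for s in samples]
```

With these assumed defaults, first audio at 70 dB over a 40 dB noise floor would be ducked by 15 dB, while audio already below the noise floor would be left unchanged.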

205 Citations

20 Claims

1. A playback device comprising:
- a network interface;
  
  one or more microphones;
  
  an audio stage comprising an amplifier;
  
  one or more speakers;
  
  one or more processors;
  
  a housing, the housing carrying at least the network interface, the one or more microphones, the audio stage, the one or more speakers, the one or more processors, and a computer-readable media having stored therein instructions executable by the one or more processors to cause the playback device to perform operations comprising:
  
  while playing back first audio in a given environment at a given loudness via the audio stage and the one or more speakers:
  
  (a) capturing, via the one or more microphones, a voice input;
  
  (b) determining that the captured voice input includes audio data representing a wake word to invoke a voice assistant service;
  
  (c) in response to determining that the captured voice input includes audio data representing the wake word to invoke the voice assistant service:
  
  (i) sending, via the network interface to one or more servers of the voice assistant service, the voice input and (ii) determining a loudness of background noise in the given environment, wherein the background noise comprises ambient noise in the given environment;
  
  (d) after determining the loudness of background noise, receiving, via the network interface from the one or more servers of the voice assistant service in response to the voice input, second audio data representing a spoken response to the voice input;
  
  in response to receiving the second audio data representing the spoken response to the voice input, ducking the first audio in proportion to a difference between the given loudness of the first audio and the determined loudness of the background noise; and
  
  playing back the ducked first audio concurrently with the second audio representing the spoken response to the voice input via the audio stage and the one or more speakers.
  - 2. The playback device of claim 1, wherein playing back the ducked first audio concurrently with the second audio comprises:
    - determining that a difference between a loudness of the ducked first audio and a given dynamic range is (a) less than the determined loudness of the background noise or (b) greater than the determined loudness of the background noise;
      
      when the determined difference between the loudness of the ducked first audio and the given dynamic range is less than the determined loudness of the background noise, compressing the first audio to a dynamic range that is louder than the determined loudness of the background noise and playing back the compressed first audio; and
      
      when the determined difference between the loudness of the ducked first audio and the given dynamic range is greater than the determined loudness of the background noise, playing back the first audio without compression.
  - 3. The playback device of claim 1, wherein playing back the ducked first audio concurrently with the second audio comprises:
    - detecting that a signal-to-noise ratio within the given environment is below a voice input threshold; and
      
      responsively, filtering the first audio, wherein filtering the first audio comprises cutting the first audio in a frequency range corresponding to human speech.
  - 4. The playback device of claim 3, wherein filtering the first audio further comprises boosting the first audio outside the frequency range corresponding to human speech.
  - 5. The playback device of claim 1, wherein the playback device is a first playback device of a group of playback devices that includes one or more second playback devices, and wherein playing back the ducked first audio concurrently with the second audio comprises playing back the ducked first audio concurrently with the second audio in synchrony with the one or more second playback devices.
  - 6. The playback device of claim 1, wherein determining the loudness of background noise in the given environment comprises measuring the loudness of the background noise in the given environment via one or more microphones.
  - 7. The playback device of claim 6, wherein at least one of the one or more microphones is housed in a networked microphone device that is distinct from the playback device, and wherein measuring the loudness of the background noise in the given environment via one or more microphones comprises causing the networked microphone device to measure the loudness of the background noise in the given environment.
  - 8. The playback device of claim 6, wherein measuring the loudness of the background noise in the given environment comprises offsetting the first audio being played back by the playback device from the measurement of the background noise in the given environment.
  - 9. The playback device of claim 1, wherein the operations further comprise detecting that the spoken response to the voice input has been played back, and wherein ducking the first audio comprises ducking the first audio until the spoken response to the voice input has been played back.
  - 10. The playback device of claim 1, wherein the operations further comprise detecting that the spoken response to the voice input has been played back, and wherein ducking the first audio comprises ducking the first audio for a pre-determined period of time after the spoken response to the voice input has been played back.
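Claim 8 refers to offsetting the first audio from the background-noise measurement. One way this could work is sketched below, under the assumption (not stated in the source) that the playback signal and the ambient noise are uncorrelated, so their acoustic powers add at the microphone. The function names, the `floor_db` parameter, and the idea that the device knows its own playback level at the microphone are all hypothetical.

```python
import math


def db_to_power(level_db: float) -> float:
    """Convert a level in dB to a relative linear power."""
    return 10.0 ** (level_db / 10.0)


def background_noise_db(measured_db: float, playback_at_mic_db: float,
                        floor_db: float = 0.0) -> float:
    """Estimate ambient noise by removing the device's own playback
    contribution from the total level measured at the microphone.

    Assumes the two sources are uncorrelated, so powers (not dB values)
    are subtracted. floor_db is an illustrative lower bound.
    """
    residual = db_to_power(measured_db) - db_to_power(playback_at_mic_db)
    if residual <= 0.0:
        # Playback accounts for everything measured; report the floor.
        return floor_db
    return max(10.0 * math.log10(residual), floor_db)
```

For example, if the microphone measures about 63 dB while the device's own playback accounts for 60 dB at the microphone, this estimate puts the ambient noise near 60 dB, because two equal uncorrelated sources sum to roughly 3 dB above either one.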

11. A tangible, non-transitory computer-readable medium having stored therein instructions executable by one or more processors to cause a playback device to perform a method, the playback device comprising a housing carrying at least a network interface, one or more microphones, an audio stage, one or more speakers, and the one or more processors, and the method comprising:
- while playing back first audio in a given environment at a given loudness via the audio stage and the one or more speakers:
  
  (a) capturing, via the one or more microphones, a voice input;
  
  (b) determining that the captured voice input includes audio data representing a wake word to invoke a voice assistant service;
  
  (c) in response to determining that the captured voice input includes audio data representing the wake word to invoke the voice assistant service:
  
  (i) sending, via the network interface to one or more servers of the voice assistant service, the voice input and (ii) determining a loudness of background noise in the given environment, wherein the background noise comprises ambient noise in the given environment;
  
  (d) after determining the loudness of background noise, receiving, via the network interface from the one or more servers of the voice assistant service in response to the voice input, second audio data representing a spoken response to the voice input;
  
  in response to receiving the second audio data representing the spoken response to the voice input, ducking the first audio in proportion to a difference between the given loudness of the first audio and the determined loudness of the background noise; and
  
  playing back the ducked first audio concurrently with the second audio representing the spoken response to the voice input via the audio stage and the one or more speakers.
  - 12. The tangible, non-transitory computer-readable medium of claim 11, wherein playing back the ducked first audio concurrently with the second audio comprises:
    - determining that a difference between a loudness of the ducked first audio and a given dynamic range is (a) less than the determined loudness of the background noise or (b) greater than the determined loudness of the background noise;
      
      when the determined difference between the loudness of the ducked first audio and the given dynamic range is less than the determined loudness of the background noise, compressing the first audio to a dynamic range that is louder than the determined loudness of the background noise and playing back the compressed first audio; and
      
      when the determined difference between the loudness of the ducked first audio and the given dynamic range is greater than the determined loudness of the background noise, playing back the first audio without compression.
  - 13. The tangible, non-transitory computer-readable medium of claim 11, wherein playing back the ducked first audio concurrently with the second audio comprises:
    - detecting that a signal-to-noise ratio within the given environment is below a voice input threshold; and
      
      in response to detecting that the signal-to-noise ratio within the given environment is below the voice input threshold, filtering the first audio, wherein filtering the first audio comprises cutting the first audio in a frequency range corresponding to human speech.
  - 14. The tangible, non-transitory computer-readable medium of claim 11, wherein playing back the ducked first audio concurrently with the second audio comprises:
    - detecting that a signal-to-noise ratio within the given environment is below a voice input threshold; and
      
      responsively, filtering the first audio, wherein filtering the first audio comprises cutting the first audio in a frequency range corresponding to human speech.
  - 15. The tangible, non-transitory computer-readable medium of claim 11, wherein filtering the first audio further comprises boosting the first audio outside the frequency range corresponding to human speech.
  - 16. The tangible, non-transitory computer-readable medium of claim 11, wherein the operations further comprise detecting that the spoken response to the voice input has been played back, and wherein ducking the first audio comprises ducking the first audio until the spoken response to the voice input has been played back.
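The compression test recited in claim 12 (and in claims 2 and 18) can be sketched as follows. The reading taken here is that the ducked loudness minus the dynamic range approximates the level of the quietest passages; if that falls below the noise floor, the range is narrowed so those passages stay audible. The `margin_db` parameter and the function names are assumptions for illustration, and the claims do not address the case where the two levels are exactly equal (treated here as "no compression").

```python
def needs_compression(ducked_db: float, dynamic_range_db: float,
                      noise_db: float) -> bool:
    """True when the quietest passages of the ducked audio would fall
    below the background noise loudness."""
    return (ducked_db - dynamic_range_db) < noise_db


def target_dynamic_range_db(ducked_db: float, dynamic_range_db: float,
                            noise_db: float, margin_db: float = 3.0) -> float:
    """Return the dynamic range to play back with.

    If compression is needed, shrink the range so the quietest passages
    sit margin_db above the noise floor (margin_db is an assumption).
    Otherwise return the original range unchanged.
    """
    if not needs_compression(ducked_db, dynamic_range_db, noise_db):
        return dynamic_range_db
    return max(ducked_db - (noise_db + margin_db), 0.0)
```

With ducked audio at 55 dB, a 20 dB dynamic range, and 40 dB of noise, the quietest passages would sit at 35 dB, so the range would be narrowed to 12 dB under the assumed 3 dB margin.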

17. A method to be performed by a playback device comprising a housing carrying at least a network interface, one or more microphones, an audio stage, one or more speakers, the method comprising:
- while playing back first audio in a given environment at a given loudness via the audio stage and the one or more speakers, the playback device:
  
  (a) capturing, via the one or more microphones, a voice input;
  
  (b) determining that the captured voice input includes audio data representing a wake word to invoke a voice assistant service;
  
  (c) in response to determining that the captured voice input includes audio data representing the wake word to invoke the voice assistant service:
  
  (i) sending, via the network interface to one or more servers of the voice assistant service, the voice input and (ii) determining a loudness of background noise in the given environment, wherein the background noise comprises ambient noise in the given environment;
  
  (d) after determining the loudness of background noise, receiving, via the network interface from the one or more servers of the voice assistant service in response to the voice input, second audio data representing a spoken response to the voice input;
  
  in response to receiving the second audio data representing the spoken response to the voice input, the playback device ducking the first audio in proportion to a difference between the given loudness of the first audio and the determined loudness of the background noise; and
  
  the playback device playing back the ducked first audio concurrently with the second audio representing the spoken response to the voice input via the audio stage and the one or more speakers.
  - 18. The method of claim 17, wherein playing back the ducked first audio concurrently with the second audio comprises:
    - determining that a difference between a loudness of the ducked first audio and a given dynamic range is (a) less than the determined loudness of the background noise or (b) greater than the determined loudness of the background noise;
      
      when the determined difference between the loudness of the ducked first audio and the given dynamic range is less than the determined loudness of the background noise, compressing the first audio to a dynamic range that is louder than the determined loudness of the background noise and playing back the compressed first audio; and
      
      when the determined difference between the loudness of the ducked first audio and the given dynamic range is greater than the determined loudness of the background noise, playing back the first audio without compression.
  - 19. The method of claim 17, wherein playing back the ducked first audio concurrently with the second audio comprises:
    - detecting that a signal-to-noise ratio within the given environment is below a voice input threshold; and
      
      responsively, filtering the first audio, wherein filtering the first audio comprises cutting the first audio in a frequency range corresponding to human speech.
  - 20. The method of claim 17, wherein filtering the first audio further comprises boosting the first audio outside the frequency range corresponding to human speech.
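The speech-band filtering of claims 19 and 20 can be pictured as a per-frequency gain curve that is applied only when the signal-to-noise ratio drops below a threshold. The claims do not specify the band edges, the threshold, or the amount of cut or boost; the 300 to 3400 Hz band and all default values below are assumptions, and `band_gain_db` is a hypothetical helper rather than anything named in the patent.

```python
# Assumed speech band (roughly telephone bandwidth); not from the patent.
SPEECH_BAND_HZ = (300.0, 3400.0)


def band_gain_db(freq_hz: float, snr_db: float,
                 snr_threshold_db: float = 10.0,
                 cut_db: float = 6.0, boost_db: float = 0.0) -> float:
    """Gain (dB) to apply to the first audio at freq_hz.

    When the SNR is at or above the threshold, the audio is left alone.
    Below it, frequencies inside the speech band are cut by cut_db and
    frequencies outside it are boosted by boost_db (0 means no boost).
    """
    if snr_db >= snr_threshold_db:
        return 0.0
    low_hz, high_hz = SPEECH_BAND_HZ
    if low_hz <= freq_hz <= high_hz:
        return -cut_db
    return boost_db
```

A real implementation would more likely use smooth filter slopes than this hard-edged curve; the step function is kept only to make the cut-inside, boost-outside logic easy to follow.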

Current Assignee
Sonos, Inc.
Original Assignee
Sonos, Inc.
Inventors
Klaus Hartung; Romi Kadri
Primary Examiner(s)
NGUYEN, QUYNH H

Application Number

US15/277,810
Publication Number

US 20180091913A1
Time in Patent Office

560 Days
Field of Search

381/110
US Class Current
CPC Class Codes

G06F 3/165   Management of the audio str...

G10L 15/22   Procedures used during a sp...

G10L 2015/088   Word spotting

G10L 25/84   for discriminating voice fr...

H03G 3/32   the control being dependent...

H03G 3/342   Muting when some special ch...

H04R 2227/003   Digital PA systems using, e...

H04R 2227/005   Audio distribution systems ...

H04R 2420/07   Applications of wireless lo...

H04R 2430/01   Aspects of volume control, ...

H04R 27/00   Public address systems circ...

H04R 29/007   for public address systems ...
