Identifying and suppressing interfering audio content

US 10,325,591 B1
Filed: 09/05/2014
Issued: 06/18/2019
Est. Priority Date: 09/05/2014
Status: Active Grant

Abstract

A speech interface device may capture user speech for analysis by automatic speech recognition (ASR) and natural language understanding (NLU) components. However, an audio signal representing the user speech may also contain interfering sound generated by a media player that is playing audio content such as music. Before performing ASR and NLU, a system attempts to identify the content being played by the media player, such as by querying the media player or by analyzing the audio signal. The system then obtains the same content from an available source and subtracts the audio represented by the content from the audio signal.
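
The subtraction step in the abstract can be illustrated with a small sketch. It is not the patented implementation. It assumes the reference content is already available as samples at the microphone's sample rate, and that the playback reaches the microphone with only a pure delay and unit gain. The helper names `best_lag` and `subtract_reference` are hypothetical, and the data is synthetic.

```python
import math
import random


def best_lag(mic, ref, max_lag):
    """Return the delay (in samples) at which ref correlates best with mic."""
    best, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        n = min(len(mic) - lag, len(ref))
        score = sum(mic[lag + i] * ref[i] for i in range(n))
        if score > best_score:
            best, best_score = lag, score
    return best


def subtract_reference(mic, ref, lag, gain=1.0):
    """Subtract the delayed, scaled reference from the microphone signal."""
    out = list(mic)
    for i, r in enumerate(ref):
        j = i + lag
        if j < len(out):
            out[j] -= gain * r
    return out


# Synthetic demo: quiet "speech" plus louder "music" delayed by 7 samples.
rng = random.Random(0)
ref = [rng.gauss(0.0, 1.0) for _ in range(200)]         # reference content
speech = [0.1 * math.sin(0.3 * n) for n in range(220)]  # stand-in for speech
mic = list(speech)
for i, r in enumerate(ref):
    mic[i + 7] += r                                     # assumed acoustic delay

lag = best_lag(mic, ref, max_lag=20)
cleaned = subtract_reference(mic, ref, lag)
```

A real room adds reverberation and frequency-dependent gain, so a fixed delay and gain would rarely be enough. Claims 2 and 19 address this with an adaptive filter.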

19 Claims

1. A speech-based system, comprising:
- one or more microphones configured to produce:
  
  a first input audio signal containing user speech and an interfering sound from a media content item played by a media player, the media player and the user in proximity to the speech-based system and the user speech including at least one spoken command for the speech-based system; and
  
  a second input audio signal containing the user speech and the interfering sound from the media content item played by the media player;
  
  one or more processors;
  
  non-transitory computer-readable storage media maintaining instructions executable by the one or more processors to perform operations comprising:
  
  selecting the first input audio signal as a first directional audio signal corresponding to a direction of a source of the user speech;
  
  selecting the second input audio signal as a second directional audio signal corresponding to a direction other than the direction of the source of the user speech based at least in part on a directional audio signal corresponding in direction to a known position of the media player;
  
  analyzing the second input audio signal to determine at least one characteristic of content of the second input audio signal;
  
  requesting an identity of a player content item being currently played by the media player and a temporal point within the player content item that is currently being output by the media player;
  
  generating an audio signature representative of the interfering sound based at least in part on the at least one characteristic of the content of the second input audio signal;
  
  identifying a plurality of media content items that are currently accessible to the media player;
  
  selecting a particular media content item of the plurality of media content items based at least in part on the audio signature, the identity of the player content item, the temporal point, and a reference audio signature that corresponds to the particular media content item;
  
  receiving at least a portion of the particular media content item that corresponds to the interfering sound from a reference content source; and
  
  processing the first input audio signal to suppress the interfering sound based at least in part on the at least the portion of the particular media content item by subtracting the portion of the particular media content item from the first input audio in order to obtain an interference-suppressed speech; and
  
  sending the interference-suppressed speech to a remote service for performing automatic speech recognition and natural language understanding on the interference-suppressed speech in order to determine an intent to perform or initiate functions or services expressed by the spoken command.
- - 2. The speech-based system of claim 1, the operations further comprising:
    - causing an adaptive filter to produce an interference signal that estimates the interfering sound in the first input audio signal based at least in part on the media content item; and
      
      subtracting the interference signal from the first input audio signal to produce the interference-suppressed audio signal.
  - 3. The speech-based system of claim 1, further comprising a sensor configured to detect the direction of the source of the user speech.
  - 4. The system of claim 1, wherein the audio signature is a spectrogram that represents frequency intensities of the content of the second audio signal over time.
  - 5. The system of claim 1, wherein the audio signature is a spectrogram calculated over a portion of the content of the second audio signal.
  - 6. The system of claim 5, further comprising identifying the particular media content item by determining a first portion of the spectrogram of the audio signature that corresponds to a second portion of a reference spectrogram of the particular media content item.
  - 7. The system of claim 1, wherein the audio signature is a feature vector.
  - 8. The system of claim 1, wherein the audio signature includes one or more features representing direction of energy changes within frequency bands of the content of the second audio input over time.
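
One way to read the signature of claim 8 is as a bit matrix that records, for each frequency band, whether the band's energy rose or fell from one frame to the next. The sketch below is an assumption-laden illustration, not the method the patent discloses. The frame length, the band count, the equal-width band layout, and the function names `band_energies` and `energy_change_signature` are all hypothetical choices. A naive DFT is used so the example needs only the standard library.

```python
import cmath
import math


def band_energies(frame, n_bands):
    """Sum squared DFT magnitudes of one frame into n_bands equal-width bands."""
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(s) ** 2)
    width = half // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]


def energy_change_signature(samples, frame_len=64, n_bands=4):
    """Return one row per frame transition: 1 if a band's energy rose, else 0."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    energies = [band_energies(samples[i:i + frame_len], n_bands) for i in starts]
    return [
        [1 if cur[b] > prev[b] else 0 for b in range(n_bands)]
        for prev, cur in zip(energies, energies[1:])
    ]


# Synthetic demo: a low tone in the first frame, a high tone in the second.
low = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
high = [math.sin(2 * math.pi * 28 * t / 64) for t in range(64)]
signature = energy_change_signature(low + high)
```

Because only the direction of each change is kept, a signature of this kind is largely insensitive to playback volume. That suits a microphone capture whose level differs from the reference copy.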

9. A method being performed at a speech interface device in communication with a media player and a reference content source, the method comprising:
- receiving an input audio signal from a user in proximity of the speech interface device, wherein the input audio signal comprises a first input audio signal and a second input audio signal, the first input audio signal having a higher presence of user speech than the second input audio signal and the second input audio signal having a higher presence of interfering sound produced by a media player outputting audible audio sound in proximity to the speech interface device than the first input audio signal, the user speech including spoken commands to the speech interface;
  
  selecting the first input audio signal as a first directional audio signal corresponding to a direction of a source of the user speech;
  
  selecting, based in part on a directional audio signal corresponding in direction to a known position of the media player, the second input audio signal as a second directional audio signal corresponding to a direction other than the direction of the source of the user speech;
  
  analyzing the second input audio signal to identify at least one characteristic of content of the second input audio signal;
  
  requesting, by the speech interface device from the media player, an identity of a player content item being currently played by the media player and a temporal point within the player content item that is currently being output by the media player;
  
  determining, based at least in part on the at least one characteristic of the content of the second input audio signal, the identity of the player content item, and the temporal point, an identified media content item that includes sound corresponding to the interfering sound;
  
  obtaining a matching media content item from the reference content source, wherein the matching media content item matches the identified media content item that includes the sound corresponding to the interfering sound;
  
  processing the first input audio signal to suppress the identified media content item identified as the interfering sound in the first input audio signal by subtracting the matching media content item from the input audio in order to obtain an interference-suppressed speech; and
  
  sending the interference-suppressed speech to a remote service for performing automatic speech recognition and natural language understanding on the interference-suppressed speech in order to determine an intent to perform or initiate functions or services expressed by the spoken command.
- - 10. The method of claim 9, wherein processing the first input audio signal comprises removing a portion of the first input audio signal corresponding to the at least the portion of the media content item from the first input audio signal.
  - 11. The method of claim 9, further comprising:
    - identifying an audio signature of the interfering sound; and
      
      comparing the audio signature to a reference audio signature of the media content item.
  - 12. The method of claim 9, further comprising:
    - identifying a plurality of media content items that includes the media content item that are one or more of (a) currently accessible to the media player or (b) currently available to a user; and
      
      comparing the interfering sound from the second input audio signal with sound associated with the plurality of media content items.
  - 13. The method of claim 9, further comprising:
    - receiving, from the media player, an indication of a source of the interfering sound; and
      
      selecting the reference content source based at least in part on the indication of the source of the interfering sound.
  - 14. The method of claim 9, further comprising:
    - determining a starting point and an ending point of the media content item that corresponds to the interfering sound; and
      
      wherein the at least the portion of the media content item received from the reference content source corresponds to sound data associated with content between the starting point and the ending point.
  - 15. The method of claim 9, wherein the at least the portion of the media content item comprises the temporal location within the media content item.
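
Claims 9 and 14 combine a temporal point reported by the media player with an analysis of the captured sound to decide which portion of the content to fetch. The sketch below shows one possible way to do that and is not taken from the patent. It assumes signatures are flat bit lists and that positions are counted in signature frames rather than seconds. It also assumes the reported temporal point is only approximate, so a small window around it is searched. The names `hamming` and `locate_portion` are hypothetical.

```python
def hamming(a, b):
    """Count positions where two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def locate_portion(query, reference, hint, window):
    """Find where query best matches reference, searching near the hinted offset.

    Returns (start, end, distance), with start and end in signature frames.
    """
    lo = max(0, hint - window)
    hi = min(len(reference) - len(query), hint + window)
    if hi < lo:
        raise ValueError("search window does not overlap the reference")
    best_off, best_dist = lo, None
    for off in range(lo, hi + 1):
        d = hamming(query, reference[off:off + len(query)])
        if best_dist is None or d < best_dist:
            best_off, best_dist = off, d
    return best_off, best_off + len(query), best_dist


# Synthetic demo: the query sits at frame 12; the player reports roughly 10.
reference = [0] * 12 + [1, 0, 1, 1, 0, 0, 1, 0] + [0] * 12
query = [1, 0, 1, 1, 0, 0, 1, 0]
start, end, distance = locate_portion(query, reference, hint=10, window=5)
```

The returned start and end points bound the portion that would be requested from the reference content source. Restricting the search to a window around the hint keeps the comparison cheap even for long content items.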

16. A system comprising:
- one or more microphones;
  
  one or more processors of a first electronic device; and
  
  non-transitory computer-readable media maintaining instructions executable by the one or more processors of the first electronic device to perform acts comprising:
  
  receiving a first input audio signal from the one or more microphones;
  
  receiving a second input audio signal from the one or more microphones;
  
  determining a direction of a source of user speech;
  
  determining that the first input audio signal is associated with the direction of the source of the user speech, the user speech including a command for the system spoken by a user in proximity to the system;
  
  determining, based in part on a direction associated with a known position of a media player, that the second input audio signal is associated with a direction other than the direction of the source of the user speech, the second input audio signal having a higher presence of interfering sound produced by a media player outputting audible audio sound in proximity to the system than the first input audio signal;
  
  selecting the first input audio signal as a first directional audio signal corresponding to a direction of a source of the user speech;
  
  identifying, based at least in part on an analysis of the second input audio signal, at least one characteristic of content of the second input audio signal;
  
  requesting an identity of a player content item being currently played by the media player and a temporal point within the player content item that is currently being output by the media player;
  
  identifying a media content item associated with an interfering sound in the second input audio signal that was produced by a second electronic device based at least in part on the at least one characteristic of content of the second input audio signal, the identity of the player content item, and the temporal point;
  
  determining at least a portion of the media content item that corresponds to the interfering sound;
  
  receiving, from a reference content source, the at least the portion of the media content item; and
  
  processing the first input audio signal by removing the at least the portion of the media content item from the first input audio signal by subtracting the portion of the media content item from the first input audio in order to obtain interference-suppressed speech; and
  
  sending the interference-suppressed speech to the control service for performing automatic speech recognition and natural language understanding on the interference-suppressed speech in order to determine an intent to perform or initiate functions or services expressed by the command.
- - 17. The system of claim 16, the acts further comprising:
    - identifying an audio signature of the interfering sound; and
      
      wherein determining the at least the portion of the media content item that corresponds to the interfering sound comprises comparing the audio signature of the interfering sound with a reference audio signature associated with the media content item.
  - 18. The system of claim 16, wherein the first input audio signal comprises a plurality of directional audio signals and the determining the direction of the source of the user speech comprises determining a one of the plurality of directional audio signals having a highest presence of user voice.
  - 19. The system of claim 16, wherein the removing the at least the portion of the media content item comprises generating an estimated interference signal by an adaptive finite impulse response (FIR) filter and subtracting the estimated interference signal from the first input audio signal.
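
Claim 19 recites an adaptive finite impulse response (FIR) filter that estimates the interference before it is subtracted. The claims do not name an adaptation rule. The sketch below uses normalized least mean squares (NLMS) purely as an assumed, commonly used choice. The function name `nlms_cancel`, the tap count, and the step size are hypothetical.

```python
import random


def nlms_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Adaptively estimate the interference in mic from ref and subtract it.

    Returns (residual, weights). The residual is the interference-suppressed
    signal; the weights are the learned FIR coefficients.
    """
    w = [0.0] * taps
    residual = []
    for n in range(len(mic)):
        # Most recent `taps` reference samples, newest first, zero-padded.
        x = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        estimate = sum(wk * xk for wk, xk in zip(w, x))
        e = mic[n] - estimate
        norm = sum(xk * xk for xk in x) + eps
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        residual.append(e)
    return residual, w


# Synthetic demo: the room is modeled as a made-up two-tap echo path.
rng = random.Random(1)
ref = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [0.6 * ref[n] + (0.3 * ref[n - 1] if n > 0 else 0.0) for n in range(2000)]
residual, weights = nlms_cancel(mic, ref)
```

In this speech-free demo the residual decays toward zero as the weights approach the echo path. With a talker present, the residual would instead carry the speech that is sent on for recognition.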

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Pogue, Michael Alan; Piersol, Kurt Wesley
Primary Examiner(s): Sirjani, Fariba
Application Number: US14/478,923
Time in Patent Office: 1,747 Days
Field of Search: None
CPC Class Codes:
G10L 15/20   Speech recognition techniqu...
G10L 2021/02082   the noise being echo, rever...
G10L 2021/02166   Microphone arrays; Beamforming
G10L 21/0208   Noise filtering
G10L 25/51   for comparison or discrimin...
