Identifying and suppressing interfering audio content
First Claim
1. A speech-based system, comprising:
- one or more microphones configured to produce:
a first input audio signal containing user speech and an interfering sound from a media content item played by a media player, the media player and the user in proximity to the speech-based system and the user speech including at least one spoken command for the speech-based system; and
a second input audio signal containing the user speech and the interfering sound from the media content item played by the media player;
one or more processors;
non-transitory computer-readable storage media maintaining instructions executable by the one or more processors to perform operations comprising:
selecting the first input audio signal as a first directional audio signal corresponding to a direction of a source of the user speech;
selecting the second input audio signal as a second directional audio signal corresponding to a direction other than the direction of the source of the user speech based at least in part on a directional audio signal corresponding in direction to a known position of the media player;
analyzing the second input audio signal to determine at least one characteristic of content of the second input audio signal;
requesting an identity of a player content item being currently played by the media player and a temporal point within the player content item that is currently being output by the media player;
generating an audio signature representative of the interfering sound based at least in part on the at least one characteristic of the content of the second input audio signal;
identifying a plurality of media content items that are currently accessible to the media player;
selecting a particular media content item of the plurality of media content items based at least in part on the audio signature, the identity of the player content item, the temporal point, and a reference audio signature that corresponds to the particular media content item;
receiving at least a portion of the particular media content item that corresponds to the interfering sound from a reference content source; and
processing the first input audio signal to suppress the interfering sound based at least in part on the at least the portion of the particular media content item by subtracting the portion of the particular media content item from the first input audio in order to obtain an interference-suppressed speech; and
sending the interference-suppressed speech to a remote service for performing automatic speech recognition and natural language understanding on the interference-suppressed speech in order to determine an intent to perform or initiate functions or services expressed by the spoken command.
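The signature and selection operations recited above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the claim does not say how the audio signature is computed, so the energy-trend bits used here are a toy stand-in, and the names `audio_signature`, `select_item`, `catalog`, `reported_id`, and `start_bit` are hypothetical. The sketch only shows how a measured signature, the identity reported by the player, and the temporal point might be combined to pick one item from those accessible to the player.

```python
def audio_signature(samples, frame=4):
    """Toy signature (assumption): one bit per adjacent frame pair, 1 if energy rises."""
    energies = [sum(x * x for x in samples[i:i + frame])
                for i in range(0, len(samples) - frame + 1, frame)]
    return [1 if b > a else 0 for a, b in zip(energies, energies[1:])]


def hamming(a, b):
    """Count differing bits; a length mismatch counts as extra differences."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def select_item(signature, catalog, reported_id, start_bit):
    """Pick the catalog item whose reference signature best matches.

    catalog: hypothetical dict mapping item id -> full reference signature.
    reported_id: identity returned by the media player (may be None).
    start_bit: signature index derived from the reported temporal point.
    """
    def score(item_id):
        window = catalog[item_id][start_bit:start_bit + len(signature)]
        distance = hamming(signature, window)
        # A small bonus for the item the player says it is playing (assumed weight).
        return distance - (0.5 if item_id == reported_id else 0.0)

    return min(catalog, key=score)
```

In this sketch the player-reported identity only biases the choice, so a strong signature match can still override a stale or incorrect report. That weighting is a design assumption, not something the claim specifies.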
Abstract
A speech interface device may capture user speech for analysis by automatic speech recognition (ASR) and natural language understanding (NLU) components. However, an audio signal representing the user speech may also contain interfering sound generated by a media player that is playing audio content such as music. Before performing ASR and NLU, a system attempts to identify the content being played by the media player, such as by querying the media player or by analyzing the audio signal. The system then obtains the same content from an available source and subtracts the audio represented by the content from the audio signal.
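The subtraction described in the abstract can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the system's actual processing: it assumes the reference content is already time-aligned through a known `offset`, and it models the interfering sound as the reference scaled by a single least-squares gain. A practical system would more likely need adaptive filtering to account for room acoustics. The function name and parameters are hypothetical.

```python
def suppress_interference(mic, reference, offset):
    """Subtract a gain-scaled, time-aligned reference from the microphone signal.

    mic: list of float samples containing speech plus interfering sound.
    reference: list of float samples of the identified content item.
    offset: index in `reference` that lines up with mic[0] (assumed known).
    """
    segment = reference[offset:offset + len(mic)]
    # Pad with silence if the reference ends before the microphone buffer does.
    segment = segment + [0.0] * (len(mic) - len(segment))
    energy = sum(r * r for r in segment)
    # Least-squares gain: how loudly the reference appears in the mic signal.
    gain = sum(m * r for m, r in zip(mic, segment)) / energy if energy > 0 else 0.0
    return [m - gain * r for m, r in zip(mic, segment)]
```

The single gain is estimated from the same buffer that contains the speech, so speech that correlates with the reference will bias it; the sketch ignores that effect.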
19 Claims
1. A speech-based system, comprising the elements recited in full under "First Claim" above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8.
9. A method being performed at a speech interface device in communication with a media player and a reference content source, the method comprising:
receiving an input audio signal from a user in proximity of the speech interface device, wherein the input audio signal comprises a first input audio signal and a second input audio signal, the first input audio signal having a higher presence of user speech than the second input audio signal and the second input audio signal having a higher presence of interfering sound produced by a media player outputting audible audio sound in proximity to the speech interface device than the first input audio signal, the user speech including spoken commands to the speech interface;
selecting the first input audio signal as a first directional audio signal corresponding to a direction of a source of the user speech;
selecting, based in part on a directional audio signal corresponding in direction to a known position of the media player, the second input audio signal as a second directional audio signal corresponding to a direction other than the direction of the source of the user speech;
analyzing the second input audio signal to identify at least one characteristic of content of the second input audio signal;
requesting, by the speech interface device from the media player, an identity of a player content item being currently played by the media player and a temporal point within the player content item that is currently being output by the media player;
determining, based at least in part on the at least one characteristic of the content of the second input audio signal, the identity of the player content item, and the temporal point, an identified media content item that includes sound corresponding to the interfering sound;
obtaining a matching media content item from the reference content source, wherein the matching media content item matches the identified media content item that includes the sound corresponding to the interfering sound;
processing the first input audio signal to suppress the identified media content item identified as the interfering sound in the first input audio signal by subtracting the matching media content item from the input audio in order to obtain an interference-suppressed speech; and
sending the interference-suppressed speech to a remote service for performing automatic speech recognition and natural language understanding on the interference-suppressed speech in order to determine an intent to perform or initiate functions or services expressed by the spoken command.
Dependent claims: 10, 11, 12, 13, 14, 15.
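The two selecting steps of claim 9 can be pictured as choosing among fixed beams of a microphone array. The sketch below is only an illustration under assumptions the claim does not state: that the device exposes a set of directional signals at known bearings, that the speech direction has already been estimated, and that the media player's position is known as a bearing. The names `beam_angles`, `speech_angle`, and `player_angle` are hypothetical.

```python
def angular_distance(a, b):
    """Smallest absolute difference between two bearings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def select_directional_signals(beam_angles, speech_angle, player_angle):
    """Return (speech_beam, interference_beam) as indices into beam_angles.

    The first index is the beam closest to the user's speech. The second is
    the beam closest to the known media player position, chosen only from
    beams other than the speech beam. Requires at least two beams.
    """
    indices = range(len(beam_angles))
    speech_beam = min(
        indices, key=lambda i: angular_distance(beam_angles[i], speech_angle))
    others = [i for i in indices if i != speech_beam]
    interference_beam = min(
        others, key=lambda i: angular_distance(beam_angles[i], player_angle))
    return speech_beam, interference_beam
```

Excluding the speech beam before picking the second signal mirrors the claim's requirement that the second directional signal correspond to a direction other than that of the user speech.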
16. A system comprising:
one or more microphones;
one or more processors of a first electronic device; and
non-transitory computer-readable media maintaining instructions executable by the one or more processors of the first electronic device to perform acts comprising:
receiving a first input audio signal from the one or more microphones;
receiving a second input audio signal from the one or more microphones;
determining a direction of a source of user speech;
determining that the first input audio signal is associated with the direction of the source of the user speech, the user speech including a command for the system spoken by a user in proximity to the system;
determining, based in part on a direction associated with a known position of a media player, that the second input audio signal is associated with a direction other than the direction of the source of the user speech, the second input audio signal having a higher presence of interfering sound produced by a media player outputting audible audio sound in proximity to the system than the first input audio signal;
selecting the first input audio signal as a first directional audio signal corresponding to a direction of a source of the user speech;
identifying, based at least in part on an analysis of the second input audio signal, at least one characteristic of content of the second input audio signal;
requesting an identity of a player content item being currently played by the media player and a temporal point within the player content item that is currently being output by the media player;
identifying a media content item associated with an interfering sound in the second input audio signal that was produced by a second electronic device based at least in part on the at least one characteristic of content of the second input audio signal, the identity of the player content item, and the temporal point;
determining at least a portion of the media content item that corresponds to the interfering sound;
receiving, from a reference content source, the at least the portion of the media content item; and
processing the first input audio signal by removing the at least the portion of the media content item from the first input audio signal by subtracting the portion of the media content item from the first input audio in order to obtain interference-suppressed speech; and
sending the interference-suppressed speech to the control service for performing automatic speech recognition and natural language understanding on the interference-suppressed speech in order to determine an intent to perform or initiate functions or services expressed by the command.
Dependent claims: 17, 18, 19.
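Claim 16 recites determining the portion of the media content item that corresponds to the interfering sound. One way to picture this, offered only as a hedged sketch, is to treat the temporal point reported by the player as a rough starting position and refine it with a small cross-correlation search against the captured audio. The claim does not prescribe this technique; the function names, the `search` window, and the use of raw (unnormalized) correlation are all assumptions.

```python
def best_lag(mic, reference, expected, search=3):
    """Return the reference offset near `expected` that best matches `mic`.

    Uses raw correlation (assumption); a louder passage can outscore a
    true match, so a real system would likely normalize.
    """
    def correlation(offset):
        segment = reference[offset:offset + len(mic)]
        return sum(m * r for m, r in zip(mic, segment))

    last = max(0, len(reference) - len(mic))
    lo = max(0, min(expected - search, last))
    hi = min(last, max(expected + search, 0))
    return max(range(lo, hi + 1), key=correlation)


def portion_for(mic, reference, temporal_point_s, sample_rate, search=3):
    """Map a reported temporal point (seconds) to a refined reference portion."""
    expected = int(round(temporal_point_s * sample_rate))
    start = best_lag(mic, reference, expected, search)
    return start, reference[start:start + len(mic)]
```

For example, with a toy sample rate of 1 Hz, a reported temporal point of 3 seconds, and captured audio that actually matches the reference starting at sample 2, `portion_for` returns offset 2 and the four matching reference samples.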
Specification