Keyword detection modeling using contextual and environmental information
First Claim
1. A system comprising:
a computer-readable memory storing executable instructions; and
one or more processors in communication with the computer-readable memory, wherein the one or more processors are programmed by the executable instructions to at least:
obtain from a client device:
an audio signal, wherein a first portion of the audio signal comprises audio data likely corresponding to a wake word, and wherein a second portion of the audio signal does not comprise audio data likely corresponding to the wake word;
contextual information associated with the audio signal; and
information indicating the first portion of the audio signal comprises audio data likely corresponding to the wake word;
obtain acoustic information and environmental information from the first portion of the audio signal, wherein the acoustic information reflects one or more characteristics of a voice in the audio signal, and wherein the environmental information reflects one or more characteristics of an environment in which sound in the audio signal was recorded;
determine whether audio data corresponding to the wake word is present in the audio signal using a server-side detection model configured to generate a detection score using the contextual information, the environmental information, the acoustic information, and natural language understanding results generated based at least partly on at least one of the audio signal or a subsequent audio signal, wherein a detection score greater than a detection threshold indicates that audio data corresponding to the wake word is present in the audio signal;
in response to determining that audio data corresponding to the wake word is present in the audio signal, perform an action corresponding to a request in the audio signal; and
in response to determining that audio data corresponding to the wake word is not present in the audio signal, close an audio signal stream from the client device.
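The decision flow recited in claim 1 can be illustrated with a minimal sketch. The claim does not specify the form of the server-side detection model, so the logistic combination below, the field names in `DetectionInputs`, the weights, the bias, and the default threshold are all assumptions chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class DetectionInputs:
    """Hypothetical feature bundle; every field name is an assumption."""
    acoustic_score: float       # how well the voice matches the wake word
    environmental_score: float  # e.g. a recording-quality estimate in [0, 1]
    contextual_score: float     # e.g. prior likelihood given the context
    nlu_score: float            # confidence that a well-formed request follows


# Illustrative weights and bias; a real model would learn these from data.
WEIGHTS = {
    "acoustic_score": 3.0,
    "environmental_score": 1.0,
    "contextual_score": 1.0,
    "nlu_score": 2.0,
}
BIAS = -3.5


def detection_score(inputs: DetectionInputs) -> float:
    """Combine the four information sources into one score in (0, 1)."""
    z = BIAS + sum(w * getattr(inputs, name) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def handle_audio(inputs: DetectionInputs, threshold: float = 0.5) -> str:
    """Return the server's next step for one audio signal."""
    # Claim 1 requires a score strictly greater than the threshold.
    if detection_score(inputs) > threshold:
        return "perform_action"
    return "close_stream"
```

With these assumed weights, strong evidence from all four sources leads to `perform_action`, while weak evidence leads to `close_stream`. A score exactly equal to the threshold also closes the stream, because claim 1 requires a score greater than the threshold.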
Abstract
Features are disclosed for detecting words in audio using environmental information and/or contextual information in addition to acoustic features associated with the words to be detected. A detection model can be generated and used to determine whether a particular word, such as a keyword or “wake word,” has been uttered. The detection model can operate on features derived from an audio signal, contextual information associated with generation of the audio signal, and the like. In some embodiments, the detection model can be customized for particular users or groups of users based on usage patterns associated with the users.
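One way to picture customization based on usage patterns is a per-user contextual feature derived from past activity. The sketch below is an assumption for illustration only: the abstract does not say how usage patterns are encoded, and the function name `hourly_prior`, the hour-of-day bucketing, and the additive smoothing are all hypothetical.

```python
from collections import Counter
from typing import Iterable


def hourly_prior(
    confirmed_hours: Iterable[int],
    attempted_hours: Iterable[int],
    hour: int,
    alpha: float = 1.0,
) -> float:
    """Smoothed fraction of a user's past detections at `hour` that were confirmed.

    `confirmed_hours` and `attempted_hours` are hypothetical usage logs: the
    hour of day (0-23) of each confirmed detection and of each candidate
    detection, respectively. `alpha` is an additive smoothing constant so
    that hours with no history yield a neutral value of 0.5.
    """
    confirmed = Counter(confirmed_hours)[hour]
    attempted = Counter(attempted_hours)[hour]
    return (confirmed + alpha) / (attempted + 2 * alpha)
```

A value like this could serve as one contextual input to a detection model, so that a user who routinely speaks to the device at a given hour gets a higher prior at that hour than at an hour with mostly false triggers.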
266 Citations
27 Claims
1. A system comprising: (recited in full under First Claim above)
Dependent claims: 2, 3, 4, 27
5. A computer-implemented method comprising:
as implemented by one or more computing devices configured to execute specific instructions,
obtaining from a client device:
audio input comprising a plurality of portions of audio data, wherein less than all of the plurality of portions of audio data comprise audio data corresponding to a keyword detected by the client device;
contextual information associated with the audio input; and
information indicating a portion of audio data, of the plurality of portions of the audio data, that likely corresponds to the keyword;
obtaining acoustic information and environmental information from the portion of audio data that likely corresponds to the keyword;
determining that the portion of audio data corresponds to the keyword using a detection model configured to generate a detection score using the audio input, the contextual information, the environmental information, and the acoustic information, wherein a detection score satisfying a detection threshold indicates audio data corresponding to the keyword is present in the audio input; and
performing an action corresponding to a request in the audio input.
Dependent claims: 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 24, 25, 26
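The data obtained from the client device in claim 5, and the step of deriving information from only the flagged portion, can be sketched as follows. The `ClientUpload` type, its field names, the sample-index representation of the flagged portion, and the use of RMS level and minimum frame energy as stand-ins for acoustic and environmental information are all assumptions made for illustration; the claim does not prescribe any of them.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClientUpload:
    """Hypothetical payload sent by the client device."""
    samples: List[float]   # full audio input (keyword plus the request)
    keyword_start: int     # index of the first sample flagged by the client
    keyword_end: int       # index one past the last flagged sample
    context: Dict[str, str] = field(default_factory=dict)  # e.g. device type


def rms(xs: List[float]) -> float:
    """Root-mean-square level of a list of samples (0.0 if empty)."""
    return math.sqrt(sum(x * x for x in xs) / len(xs)) if xs else 0.0


def keyword_features(upload: ClientUpload, frame: int = 4) -> Dict[str, float]:
    """Derive features from only the portion flagged as the likely keyword."""
    portion = upload.samples[upload.keyword_start:upload.keyword_end]
    frames = [portion[i:i + frame] for i in range(0, len(portion), frame)]
    energies = [rms(f) for f in frames]
    return {
        # Crude stand-in for acoustic information: overall level of the voice.
        "level": rms(portion),
        # Crude stand-in for environmental information: quietest frame.
        "noise_floor": min(energies) if energies else 0.0,
    }
```

Samples outside the flagged range are ignored when computing these features, mirroring the claim language that the acoustic and environmental information is obtained from the portion that likely corresponds to the keyword.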
16. Non-transitory computer-readable storage comprising executable code that, when executed, causes one or more computing devices to perform a process comprising:
obtaining from a client device:
audio input, wherein less than all of the audio input comprises audio data likely corresponding to a keyword;
contextual information associated with the audio input; and
information indicating a portion of audio input, of a plurality of portions of the audio input, that likely corresponds to the keyword;
obtaining acoustic information and environmental information from the portion of the audio input;
determining that audio data corresponding to the keyword is present in the audio input using a detection model configured to generate a detection score using the audio input, the contextual information, the environmental information, and the acoustic information, wherein a detection score satisfying a detection threshold indicates audio data corresponding to the keyword is present in the audio input; and
performing an action corresponding to a request in the audio input.
Dependent claims: 17, 18, 19, 20, 21, 22, 23
Specification