Keyword detection modeling using contextual and environmental information

US 9,697,828 B1
Filed: 06/20/2014
Issued: 07/04/2017
Est. Priority Date: 06/20/2014
Status: Active Grant

Abstract

Features are disclosed for detecting words in audio using environmental information and/or contextual information in addition to acoustic features associated with the words to be detected. A detection model can be generated and used to determine whether a particular word, such as a keyword or “wake word,” has been uttered. The detection model can operate on features derived from an audio signal, contextual information associated with generation of the audio signal, and the like. In some embodiments, the detection model can be customized for particular users or groups of users based on usage patterns associated with the users.
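
As a rough illustration of the kind of model the abstract describes, the sketch below combines one acoustic, one environmental, and one contextual feature into a single detection score with a logistic function. The feature names, weights, and bias are hypothetical and chosen only for illustration; the patent does not specify them, and a real model would learn its parameters from training data.

```python
import math
from dataclasses import dataclass


@dataclass
class DetectionFeatures:
    # All three fields are hypothetical stand-ins for the feature families
    # named in the abstract; they are not taken from the patent.
    acoustic_score: float  # acoustic: keyword-spotter confidence in [0, 1]
    noise_level: float     # environmental: estimated background noise in [0, 1]
    usual_hour: bool       # contextual: device is normally used at this hour


# Illustrative parameters only (assumed, not learned).
WEIGHTS = {"acoustic_score": 4.0, "noise_level": -1.5, "usual_hour": 1.0}
BIAS = -2.0


def detection_score(features: DetectionFeatures) -> float:
    """Return a score in (0, 1); higher means the wake word is more likely."""
    z = (
        BIAS
        + WEIGHTS["acoustic_score"] * features.acoustic_score
        + WEIGHTS["noise_level"] * features.noise_level
        + WEIGHTS["usual_hour"] * float(features.usual_hour)
    )
    return 1.0 / (1.0 + math.exp(-z))


quiet_room = DetectionFeatures(acoustic_score=0.9, noise_level=0.1, usual_hour=True)
noisy_room = DetectionFeatures(acoustic_score=0.9, noise_level=0.9, usual_hour=False)
print(detection_score(quiet_room) > detection_score(noisy_room))
```

With these assumed weights, the same acoustic confidence yields a lower score in a noisy environment at an unusual time, which is the effect the contextual and environmental inputs are meant to capture.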

266 Citations

27 Claims

1. A system comprising:
- a computer-readable memory storing executable instructions; and
  
  one or more processors in communication with the computer-readable memory, wherein the one or more processors are programmed by the executable instructions to at least:
  
  obtain from a client device:
  
  an audio signal, wherein a first portion of the audio signal comprises audio data likely corresponding to a wake word, and wherein a second portion of the audio signal does not comprise audio data likely corresponding to the wake word;
  
  contextual information associated with the audio signal; and
  
  information indicating the first portion of the audio signal comprises audio data likely corresponding to the wake word;
  
  obtain acoustic information and environmental information from the first portion of the audio signal, wherein the acoustic information reflects one or more characteristics of a voice in the audio signal, and wherein the environmental information reflects one or more characteristics of an environment in which sound in the audio signal was recorded;
  
  determine whether audio data corresponding to the wake word is present in the audio signal using a server-side detection model configured to generate a detection score using the contextual information, the environmental information, the acoustic information, and natural language understanding results generated based at least partly on at least one of the audio signal or a subsequent audio signal, wherein a detection score greater than a detection threshold indicates that audio data corresponding to the wake word is present in the audio signal;
  
  in response to determining that audio data corresponding to the wake word is present in the audio signal, perform an action corresponding to a request in the audio signal; and
  
  in response to determining that audio data corresponding to the wake word is not present in the audio signal, close an audio signal stream from the client device.
- Dependent claims (2, 3, 4, 27):
  - 2. The system of claim 1, wherein the server-side detection model comprises a statistical classifier or a probabilistic logic network.
  - 3. The system of claim 1, wherein the server-side detection model is further configured to generate the detection score using automatic speech recognition results generated based at least partly on at least one of the audio signal or the subsequent audio signal.
  - 4. The system of claim 1, wherein the one or more processors are further programmed to:
    - store information regarding determining whether audio data corresponding to the wake word is present in the audio signal;
      
      train a customized client-side detection model using training data based at least partly on the information regarding determining whether audio data corresponding to the wake word is present in the audio signal; and
      
      transmit the customized client-side detection model to the client device.
  - 27. The system of claim 1, wherein the executable instructions to close the audio signal stream from the client device comprise executable instructions to at least:
    - close a connection with the client device, reject audio data from the client device, or instruct the client device to stop transmitting audio data.
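
The server-side decision in claims 1 and 27 can be sketched as follows: a score strictly greater than the threshold triggers the requested action, and anything else closes the audio stream in one of the three ways claim 27 lists. The class and function names here (`AudioStream`, `CloseMode`, `handle_request`) are hypothetical, and the score is passed in rather than computed, since the patent does not prescribe a particular scoring implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class CloseMode(Enum):
    # The three alternatives recited in claim 27.
    CLOSE_CONNECTION = "close_connection"
    REJECT_AUDIO = "reject_audio"
    STOP_CLIENT = "instruct_client_to_stop"


@dataclass
class AudioStream:
    """Hypothetical stand-in for the audio stream from the client device."""
    is_open: bool = True
    events: List[str] = field(default_factory=list)

    def close(self, mode: CloseMode) -> None:
        self.is_open = False
        self.events.append(mode.value)


def handle_request(
    stream: AudioStream,
    score: float,
    threshold: float,
    action: Callable[[], None],
    close_mode: CloseMode = CloseMode.STOP_CLIENT,
) -> bool:
    """Perform the action if the wake word is confirmed, else close the stream."""
    if score > threshold:  # claim 1: "greater than a detection threshold"
        action()
        return True
    stream.close(close_mode)
    return False


stream = AudioStream()
confirmed = handle_request(stream, score=0.3, threshold=0.5, action=lambda: None)
print(confirmed, stream.is_open, stream.events)
```

Note that a score exactly equal to the threshold is treated as a rejection in this sketch, following the "greater than" wording of claim 1.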

5. A computer-implemented method comprising:
- as implemented by one or more computing devices configured to execute specific instructions, obtaining from a client device:
  
  audio input comprising a plurality of portions of audio data, wherein less than all of the plurality of portions of audio data comprise audio data corresponding to a keyword detected by the client device;
  
  contextual information associated with the audio input; and
  
  information indicating a portion of audio data, of the plurality of portions of the audio data, that likely corresponds to the keyword;
  
  obtaining acoustic information and environmental information from the portion of audio data that likely corresponds to the keyword;
  
  determining that the portion of audio data corresponds to the keyword using a detection model configured to generate a detection score using the audio input, the contextual information, the environmental information, and the acoustic information, wherein a detection score satisfying a detection threshold indicates audio data corresponding to the keyword is present in the audio input; and
  
  performing an action corresponding to a request in the audio input.
- Dependent claims (6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 24, 25, 26):
  - 6. The computer-implemented method of claim 5, wherein the detection model is trained using training data comprising contextual information, acoustic information, environmental information, and linguistic information.
  - 7. The computer-implemented method of claim 5, wherein the acoustic information comprises natural language understanding results generated using the audio input.
  - 8. The computer-implemented method of claim 5, wherein the acoustic information reflects one or more characteristics of a voice in the audio input.
  - 9. The computer-implemented method of claim 5, wherein the environmental information reflects one or more characteristics of an environment in which sound in the audio input was captured.
  - 10. The computer-implemented method of claim 5, wherein the contextual information reflects at least one of a time at which the audio input was generated, a geographic location of an audio input device, or a physical orientation of the audio input device with respect to a user.
  - 11. The computer-implemented method of claim 5, wherein a detection score failing to satisfy the detection threshold indicates that the portion of the audio input does not correspond to the keyword.
  - 12. The computer-implemented method of claim 5, wherein the detection model comprises a probabilistic logic network comprising a rule defined by one of a system administrator or a system user.
  - 13. The computer-implemented method of claim 5, wherein the detection model comprises a probabilistic logic network comprising a rule automatically generated using a machine learning process.
  - 14. The computer-implemented method of claim 5, further comprising:
    - storing information regarding use of the detection model to determine whether the portion of the audio input corresponds to the keyword; and
      
      training a customized client-side detection model using training data based at least partly on the information regarding use of the detection model.
  - 15. The computer-implemented method of claim 14, further comprising providing the customized client-side detection model to one or more client computing devices.
  - 24. The computer-implemented method of claim 5, wherein the detection model is a server-side detection model.
  - 25. The computer-implemented method of claim 5, wherein obtaining, from the client device, the audio input comprises obtaining the audio input in batches from the client device, wherein the batches of the audio input have previously been generated on the client device from an audio signal.
  - 26. The computer-implemented method of claim 5, wherein determining whether the portion of the audio input corresponds to the keyword is based at least in part on the information regarding at least one of a start position or an end position of the portion of the audio input.
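
Claims 25 and 26 refer to audio arriving in batches and to start and end positions of the portion that likely contains the keyword. A minimal sketch of how a server might use that information is shown below, assuming the positions are sample indices into the concatenated batches. That indexing convention and the function name `keyword_portion` are assumptions for illustration, not details from the patent.

```python
from typing import Iterable, List, Sequence


def keyword_portion(
    batches: Iterable[Sequence[int]], start: int, end: int
) -> List[int]:
    """Concatenate client batches and return samples in [start, end).

    `start` and `end` are assumed to be sample offsets reported by the
    client for the portion that likely corresponds to the keyword.
    """
    samples = [sample for batch in batches for sample in batch]
    if not 0 <= start < end <= len(samples):
        raise ValueError("keyword span lies outside the received audio")
    return samples[start:end]


# Three batches of (toy) audio samples; the keyword spans samples 2..4.
received = [[0, 1, 2], [3, 4, 5], [6, 7]]
print(keyword_portion(received, start=2, end=5))
```

The extracted portion is only part of the received audio, matching the claim language that less than all of the audio input corresponds to the keyword; the remainder would carry the user's request.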

16. Non-transitory computer-readable storage comprising executable code that, when executed, causes one or more computing devices to perform a process comprising:
- obtaining from a client device:
  
  audio input, wherein less than all of the audio input comprises audio data likely corresponding to a keyword;
  
  contextual information associated with the audio input; and
  
  information indicating a portion of audio input, of a plurality of portions of the audio input, that likely corresponds to the keyword;
  
  obtaining acoustic information and environmental information from the portion of the audio input;
  
  determining that audio data corresponding to the keyword is present in the audio input using a detection model configured to generate a detection score using the audio input, the contextual information, the environmental information, and the acoustic information, wherein a detection score satisfying a detection threshold indicates audio data corresponding to the keyword is present in the audio input; and
  
  performing an action corresponding to a request in the audio input.
- Dependent claims (17, 18, 19, 20, 21, 22, 23):
  - 17. The non-transitory computer-readable storage of claim 16, wherein the detection model is trained using training data comprising contextual information, acoustic information, environmental information, and linguistic information.
  - 18. The non-transitory computer-readable storage of claim 16, wherein the acoustic information comprises natural language understanding results generated using the audio input.
  - 19. The non-transitory computer-readable storage of claim 16, wherein the acoustic information reflects one or more characteristics of a voice in the audio input.
  - 20. The non-transitory computer-readable storage of claim 16, wherein the environmental information reflects one or more characteristics of an environment in which a voice in the audio input was captured.
  - 21. The non-transitory computer-readable storage of claim 16, wherein the contextual information reflects at least one of a time at which the audio input was generated, a geographic location of an input device, or a physical orientation of the input device with respect to a user.
  - 22. The non-transitory computer-readable storage of claim 16, wherein a detection score failing to satisfy the detection threshold indicates that audio data corresponding to the keyword is not present in the audio input.
  - 23. The non-transitory computer-readable storage of claim 16, the process further comprising:
    - storing information regarding use of the detection model to determine whether the portion of the audio input corresponds to the keyword; and
      
      training a customized client-side detection model using training data based at least partly on the information regarding use of the detection model.
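
Claims 4, 14, and 23 describe storing information about detection decisions and using it to train a customized client-side model. The sketch below is a deliberately simple stand-in for that training step: it fits a single client-side threshold that best agrees with stored server-side decisions. The record layout and the threshold-only "model" are assumptions for illustration; the patent does not limit the client-side model to this form.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DetectionRecord:
    """Hypothetical stored record of one use of the detection model."""
    client_score: float     # score the client-side detector produced
    server_confirmed: bool  # whether the server-side model confirmed the keyword


def train_client_threshold(records: List[DetectionRecord]) -> float:
    """Pick the client threshold that disagrees least with server decisions."""
    if not records:
        raise ValueError("at least one stored record is required")

    def disagreements(threshold: float) -> int:
        return sum(
            (r.client_score >= threshold) != r.server_confirmed for r in records
        )

    # Candidate thresholds are the observed scores; ties go to the lowest one.
    candidates = sorted({r.client_score for r in records})
    return min(candidates, key=disagreements)


history = [
    DetectionRecord(0.9, True),
    DetectionRecord(0.8, True),
    DetectionRecord(0.6, False),
    DetectionRecord(0.4, False),
]
print(train_client_threshold(history))
```

The resulting value could then be sent to the client device as its customized detection setting, in the spirit of the "transmit" and "provide" steps of claims 4 and 15.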

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Matsoukas, Spyridon; Ramachandran, Rajiv; Vitaladevuni, Shiv Naga Prasad; Prasad, Rohit; Basye, Kenneth John; Hoffmeister, Bjorn
Primary Examiner(s): Desir, Pierre-Louis
Assistant Examiner(s): Kim, Jonathan
Application Number: US14/311,163
Time in Patent Office: 1,110 Days
Field of Search: None

CPC Class Codes
G10L 15/08   Speech classification or se...
G10L 15/18   using natural language mode...
G10L 15/30   Distributed recognition, e....
G10L 2015/088   Word spotting
