Altering audio to improve automatic speech recognition

US 10,354,649 B2
Filed: 03/12/2018
Issued: 07/16/2019
Est. Priority Date: 09/26/2012
Status: Active Grant


Abstract

Techniques for altering audio being output by a voice-controlled device, or another device, to enable more accurate automatic speech recognition (ASR) by the voice-controlled device. For instance, a voice-controlled device may output audio within an environment using a speaker of the device. While outputting the audio, a microphone of the device may capture sound within the environment and may generate an audio signal based on the captured sound. The device may then analyze the audio signal to identify speech of a user within the signal, with the speech indicating that the user is going to provide a subsequent command to the device. Thereafter, the device may alter the output of the audio (e.g., attenuate the audio, pause the audio, switch from stereo to mono, etc.) to facilitate speech recognition of the user's subsequent command.
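
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: text strings stand in for audio chunks, and the names `VoiceDevice`, `Speaker`, `WAKE_WORD`, and the volume levels are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

WAKE_WORD = "wake"  # hypothetical predefined utterance


@dataclass
class Speaker:
    """Toy stand-in for a speaker; it tracks only the output volume."""
    volume: float = 1.0


@dataclass
class VoiceDevice:
    speaker: Speaker = field(default_factory=Speaker)
    normal_volume: float = 1.0
    ducked_volume: float = 0.2  # assumed attenuation level
    listening: bool = False
    captured: List[str] = field(default_factory=list)

    def on_audio(self, chunk: str) -> None:
        """Handle one chunk of microphone input (text stands in for audio)."""
        if not self.listening:
            if WAKE_WORD in chunk.lower():
                # Predefined utterance detected: attenuate the output.
                self.speaker.volume = self.ducked_volume
                self.listening = True
        else:
            # Sound captured while the output is altered is kept for recognition.
            self.captured.append(chunk)

    def finish_command(self) -> str:
        """End the altered period, restore the output, and return the command."""
        command = " ".join(self.captured)
        self.captured.clear()
        self.listening = False
        self.speaker.volume = self.normal_volume
        return command
```

In this sketch, `finish_command` returns the text that a real device would instead send to remote computing resources for recognition. How the end of the command is detected is left open here.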

20 Claims

1. An apparatus comprising:
- at least one speaker;
  
  at least one microphone;
  
  one or more processors; and
  
  computer-readable media storing computer-executable instructions that, when executed on the one or more processors, cause the one or more processors to:
  
  cause the at least one speaker to output first content;
  
  receive a first input audio signal generated by the at least one microphone based at least in part on sound captured by the at least one microphone;
  
  detect a predefined utterance within the first input audio signal, the predefined utterance indicating that a voice command is going to be provided;
  
  alter, based at least in part on detecting the predefined utterance, output of the first content by the at least one speaker for a first period of time;
  
  receive a second input audio signal generated by the at least one microphone based at least in part on sound captured by the at least one microphone during at least a portion of the first period of time;
  
  send, to one or more remote computing resources, the second input audio signal for identifying the voice command in the second input audio signal; and
  
  cause the at least one speaker to at least one of:
  
  output the first content for a second period of time that is after the first period of time; or
  
  output second content for the second period of time, wherein the second content is different from the first content.
- - 2. The apparatus as recited in claim 1, wherein altering the output of the first content comprises at least one of lowering a volume at which the at least one speaker outputs the first content during the first period of time or stopping outputting of the first content for the first period of time.
  - 3. The apparatus as recited in claim 1, wherein altering the output of the first content comprises switching from outputting the first content in stereo to outputting the first content in mono for the first period of time.
  - 4. The apparatus as recited in claim 1, wherein altering the output of the first content comprises stopping outputting the first content for the first period of time, the apparatus further comprising:
    - a switch to decouple the at least one speaker from a power source during the first period of time.
  - 5. The apparatus as recited in claim 1, wherein the computer-executable instructions further cause the one or more processors to:
    - determine, based at least in part on the first input audio signal, a physical position of a user relative to the at least one speaker, the user being a source of the predefined utterance; and
      
      determine how to alter output of the first content based at least in part on the physical position.
  - 6. The apparatus as recited in claim 1, wherein the computer-executable instructions further cause the one or more processors to:
    - determine an identity of a user, the user being a source of the predefined utterance;
      
      determine a user profile associated with the user; and
      
      determine how to alter output of the first content based at least in part on the identity of the user.
  - 7. The apparatus of claim 1, the computer-readable media storing further computer-executable instructions that, when executed on the one or more processors, cause the one or more processors to further:
    - determine a type of the output content, wherein:
      
      based at least in part on the output content being a first type, the one or more processors alter the output in a first manner; and
      
      based at least in part on the output content being a second type, the one or more processors alter the output in a second manner.
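
The alterations named in claims 2 and 3 (lowering the volume, stopping the output, switching from stereo to mono) can be sketched as operations on a buffer of stereo samples. The claims do not say how each alteration is carried out, so the functions below are assumptions: the `Frame` type and the function names are hypothetical, and averaging the two channels is only one common way to downmix.

```python
from typing import List, Tuple

Frame = Tuple[float, float]  # one (left, right) stereo sample pair


def lower_volume(frames: List[Frame], gain: float) -> List[Frame]:
    """Scale both channels by the same gain (0.0 to 1.0)."""
    return [(left * gain, right * gain) for left, right in frames]


def to_mono(frames: List[Frame]) -> List[Frame]:
    """Average left and right, then send the same signal to both channels."""
    return [((left + right) / 2, (left + right) / 2) for left, right in frames]


def stop_output(frames: List[Frame]) -> List[Frame]:
    """Replace the content with silence for the altered period."""
    return [(0.0, 0.0) for _ in frames]
```

Each function returns a new buffer and leaves its input untouched, so the unaltered content remains available when normal output resumes.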

8. A method implemented at least in part by an apparatus comprising at least one speaker and at least one microphone, the method comprising:
- causing the at least one speaker to output first content;
  
  generating a first input audio signal using the at least one microphone based at least in part on sound captured by the at least one microphone;
  
  detecting a predefined utterance within the first input audio signal, the predefined utterance indicating that a voice command is going to be provided;
  
  determining a type of the first content;
  
  altering, based at least in part on the detecting of the predefined utterance and the type of the first content, output of the first content by the at least one speaker for a first period of time;
  
  generating a second input audio signal using the at least one microphone based at least in part on sound captured by the at least one microphone during at least a portion of the first period of time;
  
  sending, to one or more remote computing resources and based at least in part on the detecting of the predefined utterance, the input audio signal for identifying the voice command in the input audio signal; and
  
  causing the at least one speaker to at least one of:
  
  output the first content for a second period of time that is after the first period of time; or
  
  output second content for the second period of time, wherein the second content is different from the first content.
- - 9. The method as recited in claim 8, wherein the altering the output of the first content comprises lowering a volume at which the at least one speaker outputs the first content during the first period of time.
  - 10. The method as recited in claim 8, wherein the altering the output of the first content comprises switching from outputting the first content in stereo to outputting the first content in mono for the first period of time.
  - 11. The method as recited in claim 8, wherein the altering the output of the first content comprises stopping outputting the first content for the first period of time.
  - 12. The method as recited in claim 11, wherein stopping outputting the first content comprises causing the at least one speaker to power off during the first period of time.
  - 13. The method as recited in claim 8, further comprising:
    - determining a physical position of a user relative to the at least one speaker, the user being a source of the predefined utterance; and
      
      determining how to alter output of the first content based at least in part on the physical position.
  - 14. The method as recited in claim 8, further comprising:
    - determining an identity of a user that provided the predefined utterance; and
      
      determining how to alter output of the first content based at least in part on the identity of the user.

15. Computer-readable media storing computer-executable instructions that, when executed on one or more processors of an apparatus, cause the one or more processors to:
- cause at least one speaker of a device to output first content, the device comprising the at least one speaker and at least one microphone;
  
  generate a first input audio signal using the at least one microphone based at least in part on sound captured by the at least one microphone;
  
  detect a predefined utterance within the first input audio signal, the predefined utterance indicating that a voice command is going to be provided;
  
  determine that the first content comprises one of a first content type or a second content type;
  
  alter, based at least in part on the detecting of the predefined utterance and based at least in part on the first content comprising the first content type, output of the first content by the at least one speaker in a first manner for a first period of time;
  
  alter, based at least in part on the detecting of the predefined utterance and based at least in part on the first content comprising the second content type, output of the first content by the at least one speaker in a second manner for the first period of time;
  
  generate a second input audio signal using the at least one microphone based at least in part on sound captured by the at least one microphone during at least a portion of the first period of time;
  
  send, to one or more remote computing resources and based at least in part on the detecting of the predefined utterance, the input audio signal for identifying the voice command in the input audio signal; and
  
  cause the at least one speaker to at least one of:
  
  output the first content for a second period of time that is after the first period of time; or
  
  output second content for the second period of time, wherein the second content is different from the first content.
- - 16. The computer-readable media as recited in claim 15, wherein the altering output of the first content in the first manner or the altering output of the first content in the second manner comprises lowering a volume at which the at least one speaker outputs the first content during the first period of time.
  - 17. The computer-readable media as recited in claim 15, wherein the altering the output of the first content in the first manner or the altering output of the first content in the second manner comprises switching from outputting the first content in stereo to outputting the first content in mono for the first period of time.
  - 18. The computer-readable media as recited in claim 15, wherein the altering output of the first content in the first manner or the altering output of the first content in the second manner comprises stopping outputting the first content for the first period of time.
  - 19. The computer-readable media as recited in claim 15, wherein the computer-executable instructions further cause the one or more processors to:
    - determine a physical position of a user relative to the apparatus, the user being a source of the predefined utterance; and
      
      determine how to alter output of the first content based at least in part on the physical position.
  - 20. The computer-readable media as recited in claim 15, wherein the computer-executable instructions further cause the one or more processors to:
    - determine an identity of a user that provided the predefined utterance;
      
      determine a user profile associated with the user; and
      
      determine how to alter output of the first content based at least in part on the user profile.
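
Claims 15 and 20 describe choosing how to alter the output based on the content type and on a user profile. One possible reading is sketched below. The content types, the manner names, the `UserProfile` fields, and the rule that a profile preference overrides the per-type default are all assumptions made for illustration. The claims do not specify any of them.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical mapping from content type to alteration manner.
DEFAULT_MANNER: Dict[str, str] = {
    "music": "attenuate",  # assumed: keep music audible but quieter
    "audiobook": "pause",  # assumed: avoid missing spoken content
}


@dataclass
class UserProfile:
    name: str
    preferred_manner: Optional[str] = None  # e.g. "pause", "attenuate", "mono"


def choose_alteration(content_type: str,
                      profile: Optional[UserProfile] = None) -> str:
    """Pick an alteration manner from the content type and an optional profile."""
    if profile is not None and profile.preferred_manner is not None:
        # Assumed policy: an explicit user preference wins over the default.
        return profile.preferred_manner
    # Fall back to attenuation for content types the table does not list.
    return DEFAULT_MANNER.get(content_type, "attenuate")
```

For example, `choose_alteration("audiobook")` yields `"pause"`, while the same call with a profile whose `preferred_manner` is `"mono"` yields `"mono"`.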

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Hart, Gregory Michael; Worley, III, William Spencer
Primary Examiner(s): Paul, Disler
Application Number: US15/918,608
Publication Number: US 20180204574A1
Time in Patent Office: 491 Days
Field of Search: 381/86, 381/110, 381/104, 381/107, 381/56-58

CPC Class Codes

G10L 15/20   Speech recognition techniqu...

G10L 15/22   Procedures used during a sp...

G10L 15/26   Speech to text systems G10L...

G10L 17/00   Speaker identification or v...

G10L 2015/223   Execution procedure of a sp...

G11B 27/005   Reproducing at a different ...

H03G 3/32   the control being dependent...

H03G 5/02   Manually-operated control

H04R 3/12   for distributing signals to...
