Connected accessory for a voice-controlled device
First Claim
1. A method comprising:
receiving, from a voice-controlled device in an environment that includes the voice-controlled device and an accessory device, an indication that the voice-controlled device has established a wireless connection with the accessory device;
storing the indication in a user profile associated with the voice-controlled device;
receiving, from the voice-controlled device, audio data generated based at least in part on sound captured by a microphone of the voice-controlled device;
receiving a device identifier from the voice-controlled device;
accessing the user profile based at least in part on the device identifier;
determining, based at least in part on the indication in the user profile, that the accessory device is present in the environment;
determining to identify multiple domains of a natural language understanding (NLU) system based at least in part on determining that the accessory device is present in the environment;
generating, by performing automatic speech recognition (ASR) on the audio data, text data corresponding to the audio data;
sending the text data to the multiple domains of the NLU system;
identifying a first intent associated with a first domain of the multiple domains;
identifying a second intent associated with a second domain of the multiple domains;
identifying a named entity within the text data;
sending, to the voice-controlled device: first information about a first storage location where audio content associated with the named entity is stored, and a first instruction corresponding to the first intent;
sending, to the accessory device in the environment: second information about a second storage location where control information associated with the audio content is stored, the control information comprising at least viseme information, the viseme information comprising a series of timestamped mouth movement instructions, and a second instruction corresponding to the second intent;
at a first time based at least in part on the first instruction, initiating output of the audio content via a speaker of the voice-controlled device; and
at a second time based at least in part on the second instruction, operating a movable mouth of the accessory device or presenting mouth-related animations on a display of the accessory device according to the viseme information.
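The steps of claim 1 can be sketched as a single request handler: the stored indication of the accessory connection gates whether multiple NLU domains are consulted, and two messages are dispatched, one per device. This is a minimal illustrative sketch; every name in it (`run_asr`, `nlu_intent`, the example URLs, the profile store) is an assumption for illustration, not anything recited in the patent:

```python
from dataclasses import dataclass

def run_asr(audio_data: bytes) -> str:
    # Stand-in for a real ASR model ("performing automatic speech recognition").
    return "play the alphabet song"

def nlu_intent(domain: str, text: str) -> str:
    # Stand-in for per-domain NLU: each consulted domain yields its own intent.
    return {"music": "PlayMusicIntent", "accessory": "LipSyncIntent"}[domain]

def extract_named_entity(text: str) -> str:
    # Stand-in for named-entity recognition on the text data.
    return "the-alphabet-song"

@dataclass
class UserProfile:
    device_id: str
    accessory_connected: bool = False  # the stored "indication" of the wireless connection

PROFILES = {"echo-123": UserProfile("echo-123", accessory_connected=True)}

def handle_utterance(device_id: str, audio_data: bytes) -> dict:
    profile = PROFILES[device_id]       # access the user profile by device identifier
    text = run_asr(audio_data)          # generate text data from the audio data
    domains = ["music"]
    if profile.accessory_connected:     # accessory determined present in the environment,
        domains.append("accessory")     # so consult multiple NLU domains
    intents = {d: nlu_intent(d, text) for d in domains}
    entity = extract_named_entity(text)
    messages = {
        "voice_device": {               # first information + first instruction
            "content_url": f"https://content.example/{entity}.mp3",
            "instruction": intents["music"],
        }
    }
    if "accessory" in intents:
        messages["accessory"] = {       # second information + second instruction
            "control_url": f"https://content.example/{entity}.visemes",
            "instruction": intents["accessory"],
        }
    return messages
```

When the accessory is not marked as connected in the profile, only the `music` domain is consulted and only the voice-controlled device receives a message.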
1 Assignment
0 Petitions
Abstract
Coordinated operation of a voice-controlled device and an accessory device in an environment is described. A remote system processes audio data it receives from the voice-controlled device in the environment to identify a first intent associated with a first domain, a second intent associated with a second domain, and a named entity associated with the audio data. The remote system sends, to the voice-controlled device, first information for accessing main content associated with the named entity, and a first instruction corresponding to the first intent. The remote system also sends, to the accessory device, second information for accessing control information or supplemental content associated with the main content, and a second instruction corresponding to the second intent. The first and second instructions, when processed by the devices in the environment, cause coordinated operation of the voice-controlled device and the accessory device.
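One plausible shape for the control information the abstract describes, a series of timestamped mouth-movement instructions, is a list of `(offset_ms, mouth_shape)` pairs that the accessory samples against the audio playback clock. The tuple layout and shape names below are assumptions for illustration:

```python
# Offsets are measured from the start of audio playback; shape names are invented.
VISEME_TRACK = [
    (0, "closed"),
    (120, "open-wide"),
    (260, "rounded"),
    (400, "closed"),
]

def viseme_at(track, offset_ms):
    """Return the mouth shape active at a given playback offset (ms), so the
    accessory can drive its movable mouth or display in time with the audio."""
    current = track[0][1]
    for ts, shape in track:
        if ts <= offset_ms:
            current = shape
        else:
            break
    return current
```

Because each instruction carries its own timestamp, the accessory only needs the audio start time (the "first time") to stay synchronized, rather than a continuous stream from the voice-controlled device.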
182 Citations
19 Claims
5. A method comprising:
receiving, from a voice-controlled device in an environment, audio data generated based at least in part on sound captured by the voice-controlled device;
generating, by performing automatic speech recognition (ASR) on the audio data, text data corresponding to the audio data;
identifying, based at least in part on the text data, a first intent associated with a first domain of multiple domains of a natural language understanding (NLU) system;
identifying, based at least in part on the text data, a second intent associated with a second domain of the multiple domains, wherein the second intent causes the second domain to generate viseme information configured to cause a lip synch response by a second device;
identifying a named entity within the text data;
sending, to the voice-controlled device, a first instruction corresponding to the first intent, wherein the first instruction causes the voice-controlled device to output audio content at a first time on a speaker of the voice-controlled device; and
sending, to the second device in the environment, a second instruction corresponding to the second intent, wherein the second instruction causes the second device to process the viseme information.
View Dependent Claims (6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
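Claim 5 has the second domain generate the viseme information itself. One way such generation could work, sketched under the assumption of an ARPAbet-style phoneme inventory and a fixed per-phoneme duration (a real system would take phoneme timings from the TTS engine or forced alignment), is a lookup table from phonemes to mouth shapes:

```python
# Hypothetical phoneme-to-viseme table; both the phoneme labels and the
# shape names are illustrative assumptions.
PHONEME_TO_VISEME = {
    "AA": "open-wide", "AE": "open-wide",
    "OW": "rounded", "UW": "rounded",
    "M": "closed", "B": "closed", "P": "closed",
    "F": "teeth-on-lip", "V": "teeth-on-lip",
}

def generate_visemes(phonemes, ms_per_phoneme=90):
    """Produce timestamped mouth-movement instructions from a phoneme sequence."""
    track = []
    for i, ph in enumerate(phonemes):
        shape = PHONEME_TO_VISEME.get(ph, "neutral")
        if not track or track[-1][1] != shape:  # collapse runs of the same shape
            track.append((i * ms_per_phoneme, shape))
    return track
```

Collapsing consecutive identical shapes keeps the instruction series short, which matters if the accessory receives it over a low-bandwidth link.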
16. A system comprising:
at least one processor; and
memory storing computer-executable instructions that, when executed by the at least one processor, cause the at least one processor to:
receive, from a voice-controlled device in an environment, audio data generated based at least in part on sound captured by the voice-controlled device;
generate, by performing automatic speech recognition (ASR) on the audio data, text data corresponding to the audio data;
identify, based at least in part on the text data, a first intent associated with a first domain of multiple domains of a natural language understanding (NLU) system;
identify, based at least in part on the text data, a second intent associated with a second domain of the multiple domains, wherein the second intent causes the second domain to generate viseme information configured to cause a lip synch response by a second device;
identify a named entity within the text data;
send, to the voice-controlled device, a first instruction corresponding to the first intent, wherein the first instruction causes the voice-controlled device to output audio content at a first time on a speaker of the voice-controlled device; and
send, to the second device in the environment, a second instruction corresponding to the second intent, wherein the second instruction causes the second device to process the viseme information.
View Dependent Claims (17, 18, 19)
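The "identify a first intent ... identify a second intent" steps imply that the same text data is evaluated independently in each consulted domain. A toy keyword-overlap recognizer illustrates that per-domain structure; real NLU domains would use trained models, and the domain names, intent names, and keyword sets here are all invented for the sketch:

```python
# Each domain maps candidate intents to keyword sets (illustrative only).
DOMAIN_INTENTS = {
    "music": {"PlayMusicIntent": {"play", "song", "music"}},
    "accessory": {"LipSyncIntent": {"sing", "play", "song"}},
}

def recognize(text, domains):
    """Identify the best-scoring intent in each consulted domain."""
    tokens = set(text.lower().split())
    results = {}
    for domain in domains:
        best, best_score = None, 0
        for intent, keywords in DOMAIN_INTENTS[domain].items():
            score = len(tokens & keywords)
            if score > best_score:
                best, best_score = intent, score
        if best is not None:
            results[domain] = best
    return results
```

A single utterance such as "play the alphabet song" can thus yield two different intents, one per domain, which is what lets the system send distinct instructions to the two devices.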
Specification