Context-driven device arbitration
First Claim
1. A system comprising:
one or more processors;
computer-readable media storing computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
receiving, from a first speech interface device, a first audio signal representing a speech utterance of a user captured by a first microphone associated with the first speech interface device;
receiving, from the first speech interface device, first metadata associated with the first speech interface device, wherein the first metadata indicates a first device state of the first speech interface device;
receiving, from a second speech interface device, a second audio signal representing the speech utterance of the user captured by a second microphone associated with the second speech interface device;
receiving, from the second speech interface device, second metadata associated with the second speech interface device, wherein the second metadata indicates a second device state of the second speech interface device;
determining, from the first device state and the second device state, a first confidence score for the first speech interface device, wherein the first confidence score represents a first likelihood that the first speech interface device is to perform an action responsive to the speech utterance;
determining, from the first device state and the second device state, a second confidence score for the second speech interface device, wherein the second confidence score represents a second likelihood that the second speech interface device is to perform the action responsive to the speech utterance;
determining, based at least in part on one of the first confidence score or the second confidence score, that the first speech interface device is to perform the action responsive to the speech utterance;
generating response data representing the action responsive to the speech utterance; and
sending, to the first speech interface device, the response data.
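The scoring and selection steps in claim 1 can be sketched in code. This is a minimal illustration, not the patent's implementation: the state names, weights, and the normalization rule are assumptions chosen only to show how confidence scores derived from both device states could drive the final determination.

```python
# Hypothetical sketch of the claimed arbitration step: each device reports
# its state as metadata, states map to weights, and each device's confidence
# score is derived from BOTH device states (here, by normalizing against the
# sum over all reporting devices). Names and weights are illustrative.

from dataclasses import dataclass

# Illustrative assumption: a device already in a dialog with the user is the
# most likely target of a follow-up utterance.
STATE_WEIGHTS = {
    "in_dialog": 0.9,
    "playing_audio": 0.7,
    "idle": 0.4,
}

@dataclass
class DeviceReport:
    device_id: str
    audio_signal: bytes   # audio signal representing the speech utterance
    device_state: str     # metadata: current device state

def confidence_score(report: DeviceReport, all_reports: list) -> float:
    """Likelihood that this device is to perform the action, given all states."""
    total = sum(STATE_WEIGHTS.get(r.device_state, 0.1) for r in all_reports)
    return STATE_WEIGHTS.get(report.device_state, 0.1) / total

def arbitrate(reports: list) -> str:
    """Select the device with the highest confidence score to perform the action."""
    return max(reports, key=lambda r: confidence_score(r, reports)).device_id

reports = [
    DeviceReport("kitchen", b"...", "idle"),
    DeviceReport("living_room", b"...", "in_dialog"),
]
print(arbitrate(reports))  # the in-dialog device wins: living_room
```

The response data (claim 1's final steps) would then be generated for, and sent only to, the selected device.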
Abstract
This disclosure describes, in part, context-driven device arbitration techniques to select a speech interface device from multiple speech interface devices to provide a response to a command included in a speech utterance of a user. In some examples, the context-driven arbitration techniques may include executing multiple pipeline instances to analyze audio signals and device metadata received from each of the multiple speech interface devices which detected the speech utterance. A remote speech processing service may execute the multiple pipeline instances and analyze the audio signals and/or metadata, at various stages of the pipeline instances, to determine which speech interface device is to respond to the speech utterance.
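The abstract's "multiple pipeline instances" idea can be sketched as one pipeline per detecting device, run concurrently, each combining signal analysis with device metadata. The stage logic below (mean byte energy as a stand-in for signal quality, a fixed bonus for an in-dialog state) is an assumption for illustration only.

```python
# Minimal sketch, assuming one pipeline instance per speech interface device
# that detected the utterance. Each instance analyzes the audio signal and
# folds in device metadata; the arbitrator compares the results.

from concurrent.futures import ThreadPoolExecutor

def pipeline_instance(device_id, audio_signal, metadata):
    # Stage 1: crude signal-quality proxy (mean byte value, scaled to [0, 1]).
    energy = sum(audio_signal) / max(len(audio_signal), 1)
    # Stage 2: fold in device-state metadata (illustrative fixed bonus).
    state_bonus = 0.5 if metadata.get("state") == "in_dialog" else 0.0
    return device_id, energy / 255 + state_bonus

def arbitrate(detections):
    """Run one pipeline instance per device and return the best-scoring device."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda d: pipeline_instance(*d), detections)
        return max(results, key=lambda r: r[1])[0]

detections = [
    ("kitchen", bytes([40, 42, 41]), {"state": "idle"}),
    ("living_room", bytes([90, 95, 92]), {"state": "in_dialog"}),
]
print(arbitrate(detections))  # living_room
```

In the disclosure, these comparisons can happen at various stages of the pipelines rather than only at the end, so a clear winner can be chosen early and the other instances abandoned.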
20 Claims
1. A system comprising:
one or more processors;
computer-readable media storing computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
receiving, from a first speech interface device, a first audio signal representing a speech utterance of a user captured by a first microphone associated with the first speech interface device;
receiving, from the first speech interface device, first metadata associated with the first speech interface device, wherein the first metadata indicates a first device state of the first speech interface device;
receiving, from a second speech interface device, a second audio signal representing the speech utterance of the user captured by a second microphone associated with the second speech interface device;
receiving, from the second speech interface device, second metadata associated with the second speech interface device, wherein the second metadata indicates a second device state of the second speech interface device;
determining, from the first device state and the second device state, a first confidence score for the first speech interface device, wherein the first confidence score represents a first likelihood that the first speech interface device is to perform an action responsive to the speech utterance;
determining, from the first device state and the second device state, a second confidence score for the second speech interface device, wherein the second confidence score represents a second likelihood that the second speech interface device is to perform the action responsive to the speech utterance;
determining, based at least in part on one of the first confidence score or the second confidence score, that the first speech interface device is to perform the action responsive to the speech utterance;
generating response data representing the action responsive to the speech utterance; and
sending, to the first speech interface device, the response data.
Dependent claims: 2, 3, 4
5. A method comprising:
receiving, at a remote speech processing system and from a first device, first audio data representing speech;
receiving, at the remote speech processing system and from a second device, second audio data representing the speech;
receiving, from the first device, first metadata;
receiving, from the second device, second metadata;
determining a speechlet to generate a response to the speech;
sending, to the speechlet, a first device identifier of the first device;
sending, to the speechlet, a second device identifier of the second device;
sending the first metadata to the speechlet;
sending the second metadata to the speechlet;
receiving, from the speechlet, the first device identifier indicating the first device was selected by the speechlet to perform the response;
receiving, from the speechlet, response data corresponding to the response to the speech; and
sending, from the remote speech processing system and to the first device, the response data to cause the first device to perform the response.
Dependent claims: 6, 7, 8, 9, 10, 11, 12, 13, 14
15. A system comprising:
one or more processors;
computer-readable media storing computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
receiving data representing an intent of a speech utterance captured by a client device;
receiving a first device identifier indicating a first device that generated first audio data representing the speech utterance;
receiving a second device identifier indicating a second device that generated second audio data representing the speech utterance;
receiving first data indicating a first likelihood that the first device is to perform an action responsive to the speech utterance;
receiving second data indicating a second likelihood that the second device is to perform the action responsive to the speech utterance;
determining a first device state for the first device;
determining a second device state for the second device; and
determining, based at least in part on the first device state and at least one of the first data or the second data, that the first device is to perform the action responsive to the speech utterance.
Dependent claims: 16, 17, 18, 19, 20
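Claim 15's final determination combines a likelihood received for each device with that device's current state. One way to sketch that combination is a state-dependent multiplier applied to each received likelihood; the multiplier values and state names below are assumptions for illustration, not the patent's method.

```python
# Minimal sketch: the "first data"/"second data" are prior likelihoods that
# each device is to perform the action; device state then scales them, so a
# device in a suppressed state can lose even with the stronger prior.
# Multipliers are illustrative assumptions.

STATE_MULTIPLIER = {
    "in_dialog": 1.5,       # already interacting with the user
    "idle": 1.0,
    "do_not_disturb": 0.2,  # state argues against selecting this device
}

def select_device(candidates):
    """candidates: list of (device_id, likelihood, device_state)."""
    def combined(candidate):
        _, likelihood, state = candidate
        return likelihood * STATE_MULTIPLIER.get(state, 1.0)
    return max(candidates, key=combined)[0]

candidates = [
    ("bedroom", 0.8, "do_not_disturb"),  # higher prior, but muted state
    ("office", 0.6, "in_dialog"),
]
print(select_device(candidates))  # office
```

This mirrors the claim's hedged dependency: the determination is "based at least in part on" the device state and at least one of the received likelihoods, so other signals could also enter the combination.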
Specification