Context-driven device arbitration
First Claim
1. A system comprising:
one or more processors;
computer-readable media storing computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
receiving, from a first speech interface device, a first audio signal representing a speech utterance of a user captured by a first microphone associated with the first speech interface device;
receiving, from the first speech interface device, first metadata associated with the first speech interface device, wherein the first metadata indicates a first device state of the first speech interface device;
receiving, from a second speech interface device, a second audio signal representing the speech utterance of the user captured by a second microphone associated with the second speech interface device;
receiving, from the second speech interface device, second metadata associated with the second speech interface device, wherein the second metadata indicates a second device state of the second speech interface device;
determining, from the first device state and the second device state, a first confidence score for the first speech interface device, wherein the first confidence score represents a first likelihood that the first speech interface device is to perform an action responsive to the speech utterance;
determining, from the first device state and the second device state, a second confidence score for the second speech interface device, wherein the second confidence score represents a second likelihood that the second speech interface device is to perform the action responsive to the speech utterance;
determining, based at least in part on one of the first confidence score or the second confidence score, that the first speech interface device is to perform the action responsive to the speech utterance;
generating response data representing the action responsive to the speech utterance; and
sending, to the first speech interface device, the response data.
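The scoring and selection steps in claim 1 can be sketched in code. This is a minimal illustration, not the patent's implementation: the state names, weights, and the normalization rule are assumptions chosen only to show how confidence scores derived from both device states could drive the final determination.

```python
# Hypothetical sketch of the claimed arbitration step: each device reports
# its state as metadata, states map to weights, and each device's confidence
# score is derived from BOTH device states (here, by normalizing against the
# sum over all reporting devices). Names and weights are illustrative.

from dataclasses import dataclass

# Illustrative assumption: a device already in a dialog with the user is the
# most likely target of a follow-up utterance.
STATE_WEIGHTS = {
    "in_dialog": 0.9,
    "playing_audio": 0.7,
    "idle": 0.4,
}

@dataclass
class DeviceReport:
    device_id: str
    audio_signal: bytes   # audio signal representing the speech utterance
    device_state: str     # metadata: current device state

def confidence_score(report: DeviceReport, all_reports: list) -> float:
    """Likelihood that this device is to perform the action, given all states."""
    total = sum(STATE_WEIGHTS.get(r.device_state, 0.1) for r in all_reports)
    return STATE_WEIGHTS.get(report.device_state, 0.1) / total

def arbitrate(reports: list) -> str:
    """Select the device with the highest confidence score to perform the action."""
    return max(reports, key=lambda r: confidence_score(r, reports)).device_id

reports = [
    DeviceReport("kitchen", b"...", "idle"),
    DeviceReport("living_room", b"...", "in_dialog"),
]
print(arbitrate(reports))  # the in-dialog device wins: living_room
```

The response data (claim 1's final steps) would then be generated for, and sent only to, the selected device.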
Abstract
This disclosure describes, in part, context-driven device arbitration techniques to select a speech interface device from multiple speech interface devices to provide a response to a command included in a speech utterance of a user. In some examples, the context-driven arbitration techniques may include executing multiple pipeline instances to analyze audio signals and device metadata received from each of the multiple speech interface devices which detected the speech utterance. A remote speech processing service may execute the multiple pipeline instances and analyze the audio signals and/or metadata, at various stages of the pipeline instances, to determine which speech interface device is to respond to the speech utterance.
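The abstract's "multiple pipeline instances" idea can be sketched as one pipeline per detecting device, run concurrently, each combining signal analysis with device metadata. The stage logic below (mean byte energy as a stand-in for signal quality, a fixed bonus for an in-dialog state) is an assumption for illustration only.

```python
# Minimal sketch, assuming one pipeline instance per speech interface device
# that detected the utterance. Each instance analyzes the audio signal and
# folds in device metadata; the arbitrator compares the results.

from concurrent.futures import ThreadPoolExecutor

def pipeline_instance(device_id, audio_signal, metadata):
    # Stage 1: crude signal-quality proxy (mean byte value, scaled to [0, 1]).
    energy = sum(audio_signal) / max(len(audio_signal), 1)
    # Stage 2: fold in device-state metadata (illustrative fixed bonus).
    state_bonus = 0.5 if metadata.get("state") == "in_dialog" else 0.0
    return device_id, energy / 255 + state_bonus

def arbitrate(detections):
    """Run one pipeline instance per device and return the best-scoring device."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda d: pipeline_instance(*d), detections)
        return max(results, key=lambda r: r[1])[0]

detections = [
    ("kitchen", bytes([40, 42, 41]), {"state": "idle"}),
    ("living_room", bytes([90, 95, 92]), {"state": "in_dialog"}),
]
print(arbitrate(detections))  # living_room
```

In the disclosure, these comparisons can happen at various stages of the pipelines rather than only at the end, so a clear winner can be chosen early and the other instances abandoned.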
20 Claims
1. A system comprising:
one or more processors;
computer-readable media storing computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
receiving, from a first speech interface device, a first audio signal representing a speech utterance of a user captured by a first microphone associated with the first speech interface device;
receiving, from the first speech interface device, first metadata associated with the first speech interface device, wherein the first metadata indicates a first device state of the first speech interface device;
receiving, from a second speech interface device, a second audio signal representing the speech utterance of the user captured by a second microphone associated with the second speech interface device;
receiving, from the second speech interface device, second metadata associated with the second speech interface device, wherein the second metadata indicates a second device state of the second speech interface device;
determining, from the first device state and the second device state, a first confidence score for the first speech interface device, wherein the first confidence score represents a first likelihood that the first speech interface device is to perform an action responsive to the speech utterance;
determining, from the first device state and the second device state, a second confidence score for the second speech interface device, wherein the second confidence score represents a second likelihood that the second speech interface device is to perform the action responsive to the speech utterance;
determining, based at least in part on one of the first confidence score or the second confidence score, that the first speech interface device is to perform the action responsive to the speech utterance;
generating response data representing the action responsive to the speech utterance; and
sending, to the first speech interface device, the response data.
Dependent claims: 2, 3, 4
5. A method comprising:
receiving, at a remote speech processing system and from a first device, first audio data representing speech;
receiving, at the remote speech processing system and from a second device, second audio data representing the speech;
receiving, from the first device, first metadata;
receiving, from the second device, second metadata;
determining a speechlet to generate a response to the speech;
sending, to the speechlet, a first device identifier of the first device;
sending, to the speechlet, a second device identifier of the second device;
sending the first metadata to the speechlet;
sending the second metadata to the speechlet;
receiving, from the speechlet, the first device identifier indicating the first device was selected by the speechlet to perform the response;
receiving, from the speechlet, response data corresponding to the response to the speech; and
sending, from the remote speech processing system and to the first device, the response data to cause the first device to perform the response.
Dependent claims: 6, 7, 8, 9, 10, 11, 12, 13, 14
15. A system comprising:
one or more processors;
computer-readable media storing computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
receiving data representing an intent of a speech utterance captured by a client device;
receiving a first device identifier indicating a first device that generated first audio data representing the speech utterance;
receiving a second device identifier indicating a second device that generated second audio data representing the speech utterance;
receiving first data indicating a first likelihood that the first device is to perform an action responsive to the speech utterance;
receiving second data indicating a second likelihood that the second device is to perform the action responsive to the speech utterance;
determining a first device state for the first device;
determining a second device state for the second device; and
determining, based at least in part on the first device state and at least one of the first data or the second data, that the first device is to perform the action responsive to the speech utterance.
Dependent claims: 16, 17, 18, 19, 20
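Claim 15's final determination combines a likelihood received for each device with that device's current state. One way to sketch that combination is a state-dependent multiplier applied to each received likelihood; the multiplier values and state names below are assumptions for illustration, not the patent's method.

```python
# Minimal sketch: the "first data"/"second data" are prior likelihoods that
# each device is to perform the action; device state then scales them, so a
# device in a suppressed state can lose even with the stronger prior.
# Multipliers are illustrative assumptions.

STATE_MULTIPLIER = {
    "in_dialog": 1.5,       # already interacting with the user
    "idle": 1.0,
    "do_not_disturb": 0.2,  # state argues against selecting this device
}

def select_device(candidates):
    """candidates: list of (device_id, likelihood, device_state)."""
    def combined(candidate):
        _, likelihood, state = candidate
        return likelihood * STATE_MULTIPLIER.get(state, 1.0)
    return max(candidates, key=combined)[0]

candidates = [
    ("bedroom", 0.8, "do_not_disturb"),  # higher prior, but muted state
    ("office", 0.6, "in_dialog"),
]
print(select_device(candidates))  # office
```

This mirrors the claim's hedged dependency: the determination is "based at least in part on" the device state and at least one of the received likelihoods, so other signals could also enter the combination.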
Specification