Device selection for providing a response

US 9,875,081 B2
Filed: 09/21/2015
Issued: 01/23/2018
Est. Priority Date: 09/21/2015
Status: Active Grant

Abstract

A system may use multiple speech interface devices to interact with a user by speech. All or a portion of the speech interface devices may detect a user utterance and may initiate speech processing to determine a meaning or intent of the utterance. Within the speech processing, arbitration is employed to select one of the multiple speech interface devices to respond to the user utterance. Arbitration may be based in part on metadata that directly or indirectly indicates the proximity of the user to the devices, and the device that is deemed to be nearest the user may be selected to respond to the user utterance.

19 Claims

1. A system, comprising:
- a first speech processing pipeline instance that receives a first audio signal from a first speech interface device, the first audio signal representing a speech utterance, the first speech processing pipeline instance also receiving a first timestamp indicating a first time at which a wakeword was detected by the first speech interface device;
  
  a second speech processing pipeline instance that receives a second audio signal from a second speech interface device, the second audio signal representing the speech utterance, the second speech processing pipeline also receiving a second timestamp indicating a second time at which the wakeword was detected by the second speech interface device;
  
  the first speech processing pipeline instance having a series of processing components comprising:
  
  an automatic speech recognition (ASR) component configured to analyze the first audio signal to determine words of the speech utterance;
  
  a natural language understanding (NLU) component positioned in the first speech processing pipeline instance after the ASR component, the NLU component being configured to analyze the words of the speech utterance to determine an intent expressed by the speech utterance;
  
  a response dispatcher positioned in the first speech processing pipeline instance after the NLU component, the response dispatcher being configured to specify a speech response to the speech utterance;
  
  a first source arbiter positioned in the first speech processing pipeline instance before the ASR component, the first source arbiter being configured to determine (a) that an amount of time represented by a difference between the first timestamp and the second timestamp is less than a threshold;
  
  (b) to determine that the first timestamp is greater than the second timestamp; and
  
  (c) to abort the first speech processing pipeline instance.
  - 2. The system of claim 1, wherein:
    - the first speech processing pipeline instance receives the first audio signal subsequent to the ASR component analyzing the first audio signal; and
      
      the series of processing components comprises a second source arbiter positioned in the first speech processing pipeline instance after the ASR component, the second source arbiter being configured (a) to determine that the amount of time represented by the difference between the first timestamp and the second timestamp is less than the threshold;
      
      (b) to determine that the first timestamp is greater than the second timestamp; and
      
      (c) to abort the first speech processing pipeline instance.
  - 3. The system of claim 1, the system being configured to send, to the first speech interface device, an indication that the first speech interface device will not respond to the utterance.
  - 4. The system of claim 3, wherein the indication includes data causing the first speech interface device to stop providing the first audio signal to the first speech processing pipeline instance and to enter a listening mode in which the first speech interface device detects a further utterance of the wakeword.
  - 5. The system of claim 1, wherein:
    - the first speech processing pipeline instance also receives a first signal attribute of the first audio signal, wherein the first signal attribute indicates one or more of:
      
      a level of voice presence detected in the first audio signal;
      
      a confidence with which a wakeword was detected by the first speech interface device;
      
      an amplitude of the first audio signal;
      
      a signal-to-noise measurement of the first audio signal;
      
      or a distance of a user from the first speech interface device;
      
      the second speech processing pipeline instance also receives a second signal attribute of the second audio signal, wherein the second signal attribute indicates one or more of:
      
      a level of voice presence detected in the second audio signal;
      
      a confidence with which the wakeword was detected by the second speech interface device;
      
      an amplitude of the second audio signal;
      
      a second signal-to-noise measurement of the second audio signal;
      
      or a distance of the user from the second speech interface device; and
      
      the first source arbiter is further configured to compare the first signal attribute to the second signal attribute to (a) determine that the user is more proximate the second user interface device than the first user interface device and (b) abort the first speech processing pipeline instance.
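
The timestamp test that claim 1 assigns to the first source arbiter, elements (a) through (c), can be illustrated with a minimal Python sketch. The class name `PipelineInput`, the function `should_abort`, and the 0.5 second threshold are assumptions made for illustration; the claim does not specify data structures or a threshold value.

```python
from dataclasses import dataclass


@dataclass
class PipelineInput:
    """Metadata received by one pipeline instance (hypothetical structure)."""
    device_id: str
    wakeword_ts: float  # seconds; time at which the device detected the wakeword


def should_abort(own: PipelineInput, other: PipelineInput, threshold_s: float) -> bool:
    """Return True if the pipeline instance handling `own` should abort.

    (a) the two wakeword detections are less than `threshold_s` apart, so they
        are treated as the same utterance;
    (b) `own` detected the wakeword later than `other`;
    (c) when both hold, the caller aborts this pipeline instance.
    """
    close_in_time = abs(own.wakeword_ts - other.wakeword_ts) < threshold_s  # (a)
    detected_later = own.wakeword_ts > other.wakeword_ts                    # (b)
    return close_in_time and detected_later                                 # (c)


# Illustrative values only: the kitchen device hears the wakeword 70 ms later.
kitchen = PipelineInput("kitchen", 10.120)
living_room = PipelineInput("living-room", 10.050)
print(should_abort(kitchen, living_room, threshold_s=0.5))      # True
print(should_abort(living_room, kitchen, threshold_s=0.5))      # False
```

In this sketch only the later-detecting pipeline aborts, so exactly one of the two instances continues on to the ASR component.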

6. A method, comprising:
- receiving, by a first speech processing pipeline, a first digital audio signal produced by a first device after the first device detects a wakeword, the first speech processing pipeline including a first series of speech processing components that process the first digital audio signal;
  
  receiving, by a second speech processing pipeline, a second digital audio signal produced by a second device after the second device detects the wakeword, the second speech processing pipeline including a second series of speech processing components that process the second digital audio signal;
  
  receiving one or more first attributes associated with the first digital audio signal;
  
  receiving one or more second attributes associated with the second digital audio signal;
  
  determining that the first digital audio signal represents an utterance;
  
  determining that the second digital audio signal represents the utterance based at least in part on the first digital audio signal being received within a threshold amount of time of the second digital audio signal being received;
  
  determining, based at least in part on the one or more first attributes and the one or more second attributes, that the first speech processing pipeline will process the first digital audio signal;
  
  determining, based at least in part on the one or more first attributes and the one or more second attributes, that the first device will respond to the utterance; and
  
  sending, to the first device, audio data representing a speech response to the utterance.
  - 7. The method of claim 6, further comprising sending, to the first device, data that specifies speech to be produced by the first device.
  - 8. The method of claim 7, further comprising sending, to the second device, data including an instruction that results in the first device entering a listening mode.
  - 9. The method of claim 6, further comprising receiving configuration information indicating an association between the first device and the second device.
  - 10. The method of claim 6, further comprising determining that the first device and the second device are associated with a user account.
  - 11. The method of claim 6, further comprising:
    - performing automatic speech recognition (ASR) on the first digital audio signal to determine one or more words of the utterance;
      
      performing natural language understanding (NLU) on the one or more words of the utterance to determine an intent expressed by the utterance.
  - 12. The method of claim 6, wherein receiving the one or more first attributes comprises receiving a proximity of a user relative to the first device.
  - 13. The method of claim 6, wherein determining that the first device will respond to the utterance comprises one or more of:
    - determining which of the first digital audio signal and the second digital audio signal has a higher amplitude;
      
      determining which of the first device and the second device detects a higher level of voice presence;
      
      determining which of the first digital audio signal and the second digital audio signal has a higher signal-to-noise measurement;
      
      determining which of the first device and the second device detects a trigger expression with a higher level of confidence;
      
      determining which of the first device and the second device first detects the trigger expression;
      
      determining which of the first device and the second device has a capability;
      
      determining within which of the first digital audio signal and the second digital audio signal words are recognized with a higher level of confidence;
      
      or determining within which of the first digital audio signal and the second digital audio signal an intent expressed by the words is determined with a higher level of confidence.
  - 14. The method of claim 6, wherein determining that the first device will respond to the utterance comprises determining that a first time associated by the first device with the utterance is prior to second time associated by the second device with the utterance.
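
As a rough illustration of claims 6 and 13, the sketch below first checks that two signals were received within a threshold amount of time of each other and then picks a responding device by comparing attributes. Everything named here is hypothetical: the `SignalAttributes` fields, the 0.5 second default, and the priority order (signal-to-noise, then amplitude, then wakeword confidence) are assumptions, since the claims list the criteria without ranking them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SignalAttributes:
    """Attributes accompanying one digital audio signal (illustrative fields)."""
    device_id: str
    received_at: float          # seconds; when the pipeline received the signal
    amplitude: float
    snr_db: float
    wakeword_confidence: float


def is_same_utterance(a: SignalAttributes, b: SignalAttributes, threshold_s: float) -> bool:
    """Treat the signals as one utterance if they arrived within the threshold."""
    return abs(a.received_at - b.received_at) <= threshold_s


def choose_responder(a: SignalAttributes, b: SignalAttributes,
                     threshold_s: float = 0.5) -> Optional[str]:
    """Return the device_id selected to respond, or None if no arbitration applies."""
    if not is_same_utterance(a, b, threshold_s):
        return None  # different utterances: each device is handled on its own

    # Assumed priority: higher SNR wins, then amplitude, then wakeword confidence.
    def rank(s: SignalAttributes):
        return (s.snr_db, s.amplitude, s.wakeword_confidence)

    return max((a, b), key=rank).device_id


bedroom = SignalAttributes("bedroom", 3.00, amplitude=0.4, snr_db=12.0, wakeword_confidence=0.8)
hall = SignalAttributes("hall", 3.10, amplitude=0.7, snr_db=18.5, wakeword_confidence=0.9)
print(choose_responder(bedroom, hall))  # hall
```

A real system could weight or combine these criteria differently; the tuple comparison is just one simple way to express "one or more of" the listed determinations.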

15. A system, comprising:
- one or more processors;
  
  one or more non-transitory computer-readable media storing computer-executable instructions that, when executed on the one or more processors, cause the one or more processors to perform actions comprising:
  
  receiving, by a first speech processing pipeline, a first digital audio signal produced by a first device after the first device detects a wakeword, the first speech processing pipeline including a first series of speech processing components that process the first digital audio signal;
  
  receiving, by a second speech processing pipeline, a second digital audio signal produced by a second device after the second device detects the wakeword, the second speech processing pipeline includes a second series of speech processing components that process the second digital audio signal;
  
  receiving a first attribute associated with the first digital audio signal;
  
  receiving a second attributed associated with the second digital audio signal;
  
  determining that the first digital audio signal represents an utterance;
  
  determining that the second digital audio signal represents the utterance based at least in part on the first digital audio signal being received within a threshold amount of time of the second digital audio signal being received;
  
  determining, based at least in part on the one or more first attributes and the one or more second attributes, that the first speech processing pipeline will process the first digital audio signal;
  
  determining, based at least in part on the first attribute and the second attribute, that the first device will respond to the utterance; and
  
  sending, to the first device, audio data representing a speech response to the utterance.
  - 16. The system of claim 15, wherein determining that the second digital audio signal represents the utterance comprises calculating a cross-correlation between the first digital audio signal and the second digital audio signal.
  - 17. The system of claim 15, wherein determining that the second digital audio signal represents the utterance comprises determining that the first digital audio signal and the second digital audio signal represent matching sequences of words.
  - 18. The system of claim 15, wherein determining that the second digital audio signal represents the utterance comprises:
    - determining that the first digital audio signal represents first user speech;
      
      determining that the second digital audio signal represents second user speech; and
      
      determining that the first user speech and the second user speech correspond to a common intent.
  - 19. The system of claim 15, wherein the determining that the first device will respond to the utterance comprises one or more of:
    - determining which of the first device and the second device is physically nearer a user;
      
      determining which of the first digital audio signal and the second digital audio signal has a higher signal amplitude;
      
      determining which of the first digital audio signal and the second digital audio signal has a higher signal-to-noise measurement;
      
      determining which of the first digital audio signal and the second digital audio signal represents a higher level of voice presence;
      
      determining which of the first device and the second device first receives a response to the utterance; and
      
      determining which of the first and second devices first receives the utterance.
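
Claim 16 recites calculating a cross-correlation between the two digital audio signals to decide whether they represent the same utterance. One possible reading is sketched below in plain Python: a normalized cross-correlation is evaluated over a small range of lags and its peak is compared to a cutoff. The function names, the lag range, and the 0.8 cutoff are assumptions for illustration, not values taken from the patent.

```python
import math
from typing import Sequence


def normalized_xcorr_peak(x: Sequence[float], y: Sequence[float], max_lag: int) -> float:
    """Peak normalized cross-correlation of x and y over lags in [-max_lag, max_lag]."""
    def centered(seq: Sequence[float]) -> list:
        mean = sum(seq) / len(seq)
        return [v - mean for v in seq]

    xc, yc = centered(x), centered(y)
    norm = math.sqrt(sum(v * v for v in xc) * sum(v * v for v in yc))
    if norm == 0.0:
        return 0.0  # a constant (for example silent) signal carries no shape to match

    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        # Sum x[i] * y[i + lag] over the indices where both samples exist.
        total = sum(
            xc[i] * yc[i + lag]
            for i in range(len(xc))
            if 0 <= i + lag < len(yc)
        )
        best = max(best, total / norm)
    return best


def same_utterance(x: Sequence[float], y: Sequence[float],
                   max_lag: int = 4, min_peak: float = 0.8) -> bool:
    """Assumed decision rule: the signals match if the correlation peak is high enough."""
    return normalized_xcorr_peak(x, y, max_lag) >= min_peak


# Toy samples: `delayed` is `first` shifted by two samples; `other` is unrelated.
first = [0, 0, 1, 3, 1, -2, -1, 0, 0, 0]
delayed = [0, 0, 0, 0, 1, 3, 1, -2, -1, 0]
other = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]
print(same_utterance(first, delayed))  # True
print(same_utterance(first, other))    # False
```

Searching over lags matters because two devices at different distances from the user capture the same speech with a small time offset.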

Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors
Meyers, James David; Pravinchandra, Shah Samir; Liu, Yue; Dean, Arlen; Miller, Daniel; Mandal, Arindam
Primary Examiner(s)
Sharma, Neeraj

Application Number
US14/860,400
Publication Number
US 20170083285A1
Time in Patent Office
855 Days
Field of Search
None
US Class Current
CPC Class Codes

G06F 3/167   Audio in a user interface, ...

G10L 15/00   Speech recognition G10L17/0...

G10L 15/063   Training

G10L 15/1815   Semantic context, e.g. disa...

G10L 15/22   Procedures used during a sp...

G10L 15/222   Barge in, i.e. overridable ...

G10L 15/26   Speech to text systems G10L...

G10L 15/32   Multiple recognisers used i...

G10L 2015/088   Word spotting

G10L 2015/223   Execution procedure of a sp...

G10L 2015/226   using non-speech characteri...
