Local speech recognition of frequent utterances

US 9,070,367 B1
Filed: 11/26/2012
Issued: 06/30/2015
Est. Priority Date: 11/26/2012
Status: Expired due to Fees

Abstract

In a distributed automated speech recognition (ASR) system, speech models may be employed on a local device to allow the local device to process frequently spoken utterances while passing other utterances to a remote device for processing. Upon receiving an audio signal, the local device compares the audio signal to the speech models of the frequently spoken utterances to determine whether the audio signal matches one of the speech models. When the audio signal matches one of the speech models, the local device processes the utterance, for example by executing a command. When the audio signal does not match one of the speech models, the local device transmits the audio signal to a second device for ASR processing. This reduces latency and the amount of audio signals that are sent to the second device for ASR processing.
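
The routing described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the cosine-similarity match against stored template vectors, the `threshold` value, and the names `match_locally`, `handle_utterance`, and `send_to_remote` are all assumptions introduced here.

```python
import math
from typing import Callable, Mapping, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_locally(features: Vector,
                  models: Mapping[str, Vector],
                  threshold: float = 0.9) -> Optional[str]:
    """Return the best-matching stored command, or None if nothing reaches the threshold."""
    best_cmd, best_score = None, threshold
    for cmd, template in models.items():
        score = cosine_similarity(features, template)
        if score >= best_score:
            best_cmd, best_score = cmd, score
    return best_cmd


def handle_utterance(features: Vector,
                     models: Mapping[str, Vector],
                     send_to_remote: Callable[[Vector], str]) -> Tuple[str, str]:
    """Process frequent utterances locally; pass everything else to the remote device."""
    cmd = match_locally(features, models)
    if cmd is not None:
        return ("local", cmd)
    # No local model matched, so the hypothetical remote recognizer handles it.
    return ("remote", send_to_remote(features))
```

With two stored templates, an input close to the "play music" template is handled on the device, while an input that resembles neither template is forwarded.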

20 Claims

1. A system for performing speech recognition comprising a local device and a remote device, the system configured to perform actions comprising:
- receiving a plurality of spoken utterances by a local device during a period of use of the local device;
  
  determining a first frequently spoken utterance and a second frequently spoken utterance from the plurality of spoken utterances, wherein the determining is based on a number of times each of the first frequently spoken utterance and the second frequently spoken utterance were received by the local device during the period of use;
  
  creating a first model for the first frequently spoken utterance and a second model for the second frequently spoken utterance;
  
  receiving a first spoken utterance by the local device;
  
  sending a representation of the first spoken utterance from the local device to a remote device;
  
  determining, by the local device, that the first spoken utterance corresponds to the first frequently spoken utterance, wherein the determining is based at least in part on the first model and the second model;
  
  sending, by the local device, a cancellation request to the remote device in response to determining, by the local device, that the first spoken utterance corresponds to the first frequently spoken utterance, wherein the cancellation request indicates that the remote device need not perform speech recognition on the representation of the first spoken utterance;
  
  performing an action corresponding to the first spoken utterance;
  
  receiving a second spoken utterance by the local device;
  
  determining, by the local device, that the second spoken utterance does not correspond to the first frequently spoken utterance and that the second spoken utterance does not correspond to the second frequently spoken utterance, wherein the determining is based at least in part on the first model and the second model;
  
  sending a representation of the second spoken utterance from the local device to the remote device;
  
  performing speech recognition on the representation of the second spoken utterance by the remote device; and
  
  performing an action corresponding to the second spoken utterance.
- Dependent claims (2, 3, 4):
  - 2. The method of claim 1, wherein determining the first frequently spoken utterance comprises counting a number of instances of each utterance of the plurality of spoken utterances and selecting an utterance with a largest count.
  - 3. The method of claim 1, wherein creating the first model for the first frequently spoken utterance comprises creating a hidden Markov model and wherein determining, by the local device, that the first spoken utterance corresponds to the first frequently spoken utterance comprises computing a score using a Viterbi algorithm.
  - 4. The method of claim 3, wherein performing the action corresponding to the first spoken utterance comprises playing music.
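
Claims 2 and 3 name two concrete techniques: selecting utterances by count, and scoring an utterance against a hidden Markov model with a Viterbi algorithm. The sketch below illustrates both under simplifying assumptions that are not in the claims: utterances are counted by their text, the HMM has discrete observation symbols, and the function names and the `start`/`trans`/`emit` parameter layout are hypothetical.

```python
import math
from collections import Counter


def two_most_frequent(transcripts):
    """Return the two utterances with the largest counts, most frequent first."""
    return [text for text, _ in Counter(transcripts).most_common(2)]


def viterbi_log_score(obs, start, trans, emit):
    """Log-probability of the single best state path for a non-empty `obs`.

    start[s]    : probability of starting in state s
    trans[s][t] : probability of moving from state s to state t
    emit[s][o]  : probability that state s emits observation symbol o
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n = len(start)
    # Initialisation: best score of ending in each state after the first symbol.
    v = [log(start[s]) + log(emit[s][obs[0]]) for s in range(n)]
    # Recursion: extend the best path into each state t, one symbol at a time.
    for o in obs[1:]:
        v = [max(v[s] + log(trans[s][t]) for s in range(n)) + log(emit[t][o])
             for t in range(n)]
    return max(v)
```

In a setup like claim 1, one such model could be kept per frequently spoken utterance, and the device could compare the Viterbi scores of an incoming utterance across the models before deciding whether any of them matches.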

5. A computer-implemented method, comprising:
- receiving a plurality of spoken utterances during a period of use of a local device;
  
  storing, by the local device, a speech model corresponding to a frequently spoken utterance, the frequently spoken utterance comprising one of the plurality of spoken utterances and being determined based on a number of times the frequently spoken utterance was received by the local device during the period of use;
  
  receiving, by the local device, first audio data comprising first speech;
  
  transmitting, by the local device, a representation of the first audio data to a remote device;
  
  determining, by the local device, that the first speech includes the frequently spoken utterance based at least in part on the speech model;
  
  sending, by the local device, a cancellation request to the remote device in response to determining that the first speech includes the frequently spoken utterance, wherein the cancellation request indicates that the remote device need not perform speech recognition on the representation of the first audio data;
  
  receiving, by the local device, second audio data comprising second speech;
  
  determining, by the local device, that the second speech does not include the frequently spoken utterance; and
  
  transmitting, by the local device, a representation of the second audio data to the remote device for processing, wherein the remote device performs speech recognition on the second audio data.
- Dependent claims (6, 7, 8, 9, 10):
  - 6. The method of claim 5, wherein the frequently spoken utterance comprises a command, and wherein the method further comprises executing the command.
  - 7. The method of claim 5, further comprising:
    - receiving speech recognition results from the remote device, wherein the speech recognition results correspond to the representation of the second audio data; and
      
      executing a command, wherein the second speech comprises the command.
  - 8. The method of claim 5, wherein the determining that the first speech includes the frequently spoken utterance comprises comparing a representation of the first audio data to the speech model corresponding to the frequently spoken utterance.
  - 9. The method of claim 5, wherein the representation of the second audio data comprises one of a portion of the second audio data or feature vectors computed from at least a portion of the second audio data.
  - 10. The method of claim 5, further comprising:
    - receiving a second plurality of utterances during a second period of use of the local device; and
      
      updating the speech model based on a number of times a second frequently spoken utterance was received by the local device, wherein the second frequently spoken utterance is one of the second plurality of utterances.
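
The ordering in claim 5 is worth illustrating: the representation is transmitted to the remote device first, and the cancellation request follows only if the local match succeeds. The sketch below simulates this in a single process with `asyncio`, where cancelling a task stands in for the cancellation request. The names `remote_asr` and `recognize`, the `log` list, and the 0.2 s delay are assumptions for illustration; a real system would send the request over a network.

```python
import asyncio


async def remote_asr(representation, log):
    """Stand-in for the remote recognizer; the sleep models network and server time."""
    log.append("remote: started")
    try:
        await asyncio.sleep(0.2)
    except asyncio.CancelledError:
        # The remote side learns it need not perform speech recognition.
        log.append("remote: cancelled")
        raise
    log.append("remote: finished")
    return f"transcript of {representation}"


async def recognize(representation, local_match, log):
    """Send to the remote side first, then cancel if the local model matches."""
    remote = asyncio.create_task(remote_asr(representation, log))
    await asyncio.sleep(0)  # yield once so the remote task actually starts

    command = local_match(representation)
    if command is not None:
        remote.cancel()  # plays the role of the cancellation request
        try:
            await remote
        except asyncio.CancelledError:
            pass
        return ("local", command)

    # No local match: wait for the remote speech recognition result.
    return ("remote", await remote)
```

Sending before matching means an utterance that fails the local match has lost no time, which is one plausible reading of why the claims order the steps this way.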

11. A computing device, comprising:
- at least one processor;
  
  a memory device including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the processor;
  
  to receive a plurality of spoken utterances during a period of use of the device;
  
  to store a speech model corresponding to a frequently spoken utterance, the frequently spoken utterance being one of the plurality of spoken utterances and being determined based on a number of times the frequently spoken utterance was received by the device during the period of use;
  
  to receive first audio data comprising first speech;
  
  to transmit a representation of the first audio data to a remote device;
  
  to determine that the first speech includes the frequently spoken utterance based at least in part on the speech model;
  
  to send a cancellation request to the remote device in response to determining that the first speech includes the frequently spoken utterance, wherein the cancellation request indicates that the remote device need not perform speech recognition on the representation of the first audio data;
  
  to receive second audio data comprising second speech;
  
  to determine that the second speech does not include the frequently spoken utterance; and
  
  to transmit a representation of the second audio data to the remote device for processing, wherein the remote device performs speech recognition on the second audio data.
- Dependent claims (12, 13, 14, 15):
  - 12. The computing device of claim 11, wherein the frequently spoken utterance comprises a command, and wherein the at least one processor is further configured to execute the command.
  - 13. The computing device of claim 11, wherein the at least one processor is further configured:
    - to receive speech recognition results from the remote device, wherein the speech recognition results correspond to the representation of the second audio data; and
      
      to execute a command, wherein the second speech comprises the command.
  - 14. The computing device of claim 11, wherein the at least one processor is further configured to determine that the first speech includes the frequently spoken utterance by comparing a representation of the first audio data to the speech model.
  - 15. The computing device of claim 11, wherein the at least one processor is further configured:
    - to receive a second plurality of utterances during a second period of use of the local device; and
      
      to update the speech model based on a number of times a second frequently spoken utterance was received by the local device, wherein the second frequently spoken utterance is one of the second plurality of utterances.

16. A non-transitory computer-readable storage medium storing processor-executable instructions for controlling a computing device, comprising:
- program code to receive a plurality of spoken utterances during a period of use of the device;
  
  program code to store a speech model corresponding to a frequently spoken utterance, the frequently spoken utterance being one of the plurality of spoken utterances and being determined based on a number of times the frequently spoken utterance was received by the device during the period of use;
  
  program code to receive first audio data comprising first speech;
  
  program code to transmit a representation of the first audio data to a remote device;
  
  program code to determine that the first speech includes the frequently spoken utterance based at least in part on the speech model;
  
  program code to send a cancellation request to the remote device in response to determining that the first speech includes the frequently spoken utterance, wherein the cancellation request indicates that the remote device need not perform speech recognition on the representation of the first audio data;
  
  program code to receive second audio data comprising second speech;
  
  program code to determine that the second speech does not include the frequently spoken utterance; and
  
  program code to transmit a representation of the second audio data to the remote device for processing, wherein the remote device performs speech recognition on the second audio data.
- Dependent claims (17, 18, 19, 20):
  - 17. The non-transitory computer-readable storage medium of claim 16, wherein the frequently spoken utterance comprises a command, and the non-transitory computer-readable storage medium further comprises program code to execute the command.
  - 18. The non-transitory computer-readable storage medium of claim 16, further comprising:
    - program code to receive speech recognition results from the remote device, wherein the speech recognition results correspond to the representation of the second audio data; and
      
      program code to execute a command, wherein the second speech comprises the command.
  - 19. The non-transitory computer-readable storage medium of claim 16, wherein the program code to determine that the first speech includes the frequently spoken utterance includes program code to compare a representation of the first audio data to the speech model.
  - 20. The non-transitory computer-readable storage medium of claim 16, further comprising:
    - program code to receive a second plurality of utterances during a second period of use of the local device; and
      
      program code to update the speech model based on a number of times a second frequently spoken utterance was received by the local device, wherein the second frequently spoken utterance is one of the second plurality of utterances.
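
Claims 10, 15, and 20 describe updating the stored speech model after a second period of use. One possible way to organise that update is sketched below. The claims do not specify a policy, so the `top_k` and `min_count` parameters, the `FrequentUtteranceStore` class, and the `build_model` callback are hypothetical choices made for this example.

```python
from collections import Counter


class FrequentUtteranceStore:
    """Keeps one model per frequently spoken utterance, refreshed once per period."""

    def __init__(self, top_k=2, min_count=2):
        self.top_k = top_k          # assumed cap on the number of local models
        self.min_count = min_count  # assumed minimum count to qualify as frequent
        self.models = {}

    def update(self, period_transcripts, build_model):
        """Recount one period of use and rebuild the set of stored models."""
        counts = Counter(period_transcripts)
        frequent = [text for text, count in counts.most_common(self.top_k)
                    if count >= self.min_count]
        # Keep existing models for utterances that remain frequent, build models
        # for newly frequent ones, and drop models that fell out of the top set.
        self.models = {
            text: self.models[text] if text in self.models else build_model(text)
            for text in frequent
        }
        return frequent
```

Calling `update` once per period keeps the on-device model set small while letting it track changes in what the user says most often.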

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Hoffmeister, Bjorn; O'Neill, Jeffrey
Primary Examiner(s): VO, HUYEN X
Application Number: US13/684,969
Time in Patent Office: 946 Days
Field of Search: 704/1-10, 704/243, 704/235, 704/244, 704/255, 704/256, 704/270, 704/251, 704/231, 704/250, 704/270.1, 704/254
US Class Current: 1/1
CPC Class Codes:
G10L 15/063   Training
G10L 15/187   Phonemic context, e.g. pron...
G10L 15/30   Distributed recognition, e....
