Configurable speech recognition system using a pronunciation alignment between multiple recognizers
First Claim
1. A method of training an embedded speech recognizer on an electronic device in a distributed speech recognition system comprising the electronic device and a network device having a remote speech recognizer remote from the electronic device, the method comprising:
recognizing, by the embedded speech recognizer, at least a first portion of input audio received by the electronic device to generate a local speech recognition result, wherein the recognizing is performed, at least in part, using a command grammar activated by the embedded speech recognizer in response to recognizing a command;
sending, to the network device, at least a second portion of input audio received by the electronic device;
receiving, from the network device, a remote speech recognition result corresponding to the at least a second portion of the input audio;
performing, at the electronic device, a pronunciation alignment of the local speech recognition result and the remote speech recognition result;
identifying, based on the aligned local and remote speech recognition results, a portion of the remote speech recognition result corresponding to a low-confidence part of the local speech recognition result; and
training the embedded speech recognizer based, at least in part, on the remote speech recognition result, wherein training the embedded speech recognizer comprises adding the identified portion of the remote speech recognition result to the command grammar used by the embedded speech recognizer to recognize the at least a portion of the input audio.
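The training loop in the claim above can be sketched in code. This is an illustrative word-level approximation, not the patent's implementation: the claim aligns at the pronunciation level, while the sketch below aligns word sequences with a standard edit-distance matcher, marks the local words whose confidence falls below a hypothetical threshold, and adds the aligned remote words to the command grammar. All function names, the confidence threshold, and the grammar-as-set representation are assumptions for illustration.

```python
# Sketch of the claimed steps: align local and remote hypotheses, find the
# remote words that correspond to low-confidence local words, and add that
# phrase to the command grammar. Word-level alignment stands in for the
# pronunciation alignment described in the claim.

from difflib import SequenceMatcher


def align_hypotheses(local_words, remote_words):
    """Return aligned (local_word, remote_word) pairs; None marks a gap."""
    sm = SequenceMatcher(a=local_words, b=remote_words)
    pairs = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            pairs.extend(zip(local_words[i1:i2], remote_words[j1:j2]))
        else:
            # Pad the shorter side of a substitution/insertion/deletion
            # so both hypotheses stay in step.
            span = max(i2 - i1, j2 - j1)
            loc = local_words[i1:i2] + [None] * (span - (i2 - i1))
            rem = remote_words[j1:j2] + [None] * (span - (j2 - j1))
            pairs.extend(zip(loc, rem))
    return pairs


def train_grammar(grammar, local_result, remote_words, threshold=0.5):
    """local_result: list of (word, confidence) from the embedded recognizer.

    Adds to `grammar` the remote words aligned with low-confidence local
    words, mirroring the claim's training step.
    """
    low = {w for w, conf in local_result if conf < threshold}
    local_words = [w for w, _ in local_result]
    learned = [r for l, r in align_hypotheses(local_words, remote_words)
               if r is not None and (l is None or l in low)]
    if learned:
        grammar.add(" ".join(learned))
    return grammar
```

For example, if the embedded recognizer hears "call jon" with low confidence on "jon" and the server returns "call john smith", the phrase "john smith" would be added to the grammar so the name can be recognized locally next time.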
3 Assignments
0 Petitions
Abstract
Techniques for combining the results of multiple recognizers in a distributed speech recognition architecture. Speech data input to a client device is encoded and processed both locally and remotely by different recognizers configured to be proficient at different speech recognition tasks. The client/server architecture is configurable to enable network providers to specify a policy directed to a trade-off between reducing recognition latency perceived by a user and usage of network resources. The results of the local and remote speech recognition engines are combined based, at least in part, on logic stored by one or more components of the client/server architecture.
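The configurable trade-off described in the abstract can be sketched as a simple client-side policy: accept the embedded recognizer's result immediately when its confidence is high (minimizing perceived latency), and spend a network round trip on the remote recognizer otherwise. The class names, threshold value, and fields below are hypothetical illustrations, not the patent's actual policy format.

```python
# Sketch of a provider-specified policy trading recognition latency
# against network-resource usage, as described in the abstract.

from dataclasses import dataclass


@dataclass
class Policy:
    min_local_confidence: float = 0.8   # accept the local result above this
    send_to_server: bool = True         # whether network use is allowed at all


def combine_results(policy, local_text, local_conf, remote_fetch):
    """remote_fetch is a callable that contacts the remote recognizer."""
    if local_conf >= policy.min_local_confidence or not policy.send_to_server:
        return local_text      # low latency: skip the server round trip
    return remote_fetch()      # spend network resources for accuracy
```

A network provider could then tune `min_local_confidence` per deployment: a high threshold favors server accuracy at the cost of bandwidth, a low one favors responsiveness.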
202 Citations
21 Claims
1. A method of training an embedded speech recognizer on an electronic device in a distributed speech recognition system comprising the electronic device and a network device having a remote speech recognizer remote from the electronic device, the method comprising:
recognizing, by the embedded speech recognizer, at least a first portion of input audio received by the electronic device to generate a local speech recognition result, wherein the recognizing is performed, at least in part, using a command grammar activated by the embedded speech recognizer in response to recognizing a command;
sending, to the network device, at least a second portion of input audio received by the electronic device;
receiving, from the network device, a remote speech recognition result corresponding to the at least a second portion of the input audio;
performing, at the electronic device, a pronunciation alignment of the local speech recognition result and the remote speech recognition result;
identifying, based on the aligned local and remote speech recognition results, a portion of the remote speech recognition result corresponding to a low-confidence part of the local speech recognition result; and
training the embedded speech recognizer based, at least in part, on the remote speech recognition result, wherein training the embedded speech recognizer comprises adding the identified portion of the remote speech recognition result to the command grammar used by the embedded speech recognizer to recognize the at least a portion of the input audio.
Dependent Claims: 2, 3, 4, 5, 6, 7, 8
9. A non-transitory computer-readable storage medium encoded with a plurality of instructions that, when executed by at least one processor on an electronic device in a distributed speech recognition system comprising the electronic device having an embedded speech recognizer and a network device having a remote speech recognizer remote from the electronic device, perform a method comprising:
recognizing, by the embedded speech recognizer, at least a first portion of input audio received by the electronic device to generate a local speech recognition result, wherein the recognizing is performed, at least in part, using a command grammar activated by the embedded speech recognizer in response to recognizing a command;
sending, to the network device, at least a second portion of input audio received by the electronic device;
receiving, from the network device, a remote speech recognition result corresponding to the at least a second portion of the input audio;
performing, at the electronic device, a pronunciation alignment of the local speech recognition result and the remote speech recognition result;
identifying, based on the aligned local and remote speech recognition results, a portion of the remote speech recognition result corresponding to a low-confidence part of the local speech recognition result; and
training the embedded speech recognizer based, at least in part, on the remote speech recognition result, wherein training the embedded speech recognizer comprises adding the identified portion of the remote speech recognition result to the command grammar used by the embedded speech recognizer to recognize the at least a portion of the input audio.
Dependent Claims: 10, 11, 12, 13, 14, 15
16. An electronic device for use in a distributed speech recognition system comprising the electronic device and a network device remote from the electronic device, the electronic device comprising:
at least one storage device configured to store information associated with input audio spoken by a user of the electronic device;
an embedded speech recognizer configured to recognize at least a first portion of input audio comprising speech to produce a local speech recognition result, wherein the recognizing is performed, at least in part, using a command grammar activated by the embedded speech recognizer in response to recognizing a command; and
at least one processor programmed to:
send, to the network device, at least a second portion of input audio received by the electronic device;
receive, from the network device, a remote speech recognition result corresponding to the at least a second portion of the input audio;
perform a pronunciation alignment of the local speech recognition result and the remote speech recognition result;
identify, based on the aligned local and remote speech recognition results, a portion of the remote speech recognition result corresponding to a low-confidence part of the local speech recognition result; and
train the embedded speech recognizer based, at least in part, on the remote speech recognition result, wherein training the embedded speech recognizer comprises adding the identified portion of the remote speech recognition result to the command grammar used by the embedded speech recognizer to recognize the at least a portion of the input audio.
Dependent Claims: 17, 18, 19, 20, 21
Specification