Configurable speech recognition system using a pronunciation alignment between multiple recognizers
First Claim
1. A method of training an embedded speech recognizer on an electronic device in a distributed speech recognition system comprising the electronic device and a network device having a remote speech recognizer remote from the electronic device, the method comprising:
recognizing, by the embedded speech recognizer, at least a first portion of input audio received by the electronic device to generate a local speech recognition result, wherein the recognizing is performed, at least in part, using a command grammar activated by the embedded speech recognizer in response to recognizing a command;
sending, to the network device, at least a second portion of input audio received by the electronic device;
receiving, from the network device, a remote speech recognition result corresponding to the at least a second portion of the input audio;
performing, at the electronic device, a pronunciation alignment of the local speech recognition result and the remote speech recognition result;
identifying, based on the aligned local and remote speech recognition results, a portion of the remote speech recognition result corresponding to a low-confidence part of the local speech recognition result; and
training the embedded speech recognizer based, at least in part, on the remote speech recognition result, wherein training the embedded speech recognizer comprises adding the identified portion of the remote speech recognition result to the command grammar used by the embedded speech recognizer to recognize the at least a portion of the input audio.
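The training loop in the claim above can be sketched in code. This is an illustrative word-level approximation, not the patent's implementation: the claim aligns at the pronunciation level, while the sketch below aligns word sequences with a standard edit-distance matcher, marks the local words whose confidence falls below a hypothetical threshold, and adds the aligned remote words to the command grammar. All function names, the confidence threshold, and the grammar-as-set representation are assumptions for illustration.

```python
# Sketch of the claimed steps: align local and remote hypotheses, find the
# remote words that correspond to low-confidence local words, and add that
# phrase to the command grammar. Word-level alignment stands in for the
# pronunciation alignment described in the claim.

from difflib import SequenceMatcher


def align_hypotheses(local_words, remote_words):
    """Return aligned (local_word, remote_word) pairs; None marks a gap."""
    sm = SequenceMatcher(a=local_words, b=remote_words)
    pairs = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            pairs.extend(zip(local_words[i1:i2], remote_words[j1:j2]))
        else:
            # Pad the shorter side of a substitution/insertion/deletion
            # so both hypotheses stay in step.
            span = max(i2 - i1, j2 - j1)
            loc = local_words[i1:i2] + [None] * (span - (i2 - i1))
            rem = remote_words[j1:j2] + [None] * (span - (j2 - j1))
            pairs.extend(zip(loc, rem))
    return pairs


def train_grammar(grammar, local_result, remote_words, threshold=0.5):
    """local_result: list of (word, confidence) from the embedded recognizer.

    Adds to `grammar` the remote words aligned with low-confidence local
    words, mirroring the claim's training step.
    """
    low = {w for w, conf in local_result if conf < threshold}
    local_words = [w for w, _ in local_result]
    learned = [r for l, r in align_hypotheses(local_words, remote_words)
               if r is not None and (l is None or l in low)]
    if learned:
        grammar.add(" ".join(learned))
    return grammar
```

For example, if the embedded recognizer hears "call jon" with low confidence on "jon" and the server returns "call john smith", the phrase "john smith" would be added to the grammar so the name can be recognized locally next time.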
3 Assignments
0 Petitions
Abstract
Techniques for combining the results of multiple recognizers in a distributed speech recognition architecture. Speech data input to a client device is encoded and processed both locally and remotely by different recognizers configured to be proficient at different speech recognition tasks. The client/server architecture is configurable to enable network providers to specify a policy directed to a trade-off between reducing recognition latency perceived by a user and usage of network resources. The results of the local and remote speech recognition engines are combined based, at least in part, on logic stored by one or more components of the client/server architecture.
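The configurable trade-off described in the abstract can be sketched as a simple client-side policy: accept the embedded recognizer's result immediately when its confidence is high (minimizing perceived latency), and spend a network round trip on the remote recognizer otherwise. The class names, threshold value, and fields below are hypothetical illustrations, not the patent's actual policy format.

```python
# Sketch of a provider-specified policy trading recognition latency
# against network-resource usage, as described in the abstract.

from dataclasses import dataclass


@dataclass
class Policy:
    min_local_confidence: float = 0.8   # accept the local result above this
    send_to_server: bool = True         # whether network use is allowed at all


def combine_results(policy, local_text, local_conf, remote_fetch):
    """remote_fetch is a callable that contacts the remote recognizer."""
    if local_conf >= policy.min_local_confidence or not policy.send_to_server:
        return local_text      # low latency: skip the server round trip
    return remote_fetch()      # spend network resources for accuracy
```

A network provider could then tune `min_local_confidence` per deployment: a high threshold favors server accuracy at the cost of bandwidth, a low one favors responsiveness.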
202 Citations
21 Claims
1. A method of training an embedded speech recognizer on an electronic device in a distributed speech recognition system comprising the electronic device and a network device having a remote speech recognizer remote from the electronic device, the method comprising:
recognizing, by the embedded speech recognizer, at least a first portion of input audio received by the electronic device to generate a local speech recognition result, wherein the recognizing is performed, at least in part, using a command grammar activated by the embedded speech recognizer in response to recognizing a command;
sending, to the network device, at least a second portion of input audio received by the electronic device;
receiving, from the network device, a remote speech recognition result corresponding to the at least a second portion of the input audio;
performing, at the electronic device, a pronunciation alignment of the local speech recognition result and the remote speech recognition result;
identifying, based on the aligned local and remote speech recognition results, a portion of the remote speech recognition result corresponding to a low-confidence part of the local speech recognition result; and
training the embedded speech recognizer based, at least in part, on the remote speech recognition result, wherein training the embedded speech recognizer comprises adding the identified portion of the remote speech recognition result to the command grammar used by the embedded speech recognizer to recognize the at least a portion of the input audio.
Dependent Claims: 2, 3, 4, 5, 6, 7, 8
9. A non-transitory computer-readable storage medium encoded with a plurality of instructions that, when executed by at least one processor on an electronic device in a distributed speech recognition system comprising the electronic device having an embedded speech recognizer and a network device having a remote speech recognizer remote from the electronic device, perform a method comprising:
recognizing, by the embedded speech recognizer, at least a first portion of input audio received by the electronic device to generate a local speech recognition result, wherein the recognizing is performed, at least in part, using a command grammar activated by the embedded speech recognizer in response to recognizing a command;
sending, to the network device, at least a second portion of input audio received by the electronic device;
receiving, from the network device, a remote speech recognition result corresponding to the at least a second portion of the input audio;
performing, at the electronic device, a pronunciation alignment of the local speech recognition result and the remote speech recognition result;
identifying, based on the aligned local and remote speech recognition results, a portion of the remote speech recognition result corresponding to a low-confidence part of the local speech recognition result; and
training the embedded speech recognizer based, at least in part, on the remote speech recognition result, wherein training the embedded speech recognizer comprises adding the identified portion of the remote speech recognition result to the command grammar used by the embedded speech recognizer to recognize the at least a portion of the input audio.
Dependent Claims: 10, 11, 12, 13, 14, 15
16. An electronic device for use in a distributed speech recognition system comprising the electronic device and a network device remote from the electronic device, the electronic device comprising:
at least one storage device configured to store information associated with input audio spoken by a user of the electronic device;
an embedded speech recognizer configured to recognize at least a first portion of input audio comprising speech to produce a local speech recognition result, wherein the recognizing is performed, at least in part, using a command grammar activated by the embedded speech recognizer in response to recognizing a command; and
at least one processor programmed to:
send, to the network device, at least a second portion of input audio received by the electronic device;
receive, from the network device, a remote speech recognition result corresponding to the at least a second portion of the input audio;
perform a pronunciation alignment of the local speech recognition result and the remote speech recognition result;
identify, based on the aligned local and remote speech recognition results, a portion of the remote speech recognition result corresponding to a low-confidence part of the local speech recognition result; and
train the embedded speech recognizer based, at least in part, on the remote speech recognition result, wherein training the embedded speech recognizer comprises adding the identified portion of the remote speech recognition result to the command grammar used by the embedded speech recognizer to recognize the at least a portion of the input audio.
Dependent Claims: 17, 18, 19, 20, 21
Specification