MULTIMODAL DISAMBIGUATION OF SPEECH RECOGNITION
Abstract
The present invention provides a speech recognition system combined with one or more alternate input modalities to ensure efficient and accurate text input. The speech recognition system achieves less than perfect accuracy due to limited processing power, environmental noise, and/or natural variations in speaking style. The alternate input modalities use disambiguation or recognition engines to compensate for reduced keyboards, sloppy input, and/or natural variations in writing style. The ambiguity remaining in the speech recognition process is mostly orthogonal to the ambiguity inherent in the alternate input modality, such that the combination of the two modalities resolves the recognition errors efficiently and accurately. The invention is especially well suited for mobile devices with limited space for keyboards or touch-screen input.
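The orthogonality argument in the abstract can be illustrated with a minimal sketch. It assumes a standard phone keypad as the "reduced keyboard" and a prefix match on key presses; the layout table, the word lists, and the names key_sequence and disambiguate are hypothetical and are not taken from the patent.

```python
# Assumed reduced keyboard: a standard phone keypad (not specified by the abstract).
KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}


def key_sequence(word):
    """Return the digit keys a user would press to enter `word`."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def disambiguate(speech_candidates, keys_pressed):
    """Keep speech candidates whose key sequence starts with the keys pressed."""
    return [w for w in speech_candidates
            if key_sequence(w).startswith(keys_pressed)]


# Speech alone confuses similar-sounding words (illustrative list).
speech_n_best = ["meet", "meat", "mete"]

# The keypad alone confuses words that share keys: "meat" and "neat" are both 6328.
keypad_only = [w for w in ["meat", "neat", "meet"] if key_sequence(w) == "6328"]

# Together, three key presses leave a single candidate.
combined = disambiguate(speech_n_best, "632")
print(keypad_only, combined)
```

In this toy case each modality on its own leaves more than one candidate, while the two together leave exactly one, because the words that sound alike do not share key presses.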
8 Claims
1. A computer-implemented method for processing language input in a system that includes a server and mobile computer, the mobile computer including a microphone and a display and a text input device operable by a user, the method comprising operations of:
(a) responsive to the mobile computing device receiving via the microphone voice input comprising multiple discrete utterances from a user, converting the voice input into a digital sequence of vectors and then wirelessly transmitting the digital sequence of vectors to the server;
(b) the server creating an initial N-best list of words corresponding to each of the utterances by conducting speech recognition operations including matching the vectors to potential phonemes and matching the phonemes against a lexicon model and a language model, the operation of creating each initial N-best list of words further considering context of the corresponding utterance with respect to words of N-best lists corresponding to others of the received utterances, said context including subject-verb agreement, proper case, proper gender, and numerical agreement;
(c) the server transmitting each of the initial N-best lists of words to the mobile computing device;
(d) for each of said utterances, the mobile computing device visually displaying a best word from the initial N-best list of words corresponding to said utterance;
(e) responsive to implied or explicit user selection of one of the displayed best words, said selected word being from a given N-best list of words corresponding to a given utterance, causing the display to present additional words from the given initial N-best list of words;
(f) during said presentation of the additional words, the mobile computing device receiving via the text input device hand-entered input from a user, and responsive to said text input, constraining said presentation of the additional words to exclude words of the given initial N-best list that are inconsistent with the textual input;
(g) responsive to said presentation of the additional words being constrained to a resultant word, displaying the resultant word instead of the selected word and transmitting the resultant word to the server;
(h) responsive to receiving the resultant word, the server updating the initial N-best lists of others of the utterances besides the given utterance to provide subject-verb agreement, employ proper case, use proper gender, and exhibit numerical agreement when considered in context of the resultant word, and transmitting the updated N-best lists to the mobile computing device;
(i) for each of the utterances having an updated N-best list, the mobile computing device causing the display to present a best word of the updated N-best list of words for that utterance; and
(j) for each of the utterances without an updated N-best list, the mobile computing device causing the display to present a best word of the initial N-best list of words for said utterance.
Dependent Claims (2, 3, 4)
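Operations (d) through (g) of claim 1 can be sketched on the device side as follows. This is an illustration under stated assumptions, not the claimed implementation: "inconsistent with the textual input" is read here as "does not start with the typed letters", and the function names and sample lists are hypothetical.

```python
def best_words(n_best_lists):
    """Operation (d): show the top-ranked word of each utterance's N-best list."""
    return [words[0] for words in n_best_lists]


def constrain(n_best, typed):
    """Operation (f): drop words inconsistent with the typed text.

    Assumption: consistency means a case-insensitive prefix match.
    """
    return [w for w in n_best if w.lower().startswith(typed.lower())]


def correct(n_best_lists, index, keystrokes):
    """Operations (e) to (g): narrow the selected utterance's list keystroke by
    keystroke and return the resultant word once one candidate remains."""
    typed = ""
    for ch in keystrokes:
        typed += ch
        remaining = constrain(n_best_lists[index], typed)
        if len(remaining) == 1:
            return remaining[0]  # would replace the displayed word and go to the server
    return None  # input did not narrow the list to a single word


# Illustrative N-best lists for three utterances.
lists = [["their", "there"], ["dog", "dogs", "docks"], ["barks", "bark", "parks"]]
print(best_words(lists))         # initial display
print(correct(lists, 1, "doc"))  # user selects "dog", then types d, o, c
```

With these sample lists, typing "d" and "do" leaves all three candidates for the second utterance, and the third letter "c" leaves only "docks", which becomes the resultant word.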
5. A system for processing language input, comprising:
a server; and
a mobile computer including a microphone and a display and a text input device operable by a user;
where the server and the mobile computer are programmed to perform computer-implemented operations comprising:
(a) responsive to the mobile computing device receiving via the microphone voice input comprising multiple discrete utterances from a user, converting the voice input into a digital sequence of vectors and then wirelessly transmitting the digital sequence of vectors to the server;
(b) the server creating an initial N-best list of words corresponding to each of the utterances by conducting speech recognition operations including matching the vectors to potential phonemes and matching the phonemes against a lexicon model and a language model, the operation of creating each initial N-best list of words further considering context of the corresponding utterance with respect to words of N-best lists corresponding to others of the received utterances, said context including subject-verb agreement, proper case, proper gender, and numerical agreement;
(c) the server transmitting each of the initial N-best lists of words to the mobile computing device;
(d) for each of said utterances, the mobile computing device visually displaying a best word from the initial N-best list of words corresponding to said utterance;
(e) responsive to implied or explicit user selection of one of the displayed best words, said selected word being from a given N-best list of words corresponding to a given utterance, causing the display to present additional words from the given initial N-best list of words;
(f) during said presentation of the additional words, the mobile computing device receiving via the text input device hand-entered input from a user, and responsive to said text input, constraining said presentation of the additional words to exclude words of the given initial N-best list that are inconsistent with the textual input;
(g) responsive to said presentation of the additional words being constrained to a resultant word, displaying the resultant word instead of the selected word and transmitting the resultant word to the server;
(h) responsive to receiving the resultant word, the server updating the initial N-best lists of others of the utterances besides the given utterance to provide subject-verb agreement, employ proper case, use proper gender, and exhibit numerical agreement when considered in context of the resultant word, and transmitting the updated N-best lists to the mobile computing device;
(i) for each of the utterances having an updated N-best list, the mobile computing device causing the display to present a best word of the updated N-best list of words for that utterance; and
(j) for each of the utterances without an updated N-best list, the mobile computing device causing the display to present a best word of the initial N-best list of words for said utterance.
Dependent Claims (6, 7, 8)
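Operations (h) through (j) can be sketched as a server-side re-ranking followed by a device-side merge. The sketch below is a toy under stated assumptions: it models only numerical agreement, using a hand-written tag table in place of a language model, and the names NUMBER, update_lists, and displayed_words are hypothetical.

```python
# Toy grammatical-number tags (assumption). Verbs are tagged by the subject
# number they agree with; untagged words are treated as neutral.
NUMBER = {"dog": "sg", "dogs": "pl", "barks": "sg", "bark": "pl"}


def update_lists(initial_lists, given_index, resultant):
    """Operation (h), server side: re-rank the other utterances' lists so that
    words agreeing with the resultant word come first. Returns {index: list}
    only for lists whose order actually changed."""
    target = NUMBER.get(resultant)
    updates = {}
    if target is None:
        return updates
    for i, words in enumerate(initial_lists):
        if i == given_index:
            continue
        # sorted() is stable, so the original ranking is kept within each group.
        reranked = sorted(words, key=lambda w: NUMBER.get(w) not in (None, target))
        if reranked != words:
            updates[i] = reranked
    return updates


def displayed_words(initial_lists, given_index, resultant, updates):
    """Operations (i) and (j), device side: show the best word of the updated
    list where one exists, otherwise the best word of the initial list."""
    shown = []
    for i, words in enumerate(initial_lists):
        if i == given_index:
            shown.append(resultant)
        else:
            shown.append(updates.get(i, words)[0])
    return shown


initial = [["the"], ["dog", "dogs"], ["barks", "bark"]]
updates = update_lists(initial, 1, "dogs")  # user corrected "dog" to "dogs"
print(displayed_words(initial, 1, "dogs", updates))
```

Here the first utterance receives no updated list, so its initial best word stays on the display, while the third utterance's list is re-ranked so that "bark" replaces "barks" after the correction to "dogs".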
Specification