Multimodal disambiguation of speech recognition
Abstract
The present invention provides a speech recognition system combined with one or more alternate input modalities to ensure efficient and accurate text input. The speech recognition system achieves less than perfect accuracy due to limited processing power, environmental noise, and/or natural variations in speaking style. The alternate input modalities use disambiguation or recognition engines to compensate for reduced keyboards, sloppy input, and/or natural variations in writing style. The ambiguity remaining in the speech recognition process is mostly orthogonal to the ambiguity inherent in the alternate input modality, such that the combination of the two modalities resolves the recognition errors efficiently and accurately. The invention is especially well suited for mobile devices with limited space for keyboards or touch-screen input.
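The orthogonality argument in the abstract can be illustrated with a small sketch: words that sound alike rarely share the same key sequence on a reduced keypad, so a few keypresses prune the recognizer's hypothesis list. This is a minimal illustration and not the patent's implementation. The keypad layout is the standard 12-key letter grouping; the hypothesis list, its scores, and the function names `key_sequence` and `disambiguate` are assumptions made up for the example.

```python
# Standard 12-key phone keypad letter groups.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}


def key_sequence(word):
    """Map a word to the ambiguous digit sequence that would type it.

    Assumes the word contains only ASCII letters.
    """
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def disambiguate(nbest, keys):
    """Keep speech hypotheses consistent with the keys typed so far.

    nbest: list of (word, score) pairs from a hypothetical recognizer.
    keys:  digit string entered on the reduced keypad (may be a prefix).
    Returns the surviving words, best recognizer score first.
    """
    matches = [(w, s) for w, s in nbest if key_sequence(w).startswith(keys)]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in matches]


# Made-up hypothesis list for one utterance (scores are illustrative only).
nbest = [("fees", 0.41), ("seas", 0.33), ("peace", 0.26)]

print(disambiguate(nbest, "7"))     # → ['seas', 'peace']
print(disambiguate(nbest, "7327"))  # → ['seas']
```

A single press of the "7" key already eliminates the top acoustic hypothesis, because "f" sits on a different key from "s" and "p" even though the three words are acoustically close.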
287 Citations
21 Claims
1. A computer-implemented method comprising:
- receiving, by a mobile device, a voice input;
- displaying, by the mobile device at a text insertion point of a touch screen display, a most likely interpretation of the voice input, the most likely interpretation resulting from a speech recognition process;
- receiving, by the mobile device on the touch screen display, a first non-voice input that selects said most likely interpretation;
- responsive to the first non-voice input, displaying for selection, by the mobile device on the touch screen display, two or more word candidates that are ordered by phonemic similarity to the most likely interpretation, wherein the most likely interpretation and the two or more word candidates are displayed in a single window, and wherein selection of the two or more word candidates from a list of known words is based at least in part on a confusability matrix that considers error frequency of one or more phonemes included in the most likely interpretation and positional context of the one or more phonemes within the most likely interpretation;
- receiving, by the mobile device, a second non-voice input that represents a selection of an intended word candidate from among said two or more word candidates; and
- automatically replacing, by the mobile device, the most likely interpretation with the intended word candidate at the text insertion point.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A computer program product, tangibly embodied in a non-transitory computer-readable storage medium, the computer program product including instructions operable to cause a data processing apparatus to:
- receive a voice input;
- display, at a text insertion point of a touch screen device, a most likely interpretation of the voice input, the most likely interpretation resulting from a speech recognition process;
- receive, on the touch screen display, a first non-voice input that selects said most likely interpretation;
- responsive to the first non-voice input, display for selection two or more word candidates on the touch screen display, wherein the two or more word candidates are ordered by phonemic similarity to the most likely interpretation, wherein the most likely interpretation and the two or more word candidates are displayed in a single window, and wherein selection of the two or more word candidates from a list of known words is based at least in part on a confusability matrix that considers error frequency of one or more phonemes included in the most likely interpretation and positional context of the one or more phonemes within the most likely interpretation;
- receive, at said non-voice input field, a second non-voice input that represents a selection of an intended word candidate from among said two or more word candidates; and
- automatically replacing the most likely interpretation with the intended word candidate at the text insertion point.
Dependent claims: 11, 12, 13
14. A mobile device including a processor configured to:
- receive a voice input;
- display, at a text insertion point of a touch screen display, a most likely interpretation of the voice input, the most likely interpretation resulting from a speech recognition process;
- receive, on the touch screen display, a first non-voice input that selects said most likely interpretation;
- responsive to the first non-voice input, display for selection two or more word candidates on the touch screen display that are ordered by phonemic similarity to the most likely interpretation, wherein the most likely interpretation and the two or more word candidates are displayed in a single window, and wherein selection of the two or more word candidates from a list of known words is based at least in part on a confusability matrix that considers error frequency of one or more phonemes included in the most likely interpretation and positional context of the one or more phonemes within the most likely interpretation;
- receive, at said non-voice input field, a second non-voice input that represents a selection of an intended word candidate from among said two or more word candidates; and
- automatically replacing the most likely interpretation with the intended word candidate at the text insertion point.
Dependent claims: 15, 16, 17, 18, 19, 20, 21
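The candidate-selection step shared by the three independent claims (a confusability matrix that weighs phoneme error frequency and positional context, with candidates ordered by phonemic similarity) might be sketched roughly as follows. This is an illustrative toy, not the method disclosed in the specification: the lexicon, the phoneme symbols, the error frequencies, the three-way position scheme (initial, medial, final), and the restriction to words of equal phoneme count are all assumptions chosen to keep the example short.

```python
from math import prod

# Hypothetical list of known words with made-up phoneme transcriptions.
LEXICON = {
    "bat": ("b", "ae", "t"),
    "pat": ("p", "ae", "t"),
    "bad": ("b", "ae", "d"),
    "mat": ("m", "ae", "t"),
    "bet": ("b", "eh", "t"),
}

# Hypothetical confusability matrix:
# (recognized phoneme, intended phoneme, position) -> error frequency.
CONFUSION = {
    ("b", "p", "initial"): 0.30,
    ("b", "m", "initial"): 0.10,
    ("t", "d", "final"): 0.25,
    ("ae", "eh", "medial"): 0.15,
}


def position(index, length):
    """Classify a phoneme's positional context within the word."""
    if index == 0:
        return "initial"
    if index == length - 1:
        return "final"
    return "medial"


def similarity(recognized, candidate):
    """Score how plausibly `candidate` was misrecognized as `recognized`.

    Simplification: only same-length phoneme sequences are compared, and
    the score is the product of per-position confusion frequencies.
    """
    if len(recognized) != len(candidate):
        return 0.0
    n = len(recognized)
    return prod(
        1.0 if r == c else CONFUSION.get((r, c, position(i, n)), 0.0)
        for i, (r, c) in enumerate(zip(recognized, candidate))
    )


def candidates(word, limit=3):
    """Return up to `limit` known words, most phonemically similar first."""
    recognized = LEXICON[word]
    scored = [
        (similarity(recognized, phonemes), other)
        for other, phonemes in LEXICON.items()
        if other != word
    ]
    scored = [(s, w) for s, w in scored if s > 0.0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for _, w in scored[:limit]]


print(candidates("bat"))  # → ['pat', 'bad', 'bet']
```

In this toy, tapping the displayed word "bat" would bring up "pat", "bad", and "bet" in that order, and the second non-voice input would pick one of them to replace "bat" at the insertion point. Keying the matrix on position lets the same phoneme pair carry different weights at the start and end of a word, which is one plausible reading of "positional context" in the claims.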
Specification