Eye Gaze for Spoken Language Understanding in Multi-Modal Conversational Interactions
Abstract
Improving accuracy in understanding and/or resolving references to visual elements in a visual context associated with a computerized conversational system is described. The techniques described herein leverage gaze input together with gestures and/or speech input to improve spoken language understanding in computerized conversational systems, improving the accuracy with which the system can resolve references, or interpret a user's intent, with respect to visual elements in a visual context. In at least one example, the techniques herein describe tracking gaze to generate gaze input, recognizing speech input, and extracting gaze features and lexical features from the user input. Based at least in part on the gaze features and lexical features, user utterances directed to visual elements in a visual context can be resolved.
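The resolution step described in the abstract can be sketched as follows: score each candidate visual element by a lexical feature (overlap between the utterance and the element's name) and a gaze feature (share of fixation time on the element), then pick the highest-scoring element. All function names, the feature definitions, and the linear weighting are illustrative assumptions for this sketch, not the patented implementation.

```python
# Hypothetical sketch: fuse lexical and gaze features to resolve which
# on-screen element an utterance refers to. Weights and features are
# assumptions made for illustration.

def lexical_score(utterance, element_name):
    """Fraction of the element's name words that appear in the utterance."""
    utter = set(utterance.lower().split())
    name = set(element_name.lower().split())
    return len(utter & name) / len(name) if name else 0.0

def gaze_score(fixations, element, total_ms):
    """Share of total fixation time spent on this element."""
    return fixations.get(element, 0) / total_ms if total_ms else 0.0

def resolve_reference(utterance, elements, fixations, w_lex=0.6, w_gaze=0.4):
    """Return the visual element with the highest combined score."""
    total_ms = sum(fixations.values())
    return max(
        elements,
        key=lambda e: w_lex * lexical_score(utterance, e)
                      + w_gaze * gaze_score(fixations, e, total_ms),
    )

elements = ["play button", "volume slider", "settings menu"]
fixations = {"play button": 120, "settings menu": 480}  # fixation time in ms
print(resolve_reference("open the settings", elements, fixations))
# prints: settings menu
```

Note that the gaze feature lets the system disambiguate even when the utterance alone is ambiguous, which is the accuracy gain the abstract claims.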
20 Claims
1. A computer-implemented method comprising:
identifying visual elements available for user interaction in a visual context;
receiving user input associated with one or more of the visual elements in the visual context, the user input comprising:
an utterance derived from speech input referring to a particular visual element of the one or more visual elements; and
a gaze input associated with at least some of the one or more visual elements, the at least some of the one or more visual elements including the particular visual element;
extracting lexical features and gaze features based at least in part on the visual elements and the user input; and
determining the particular visual element based at least in part on the lexical features and gaze features.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
9. One or more computer-readable media encoded with instructions that, when executed by a processor, configure a computer to perform acts comprising:
identifying visual elements for receiving user interaction in a visual context;
receiving a user utterance transcribed from speech input referring to a first visual element of the visual elements in the visual context;
receiving gaze input associated with at least a second visual element of the visual elements in the visual context;
extracting lexical features based at least in part on the user utterance and the visual elements;
extracting gaze features based at least in part on the gaze input and the visual elements; and
determining the first visual element based at least in part on the lexical features and gaze features.
- View Dependent Claims (10, 11, 12, 13, 14)
15. A system comprising:
computer-readable media;
one or more processors; and
one or more modules on the computer-readable media and executable by the one or more processors, the one or more modules including:
a receiving module configured to receive:
a user utterance transcribed from speech input referring to a particular visual element of a plurality of visual elements presented on a user interface associated with a visual context; and
gaze input directed to one or more of the plurality of visual elements presented on the user interface associated with the visual context;
an extraction module configured to extract a set of features based at least in part on the plurality of visual elements, the user utterance, and the gaze input; and
an analysis module configured to identify the particular visual element based at least in part on the set of features.
- View Dependent Claims (16, 17, 18, 19, 20)
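The three-module decomposition recited in claim 15 (a receiving module, an extraction module, and an analysis module) can be sketched as follows. The class names, the per-element feature pair, and the sum-based scoring are assumptions made for illustration; the claim does not prescribe any particular implementation.

```python
# Hypothetical sketch of claim 15's module structure: receive utterance and
# gaze input, extract per-element features, then identify the referred element.

from dataclasses import dataclass

@dataclass
class UserInput:
    utterance: str       # transcribed speech referring to an element
    gaze_targets: list   # element names the gaze was directed to

class ReceivingModule:
    def receive(self, utterance, gaze_targets):
        return UserInput(utterance, gaze_targets)

class ExtractionModule:
    def extract(self, elements, user_input):
        # One (lexical, gaze) feature pair per candidate element:
        # word overlap with the utterance, and gaze-hit count.
        words = set(user_input.utterance.lower().split())
        return {
            e: (
                len(words & set(e.lower().split())),   # lexical feature
                user_input.gaze_targets.count(e),      # gaze feature
            )
            for e in elements
        }

class AnalysisModule:
    def identify(self, features):
        # Pick the element whose combined features score highest.
        return max(features, key=lambda e: sum(features[e]))

elements = ["search box", "back button"]
inp = ReceivingModule().receive("go back", ["back button", "back button"])
feats = ExtractionModule().extract(elements, inp)
print(AnalysisModule().identify(feats))
# prints: back button
```

Separating reception, extraction, and analysis mirrors the claim's division of labor and lets each stage be swapped independently, e.g. a different gaze tracker feeding the same analysis module.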
Specification