Eye gaze for spoken language understanding in multi-modal conversational interactions
First Claim
1. A computer-implemented method comprising:
identifying a plurality of visual elements available for user interaction in a visual context on a display;
receiving speech input including one or more words spoken by a user;
extracting lexical features from the speech input;
computing, for each visual element of the plurality of visual elements, a lexical similarity between the lexical features and the respective visual element of the plurality of visual elements and a lexical probability for each lexical similarity;
receiving, from a tracking component, a gaze input;
determining, from the gaze input, a heat map representing a probabilistic model of objects the user is looking at in the visual context on the display, the objects including the plurality of visual elements;
determining that a particular visual element of the plurality of visual elements is an intended visual element of the speech input using a combination of a lexical probability of the lexical probabilities and the heat map;
determining, by one or more processors, that the speech input comprises a command directed to the particular visual element; and
causing an action associated with the particular visual element to be performed.
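The resolution step of claim 1 — picking the intended visual element "using a combination of a lexical probability of the lexical probabilities and the heat map" — can be sketched as follows. This is a hypothetical illustration, not the patent's implementation: the `VisualElement` type, the bounding-box representation, and the simple product combination of lexical probability and heat-map mass are all assumptions.

```python
# Hypothetical sketch: combine each element's lexical probability with the
# gaze heat map's probability mass over that element's screen region, and
# pick the element with the highest combined score.
from dataclasses import dataclass

@dataclass
class VisualElement:
    name: str
    bbox: tuple  # (x0, y0, x1, y1) in heat-map grid coordinates (assumed)

def gaze_mass(heat_map, bbox):
    """Sum the heat-map probability falling inside an element's bounding box."""
    x0, y0, x1, y1 = bbox
    return sum(heat_map[y][x] for y in range(y0, y1) for x in range(x0, x1))

def resolve_reference(elements, lexical_probs, heat_map):
    """Pick the element maximizing combined lexical and gaze evidence."""
    scores = {
        e.name: lexical_probs[e.name] * gaze_mass(heat_map, e.bbox)
        for e in elements
    }
    return max(scores, key=scores.get)
```

For example, with two equally likely labels ("play", "stop") and gaze concentrated on the "stop" button's region, `resolve_reference` returns `"stop"`: the heat map breaks the lexical tie.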
Abstract
Improving accuracy in understanding and/or resolving references to visual elements in a visual context associated with a computerized conversational system is described. Techniques described herein leverage gaze input with gestures and/or speech input to improve spoken language understanding in computerized conversational systems. Leveraging gaze input and speech input improves spoken language understanding in conversational systems by improving the accuracy by which the system can resolve references—or interpret a user's intent—with respect to visual elements in a visual context. In at least one example, the techniques herein describe tracking gaze to generate gaze input, recognizing speech input, and extracting gaze features and lexical features from the user input. Based at least in part on the gaze features and lexical features, user utterances directed to visual elements in a visual context can be resolved.
20 Claims
1. A computer-implemented method comprising:
identifying a plurality of visual elements available for user interaction in a visual context on a display;
receiving speech input including one or more words spoken by a user;
extracting lexical features from the speech input;
computing, for each visual element of the plurality of visual elements, a lexical similarity between the lexical features and the respective visual element of the plurality of visual elements and a lexical probability for each lexical similarity;
receiving, from a tracking component, a gaze input;
determining, from the gaze input, a heat map representing a probabilistic model of objects the user is looking at in the visual context on the display, the objects including the plurality of visual elements;
determining that a particular visual element of the plurality of visual elements is an intended visual element of the speech input using a combination of a lexical probability of the lexical probabilities and the heat map;
determining, by one or more processors, that the speech input comprises a command directed to the particular visual element; and
causing an action associated with the particular visual element to be performed.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
9. A device comprising:
one or more processors;
computer-readable media encoded with instructions that, when executed by the one or more processors, configure the device to perform acts comprising:
identifying a plurality of visual elements for receiving user interaction in a visual context on a display;
determining a user utterance transcribed from speech input comprising one or more words spoken in a particular language, the user utterance comprising a command to perform an action;
receiving, from an eye tracking component, gaze input;
determining, from the gaze input, a heat map representing a probabilistic model of objects the user is looking at in the visual context on the display, the objects including the plurality of visual elements;
extracting lexical features based at least in part on the user utterance;
computing, for each visual element of the plurality of visual elements, a lexical similarity between the lexical features and the respective visual element of the plurality of visual elements and a lexical probability for each lexical similarity;
extracting gaze features based at least in part on the heat map; and
determining that the command to perform the action is directed to an intended visual element using a combination of a lexical probability of the lexical probabilities and the gaze features.
- View Dependent Claims (10, 11, 12, 13)
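The "lexical similarity" and "lexical probability" steps recited in claim 9 can be sketched as below. The claim does not specify a similarity measure or normalization, so both are assumptions here: similarity is taken as token overlap between the utterance and each element's label, and probabilities come from a softmax over the similarities.

```python
# Hypothetical sketch: token-overlap similarity between the user utterance
# and each visual element's label, normalized to a lexical probability per
# element with a softmax. Both choices are illustrative assumptions.
import math

def lexical_similarity(utterance, label):
    """Fraction of the label's tokens that appear in the utterance."""
    u, l = set(utterance.lower().split()), set(label.lower().split())
    return len(u & l) / len(l) if l else 0.0

def lexical_probabilities(utterance, labels):
    """Softmax over per-element similarities, yielding one probability each."""
    sims = {lab: lexical_similarity(utterance, lab) for lab in labels}
    z = sum(math.exp(s) for s in sims.values())
    return {lab: math.exp(s) / z for lab, s in sims.items()}
```

For the utterance "open the settings menu" and element labels "settings menu" and "home", the first label receives the higher lexical probability, and the probabilities sum to one.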
14. A system comprising:
an eye tracking sensor;
a display;
computer-readable media;
one or more processors; and
modules stored on the computer-readable media and executable by the one or more processors, the modules comprising:
a receiving module configured to receive:
speech input comprising one or more words referring to a particular visual element of a plurality of visual elements presented on a user interface of the display; and
gaze input from the tracking component, the gaze input directed to one or more of the plurality of visual elements presented on the user interface;
an extraction module configured to:
determine, from the gaze input, a heat map representing a probabilistic model of objects a user is looking at in a visual context on the display, the objects including the plurality of visual elements;
extract lexical features from the speech input;
compute, for each visual element of the plurality of visual elements, a lexical similarity between the extracted lexical features and the respective visual element of the plurality of visual elements; and
an analysis module configured to compute a lexical probability for each lexical similarity and to identify the particular visual element using a combination of a lexical probability of the lexical probabilities and the heat map.
- View Dependent Claims (15, 16, 17, 18, 19, 20)
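The heat map recited across the claims — "a probabilistic model of objects the user is looking at" — could be built from eye-tracker fixation points as sketched below. The grid resolution, the isotropic Gaussian spread, and the normalization are illustrative assumptions; the patent does not commit to a particular construction.

```python
# Hypothetical sketch: spread each (x, y) gaze fixation over a grid with an
# isotropic Gaussian, then normalize so the grid sums to 1, giving a
# probability map of where the user is looking.
import math

def gaze_heat_map(fixations, width, height, sigma=2.0):
    """Build a normalized heat map from (x, y) fixation points."""
    heat = [[0.0] * width for _ in range(height)]
    for fx, fy in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                heat[y][x] += math.exp(-d2 / (2 * sigma ** 2))
    total = sum(sum(row) for row in heat)  # normalize to a probability map
    return [[v / total for v in row] for row in heat] if total else heat
```

A single fixation at (5, 5) on a 10×10 grid yields a map that sums to one and peaks at that cell; summing the map over an element's bounding box then gives the gaze evidence for that element.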
Specification