Eye gaze for spoken language understanding in multi-modal conversational interactions
First Claim
1. A computer-implemented method comprising:
identifying a plurality of visual elements available for user interaction in a visual context on a display;
receiving speech input including one or more words spoken by a user;
extracting lexical features from the speech input;
computing, for each visual element of the plurality of visual elements, a lexical similarity between the lexical features and the respective visual element of the plurality of visual elements and a lexical probability for each lexical similarity;
receiving, from a tracking component, a gaze input;
determining, from the gaze input, a heat map representing a probabilistic model of objects the user is looking at in the visual context on the display, the objects including the plurality of visual elements;
determining that a particular visual element of the plurality of visual elements is an intended visual element of the speech input using a combination of a lexical probability of the lexical probabilities and the heat map;
determining, by one or more processors, that the speech input comprises a command directed to the particular visual element; and
causing an action associated with the particular visual element to be performed.
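The resolution step of claim 1 — picking the intended visual element "using a combination of a lexical probability of the lexical probabilities and the heat map" — can be sketched as follows. This is a hypothetical illustration, not the patent's implementation: the `VisualElement` type, the bounding-box representation, and the simple product combination of lexical probability and heat-map mass are all assumptions.

```python
# Hypothetical sketch: combine each element's lexical probability with the
# gaze heat map's probability mass over that element's screen region, and
# pick the element with the highest combined score.
from dataclasses import dataclass

@dataclass
class VisualElement:
    name: str
    bbox: tuple  # (x0, y0, x1, y1) in heat-map grid coordinates (assumed)

def gaze_mass(heat_map, bbox):
    """Sum the heat-map probability falling inside an element's bounding box."""
    x0, y0, x1, y1 = bbox
    return sum(heat_map[y][x] for y in range(y0, y1) for x in range(x0, x1))

def resolve_reference(elements, lexical_probs, heat_map):
    """Pick the element maximizing combined lexical and gaze evidence."""
    scores = {
        e.name: lexical_probs[e.name] * gaze_mass(heat_map, e.bbox)
        for e in elements
    }
    return max(scores, key=scores.get)
```

For example, with two equally likely labels ("play", "stop") and gaze concentrated on the "stop" button's region, `resolve_reference` returns `"stop"`: the heat map breaks the lexical tie.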
Abstract
Improving accuracy in understanding and/or resolving references to visual elements in a visual context associated with a computerized conversational system is described. Techniques described herein leverage gaze input with gestures and/or speech input to improve spoken language understanding in computerized conversational systems. Leveraging gaze input and speech input improves spoken language understanding in conversational systems by improving the accuracy by which the system can resolve references—or interpret a user's intent—with respect to visual elements in a visual context. In at least one example, the techniques herein describe tracking gaze to generate gaze input, recognizing speech input, and extracting gaze features and lexical features from the user input. Based at least in part on the gaze features and lexical features, user utterances directed to visual elements in a visual context can be resolved.
20 Claims
1. A computer-implemented method comprising:
identifying a plurality of visual elements available for user interaction in a visual context on a display;
receiving speech input including one or more words spoken by a user;
extracting lexical features from the speech input;
computing, for each visual element of the plurality of visual elements, a lexical similarity between the lexical features and the respective visual element of the plurality of visual elements and a lexical probability for each lexical similarity;
receiving, from a tracking component, a gaze input;
determining, from the gaze input, a heat map representing a probabilistic model of objects the user is looking at in the visual context on the display, the objects including the plurality of visual elements;
determining that a particular visual element of the plurality of visual elements is an intended visual element of the speech input using a combination of a lexical probability of the lexical probabilities and the heat map;
determining, by one or more processors, that the speech input comprises a command directed to the particular visual element; and
causing an action associated with the particular visual element to be performed.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
9. A device comprising:
one or more processors;
computer-readable media encoded with instructions that, when executed by the one or more processors, configure the device to perform acts comprising:
identifying a plurality of visual elements for receiving user interaction in a visual context on a display;
determining a user utterance transcribed from speech input comprising one or more words spoken in a particular language, the user utterance comprising a command to perform an action;
receiving, from an eye tracking component, gaze input;
determining, from the gaze input, a heat map representing a probabilistic model of objects the user is looking at in the visual context on the display, the objects including the plurality of visual elements;
extracting lexical features based at least in part on the user utterance;
computing, for each visual element of the plurality of visual elements, a lexical similarity between the lexical features and the respective visual element of the plurality of visual elements and a lexical probability for each lexical similarity;
extracting gaze features based at least in part on the heat map; and
determining that the command to perform the action is directed to an intended visual element using a combination of a lexical probability of the lexical probabilities and the gaze features.
- View Dependent Claims (10, 11, 12, 13)
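The "lexical similarity" and "lexical probability" steps recited in claim 9 can be sketched as below. The claim does not specify a similarity measure or normalization, so both are assumptions here: similarity is taken as token overlap between the utterance and each element's label, and probabilities come from a softmax over the similarities.

```python
# Hypothetical sketch: token-overlap similarity between the user utterance
# and each visual element's label, normalized to a lexical probability per
# element with a softmax. Both choices are illustrative assumptions.
import math

def lexical_similarity(utterance, label):
    """Fraction of the label's tokens that appear in the utterance."""
    u, l = set(utterance.lower().split()), set(label.lower().split())
    return len(u & l) / len(l) if l else 0.0

def lexical_probabilities(utterance, labels):
    """Softmax over per-element similarities, yielding one probability each."""
    sims = {lab: lexical_similarity(utterance, lab) for lab in labels}
    z = sum(math.exp(s) for s in sims.values())
    return {lab: math.exp(s) / z for lab, s in sims.items()}
```

For the utterance "open the settings menu" and element labels "settings menu" and "home", the first label receives the higher lexical probability, and the probabilities sum to one.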
14. A system comprising:
an eye tracking sensor;
a display;
computer-readable media;
one or more processors; and
modules stored on the computer-readable media and executable by the one or more processors, the modules comprising:
a receiving module configured to receive:
speech input comprising one or more words referring to a particular visual element of a plurality of visual elements presented on a user interface of the display; and
gaze input from the tracking component, the gaze input directed to one or more of the plurality of visual elements presented on the user interface;
an extraction module configured to:
determine, from the gaze input, a heat map representing a probabilistic model of objects a user is looking at in a visual context on the display, the objects including the plurality of visual elements;
extract lexical features from the speech input;
compute, for each visual element of the plurality of visual elements, a lexical similarity between the extracted lexical features and the respective visual element of the plurality of visual elements; and
an analysis module configured to compute a lexical probability for each lexical similarity and to identify the particular visual element using a combination of a lexical probability of the lexical probabilities and the heat map.
- View Dependent Claims (15, 16, 17, 18, 19, 20)
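The heat map recited across the claims — "a probabilistic model of objects the user is looking at" — could be built from eye-tracker fixation points as sketched below. The grid resolution, the isotropic Gaussian spread, and the normalization are illustrative assumptions; the patent does not commit to a particular construction.

```python
# Hypothetical sketch: spread each (x, y) gaze fixation over a grid with an
# isotropic Gaussian, then normalize so the grid sums to 1, giving a
# probability map of where the user is looking.
import math

def gaze_heat_map(fixations, width, height, sigma=2.0):
    """Build a normalized heat map from (x, y) fixation points."""
    heat = [[0.0] * width for _ in range(height)]
    for fx, fy in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                heat[y][x] += math.exp(-d2 / (2 * sigma ** 2))
    total = sum(sum(row) for row in heat)  # normalize to a probability map
    return [[v / total for v in row] for row in heat] if total else heat
```

A single fixation at (5, 5) on a 10×10 grid yields a map that sums to one and peaks at that cell; summing the map over an element's bounding box then gives the gaze evidence for that element.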
Specification