Eye Gaze for Spoken Language Understanding in Multi-Modal Conversational Interactions
Abstract
Improving accuracy in understanding and/or resolving references to visual elements in a visual context associated with a computerized conversational system is described. The techniques described herein leverage gaze input together with gestures and/or speech input to improve spoken language understanding in computerized conversational systems, improving the accuracy with which the system can resolve references, or interpret a user's intent, with respect to visual elements in a visual context. In at least one example, the techniques herein describe tracking gaze to generate gaze input, recognizing speech input, and extracting gaze features and lexical features from the user input. Based at least in part on the gaze features and lexical features, user utterances directed to visual elements in a visual context can be resolved.
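The resolution step described in the abstract can be sketched as follows: score each candidate visual element by a lexical feature (overlap between the utterance and the element's name) and a gaze feature (share of fixation time on the element), then pick the highest-scoring element. All function names, the feature definitions, and the linear weighting are illustrative assumptions for this sketch, not the patented implementation.

```python
# Hypothetical sketch: fuse lexical and gaze features to resolve which
# on-screen element an utterance refers to. Weights and features are
# assumptions made for illustration.

def lexical_score(utterance, element_name):
    """Fraction of the element's name words that appear in the utterance."""
    utter = set(utterance.lower().split())
    name = set(element_name.lower().split())
    return len(utter & name) / len(name) if name else 0.0

def gaze_score(fixations, element, total_ms):
    """Share of total fixation time spent on this element."""
    return fixations.get(element, 0) / total_ms if total_ms else 0.0

def resolve_reference(utterance, elements, fixations, w_lex=0.6, w_gaze=0.4):
    """Return the visual element with the highest combined score."""
    total_ms = sum(fixations.values())
    return max(
        elements,
        key=lambda e: w_lex * lexical_score(utterance, e)
                      + w_gaze * gaze_score(fixations, e, total_ms),
    )

elements = ["play button", "volume slider", "settings menu"]
fixations = {"play button": 120, "settings menu": 480}  # fixation time in ms
print(resolve_reference("open the settings", elements, fixations))
# prints: settings menu
```

Note that the gaze feature lets the system disambiguate even when the utterance alone is ambiguous, which is the accuracy gain the abstract claims.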
20 Claims
1. A computer-implemented method comprising:
identifying visual elements available for user interaction in a visual context;
receiving user input associated with one or more of the visual elements in the visual context, the user input comprising:
an utterance derived from speech input referring to a particular visual element of the one or more visual elements; and
a gaze input associated with at least some of the one or more visual elements, the at least some of the one or more visual elements including the particular visual element;
extracting lexical features and gaze features based at least in part on the visual elements and the user input; and
determining the particular visual element based at least in part on the lexical features and gaze features.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
9. One or more computer-readable media encoded with instructions that, when executed by a processor, configure a computer to perform acts comprising:
identifying visual elements for receiving user interaction in a visual context;
receiving a user utterance transcribed from speech input referring to a first visual element of the visual elements in the visual context;
receiving gaze input associated with at least a second visual element of the visual elements in the visual context;
extracting lexical features based at least in part on the user utterance and the visual elements;
extracting gaze features based at least in part on the gaze input and the visual elements; and
determining the first visual element based at least in part on the lexical features and gaze features.
- View Dependent Claims (10, 11, 12, 13, 14)
15. A system comprising:
computer-readable media;
one or more processors; and
one or more modules on the computer-readable media and executable by the one or more processors, the one or more modules including:
a receiving module configured to receive:
a user utterance transcribed from speech input referring to a particular visual element of a plurality of visual elements presented on a user interface associated with a visual context; and
gaze input directed to one or more of the plurality of visual elements presented on the user interface associated with the visual context;
an extraction module configured to extract a set of features based at least in part on the plurality of visual elements, the user utterance, and the gaze input; and
an analysis module configured to identify the particular visual element based at least in part on the set of features.
- View Dependent Claims (16, 17, 18, 19, 20)
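The three-module decomposition recited in claim 15 (a receiving module, an extraction module, and an analysis module) can be sketched as follows. The class names, the per-element feature pair, and the sum-based scoring are assumptions made for illustration; the claim does not prescribe any particular implementation.

```python
# Hypothetical sketch of claim 15's module structure: receive utterance and
# gaze input, extract per-element features, then identify the referred element.

from dataclasses import dataclass

@dataclass
class UserInput:
    utterance: str       # transcribed speech referring to an element
    gaze_targets: list   # element names the gaze was directed to

class ReceivingModule:
    def receive(self, utterance, gaze_targets):
        return UserInput(utterance, gaze_targets)

class ExtractionModule:
    def extract(self, elements, user_input):
        # One (lexical, gaze) feature pair per candidate element:
        # word overlap with the utterance, and gaze-hit count.
        words = set(user_input.utterance.lower().split())
        return {
            e: (
                len(words & set(e.lower().split())),   # lexical feature
                user_input.gaze_targets.count(e),      # gaze feature
            )
            for e in elements
        }

class AnalysisModule:
    def identify(self, features):
        # Pick the element whose combined features score highest.
        return max(features, key=lambda e: sum(features[e]))

elements = ["search box", "back button"]
inp = ReceivingModule().receive("go back", ["back button", "back button"])
feats = ExtractionModule().extract(elements, inp)
print(AnalysisModule().identify(feats))
# prints: back button
```

Separating reception, extraction, and analysis mirrors the claim's division of labor and lets each stage be swapped independently, e.g. a different gaze tracker feeding the same analysis module.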
Specification