SATISFYING SPECIFIED INTENT(S) BASED ON MULTIMODAL REQUEST(S)

US 20140330570A1
Filed: 07/21/2014
Published: 11/06/2014
Est. Priority Date: 12/15/2011
Status: Active Grant


Abstract

Techniques are described herein that are capable of satisfying specified intent(s) based on multimodal request(s). A multimodal request is a request that includes at least one request of a first type and at least one request of a second type that is different from the first type. Example types of request include but are not limited to a speech request, a text command, a tactile command, and a visual command. A determination is made that one or more entities in visual content are selected in accordance with an explicit scoping command from a user. In response, speech understanding functionality is automatically activated, and audio signals are automatically monitored for speech requests from the user to be processed using the speech understanding functionality.

20 Claims

1. A method comprising:
- determining that a camera of a processing system is pointed at one or more objects or a scene;
  
  turning on speech understanding functionality of the processing system, using at least one processor of the processing system, in response to determining that the camera is pointed at the one or more objects or the scene, the speech understanding functionality enabling the processing system to understand natural language requests; and
  
  automatically monitoring audio signals received via an audio interface of the processing system for speech requests from a user of the processing system to be processed using the speech understanding functionality in response to determining that the camera is pointed at the one or more objects or the scene.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1, further comprising:
    - dynamically identifying one or more entities that correspond to the one or more objects or the scene in an image that is captured by the camera, in response to determining that the camera is pointed at the one or more objects or the scene;
      
      dynamically identifying a scope of an interaction between the user and the processing system, the scope being associated with the one or more entities, in response to determining that the camera is pointed at the one or more objects or the scene; and
      
      adapting an understanding of an intent of the user, which is used by the speech understanding functionality to understand the natural language requests, based on the scope.
  - 3. The method of claim 2, further comprising:
    - determining a plurality of possible intents that are available to be satisfied with respect to the one or more entities; and
      
      providing one or more representations of one or more respective possible intents of the plurality of possible intents for the user in response to dynamically identifying the one or more entities and further in response to dynamically identifying the scope.
  - 4. The method of claim 3, wherein providing the one or more representations comprises:
    - suggesting one or more exemplary natural language requests for use by the user to request satisfaction of the one or more respective possible intents, each of the one or more exemplary natural language requests capable of being understood by the processing system using the speech understanding functionality.
  - 5. The method of claim 1, wherein automatically monitoring the audio signals comprises:
    - receiving a first speech request from the user, the first speech request indicating a specified intent to be satisfied with respect to one or more entities that correspond to the one or more objects or the scene in an image that is captured by the camera; and
      
      wherein the method further comprises:
      
      generating a language response for the user based on the first speech request, the language response pertaining to satisfaction of the specified intent.
  - 6. The method of claim 1, wherein automatically monitoring the audio signals comprises:
    - receiving a first speech request from the user, the first speech request indicating a specified intent to be satisfied with respect to one or more entities that correspond to the one or more objects or the scene in an image that is captured by the camera; and
      
      wherein the method further comprises:
      
      satisfying the specified intent with respect to the one or more entities; and
      
      adapting an understanding of an intent of the user, which is used by the speech understanding functionality to understand the natural language requests, based on the specified intent.
  - 7. The method of claim 1, further comprising:
    - automatically monitoring a tactile interface of the processing system to detect tactile commands from the user in response to determining that the camera is pointed at the one or more objects or the scene.
  - 8. The method of claim 1, further comprising:
    - automatically monitoring textual information received via a textual interface of the processing system for textual commands from the user in response to determining that the camera is pointed at the one or more objects or the scene.
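
The flow recited in claim 1 can be sketched as a small state holder: once the camera is determined to be pointed at something, speech understanding is switched on and audio starts being treated as potential speech requests. This is only an illustrative sketch, not the patented implementation. The class `ProcessingSystem`, its methods, and the use of a non-empty detection list as the "camera is pointed at objects" signal are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessingSystem:
    """Hypothetical stand-in for the claimed processing system."""

    speech_understanding_on: bool = False
    monitoring_audio: bool = False
    received_requests: List[str] = field(default_factory=list)

    def on_camera_frame(self, detected_objects: List[str]) -> None:
        # Assumption: "the camera is pointed at one or more objects or a scene"
        # is modeled as a non-empty list of detections for the current frame.
        if detected_objects and not self.speech_understanding_on:
            # Both steps happen in response to the same determination.
            self.speech_understanding_on = True
            self.monitoring_audio = True

    def on_audio(self, utterance: str) -> bool:
        # Audio is only treated as a speech request while monitoring is active.
        if not self.monitoring_audio:
            return False
        self.received_requests.append(utterance)
        return True
```

In this sketch, audio that arrives before any object is detected is ignored, while audio that arrives afterwards is queued for the (unmodeled) speech understanding step.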

9. A processing system comprising:
- a display configured to display visual content;
  
  a camera configured to capture visual information;
  
  determination logic configured to determine whether one or more entities in the visual content are selected in accordance with a visual command from a user, the visual command identifying the one or more entities in the visual content;
  
  speech understanding logic configured to understand natural language requests;
  
  activation logic configured to turn on speech understanding functionality of the speech understanding logic in response to a determination that the one or more entities are selected in accordance with the visual command;
  
  an audio interface configured to receive audio signals; and
  
  monitoring logic configured to monitor the visual information for visual commands from the user, the monitoring logic further configured to automatically monitor the audio signals for speech requests from the user in response to the determination that the one or more entities are selected, the monitoring logic further configured to provide the speech requests to the speech understanding logic for processing.
- Dependent claims (10, 11, 12, 13, 14, 15, 16, 17):
  - 10. The processing system of claim 9, further comprising:
    - identification logic configured to dynamically identify the one or more entities and a scope of an interaction between the user and the processing system, the scope being associated with the one or more entities, in response to the determination that the one or more entities are selected; and
      
      intent logic configured to adapt an understanding of an intent of the user based on the scope;
      
      wherein the speech understanding logic is configured to use the understanding of the intent of the user to understand the natural language requests.
  - 11. The processing system of claim 10, further comprising:
    - availability logic configured to determine a plurality of possible intents that are available to be satisfied with respect to the one or more entities;
      
      wherein the display displays one or more visual representations of one or more respective possible intents of the plurality of possible intents in response to the one or more entities and the scope being dynamically identified.
  - 12. The processing system of claim 11, wherein the one or more visual representations include one or more respective exemplary natural language requests that are suggested for use by the user to request satisfaction of the one or more respective possible intents;
    - and wherein the speech understanding logic is capable of understanding each of the one or more exemplary natural language requests.
  - 13. The processing system of claim 10, further comprising:
    - availability logic configured to determine a plurality of possible intents that are available to be satisfied with respect to the one or more entities; and
      
      a second audio interface that provides one or more audio representations of one or more respective possible intents of the plurality of possible intents for the user in response to the one or more entities and the scope being dynamically identified.
  - 14. The processing system of claim 13, wherein the one or more audio representations include one or more respective exemplary natural language requests that are suggested for use by the user to request satisfaction of the one or more respective possible intents;
    - and wherein the speech understanding logic is capable of understanding each of the one or more exemplary natural language requests.
  - 15. The processing system of claim 9, further comprising:
    - intent logic configured to satisfy a specified intent with respect to the one or more entities based on receipt of a first speech request from the user, the first speech request indicating that the specified intent is to be satisfied with respect to the one or more entities, the intent logic further configured to adapt an understanding of an intent of the user based on the specified intent;
      
      wherein the speech understanding logic is configured to use the understanding of the intent of the user to understand the natural language requests.
  - 16. The processing system of claim 9, wherein the display includes a tactile interface configured to detect contact of an object with the display;
    - and wherein the monitoring logic is further configured to automatically monitor the tactile interface to detect tactile commands from the user in response to the determination that the one or more entities are selected.
  - 17. The processing system of claim 9, further comprising:
    - a textual interface configured to capture textual information;
      
      wherein the monitoring logic is further configured to automatically monitor the textual information for textual commands from the user in response to the determination that the one or more entities are selected.
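
Claims 10 through 14 describe availability logic that determines which intents can be satisfied for the selected entities and surfaces exemplary natural language requests for them. A minimal sketch of that lookup might look like the following. The catalog contents, the entity type names, and the function `possible_intents` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical catalog: entity type -> {intent name: exemplary request}.
INTENT_CATALOG: Dict[str, Dict[str, str]] = {
    "book": {
        "buy": "Buy this book",
        "review": "Show me reviews of this book",
    },
    "restaurant": {
        "reserve": "Book a table here",
        "menu": "Show me the menu",
    },
}


def possible_intents(entities: List[str]) -> Dict[str, str]:
    """Return the intents available for the selected entities (the scope),
    each paired with an exemplary request that could be shown or spoken."""
    suggestions: Dict[str, str] = {}
    for entity in entities:
        for intent, example in INTENT_CATALOG.get(entity, {}).items():
            # Keep the first example seen when entities share an intent name.
            suggestions.setdefault(intent, example)
    return suggestions
```

Restricting suggestions to the selected entities is one way to model how the scope narrows what the speech understanding logic needs to handle.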

18. A processing system comprising:
- determination logic that includes electrical circuitry and is configured to determine an intent of a user;
  
  intent logic configured to satisfy the intent;
  
  multimodal logic configured to present one or more representations that correspond to the intent of the user via an interface in response to satisfaction of the intent, each of the one or more representations including at least one carrier phrase and at least one slot; and
  
  speech understanding logic configured to receive a spoken response via a sensor in response to presentation of the one or more representations, the spoken response including one or more carrier phrases and further including one or more words in lieu of one or more slots, the spoken response indicating a task to be performed, the intent logic further configured to perform the task in response to receipt of the spoken response.
- Dependent claims (19, 20):
  - 19. The processing system of claim 18, wherein the processing system is a tablet computer.
  - 20. The processing system of claim 18, wherein the processing system is a mobile phone.
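
Claim 18 refers to representations made of a carrier phrase and a slot, with the spoken response supplying words in place of the slot. One way to picture this, as a sketch only, is a template such as `Send {item} to {contact}`, where the fixed words are the carrier phrase and the braces mark slots. The template syntax and the helper names `template_to_pattern` and `fill_slots` are assumptions made here, not terms from the patent.

```python
import re
from typing import Dict, Optional


def template_to_pattern(template: str) -> "re.Pattern[str]":
    """Turn 'Send {item} to {contact}' into a regex with one named group per slot."""
    # re.split with a capture group alternates: literal, slot name, literal, ...
    parts = re.split(r"\{(\w+)\}", template)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pieces.append(re.escape(part))      # carrier-phrase text
        else:
            pieces.append(f"(?P<{part}>.+?)")   # slot to be filled by the user
    return re.compile("^" + "".join(pieces) + "$", re.IGNORECASE)


def fill_slots(template: str, spoken: str) -> Optional[Dict[str, str]]:
    """Return the words spoken in lieu of each slot, or None if the
    response does not follow the carrier phrase."""
    match = template_to_pattern(template).match(spoken.strip())
    return match.groupdict() if match else None
```

Here a recognized transcript is matched as text. A real system would operate on speech recognition output and would likely tolerate paraphrases, which this sketch does not attempt.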

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Stifelman, Lisa J.; Sullivan, Anne K.; Elman, Adam D.; Heck, Larry Paul; Tryphonas, Stephanos; Zargahi, Kamran Rajabi; Thai, Ken H.
Granted Patent: US 9,542,949 B2
US Class Current: 704/275
CPC Class Codes:

G06F 3/0482   Interaction with lists of s...

G06F 3/167   Audio in a user interface, ...

G10L 15/18   using natural language mode...

G10L 17/22   Interactive procedures; Man...

G10L 2015/223   Execution procedure of a sp...
