System and method for processing multi-modal device interactions in a natural language voice services environment

US 9,953,649 B2
Filed: 02/13/2017
Issued: 04/24/2018
Est. Priority Date: 02/20/2009
Status: Active Grant


10 Assignments

0 Petitions

Abstract

A system and method for processing multi-modal device interactions in a natural language voice services environment may be provided. In particular, one or more multi-modal device interactions may be received in a natural language voice services environment that includes one or more electronic devices. The multi-modal device interactions may include a non-voice interaction with at least one of the electronic devices or an application associated therewith, and may further include a natural language utterance relating to the non-voice interaction. Context relating to the non-voice interaction and the natural language utterance may be extracted and combined to determine an intent of the multi-modal device interaction, and a request may then be routed to one or more of the electronic devices based on the determined intent of the multi-modal device interaction.
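
The flow described in the abstract (extract context from each modality, combine it into an intent, then route a request) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the patent: the class names, the `focus` and `text` fields, the keyword rule standing in for natural language understanding, and the `routes` table.

```python
from dataclasses import dataclass


@dataclass
class NonVoiceInteraction:
    device_id: str  # device where the click or touch was detected
    focus: str      # hypothetical: what the user selected, e.g. a map location


@dataclass
class Utterance:
    device_id: str  # device where the speech was detected
    text: str       # hypothetical: transcription of the utterance


@dataclass
class Request:
    intent: str
    target: str
    target_device: str


def determine_intent(click: NonVoiceInteraction, speech: Utterance) -> tuple[str, str]:
    """Combine context from both modalities into (intent, target)."""
    text = speech.text.lower()
    # Toy keyword rule standing in for real natural language understanding.
    if "navigate" in text or "directions" in text:
        intent = "navigate"
    elif "play" in text:
        intent = "play"
    else:
        intent = "unknown"
    # The non-voice context resolves what "here" or "this" refers to.
    return intent, click.focus


def build_request(click: NonVoiceInteraction, speech: Utterance, routes: dict) -> Request:
    """Generate a request and pick a target device from an assumed routing table."""
    intent, target = determine_intent(click, speech)
    # Fall back to the device that heard the utterance if no route is known.
    device = routes.get(intent, speech.device_id)
    return Request(intent=intent, target=target, target_device=device)


click = NonVoiceInteraction(device_id="tablet", focus="47.61,-122.33")
speech = Utterance(device_id="headset", text="Get me directions to here")
request = build_request(click, speech, routes={"navigate": "car-nav"})
print(request)
```

In this sketch the utterance supplies the action and the non-voice interaction supplies the object of that action, which is one simple way to read "combined to determine an intent".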

850 Citations

21 Claims

1. A method for processing one or more multi-modal device interactions in a natural language voice services environment that includes a plurality of electronic devices each separate from one another, the plurality of electronic devices including a first electronic device having at least a non-voice input device and a second electronic device having at least a voice input device, the method comprising:
- receiving, by a voice-click component of at least one of the plurality of electronic devices, a non-voice interaction detected at the first electronic device and a natural language utterance detected at the second electronic device;
  
  determining, by the voice-click component, first context information relating to the non-voice interaction, wherein the first context information includes context relating to the non-voice interaction;
  
  determining, by the voice-click component, second context information relating to the natural language utterance, wherein the second context information includes context relating to the natural language utterance;
  
  determining, by the voice-click component, an intent based on the first context relating to the non-voice interaction and the second context relating to the natural language utterance;
  
  generating, by the voice-click component, a request based on the determined intent; and
  
  transmitting, by the voice-click component, the request to a target electronic device from among the plurality of electronic devices.
- - 2. The method of claim 1, the method further comprising:
    - generating, by the first electronic device, a user interface, wherein receiving the non-voice interaction comprises:
      
      receiving, by the first electronic device, a point of focus on the user interface, and wherein the first context is based on the point of focus.
  - 3. The method of claim 2, wherein the user interface comprises a map display interface and wherein the point of focus includes a location on the map display that indicates a geolocation, which is used as the first context.
  - 4. The method of claim 3, the method further comprising:
    - generating, by the voice-click component, a transaction lead based on the geolocation; and
      
      displaying, on the first electronic device, the transaction lead as a selectable display option on the map display.
  - 5. The method of claim 1, the method further comprising:
    - storing, by the voice-click component, a plurality of types of nonvoice interactions that are recognized by the voice-click component;
      
      receiving, by the voice-click component, a new type of nonvoice interaction to be recognized; and
      
      storing, by the voice-click component, the new type of nonvoice interaction with the plurality of types of nonvoice interactions.
  - 6. The method of claim 1, the method further comprising:
    - storing, by the voice-click component, a list of the plurality of electronic devices from which the voice-click component is configured to receive nonvoice interactions and/or voice interactions;
      
      monitoring, by the voice-click component, the plurality of electronic devices based on the list.
  - 7. The method of claim 6, the method further comprising:
    - determining, by the first electronic device, a first time at which the nonvoice interaction was detected at the first electronic device;
      
      determining, by the second electronic device, a second time at which the natural language utterance was detected at the first electronic device;
      
      obtaining, by the voice-click component, the first time and the second time; and
      
      determining, by the voice-click component, that the nonvoice interaction and the natural language utterance relate form a multi-modal interaction that is to be interpreted together based on the first time and the second time.
  - 8. The method of claim 6, the method further comprising:
    - receiving, by the voice-click component, an indication of a new electronic device to be added to the list of the plurality of electronic devices;
      
      adding, by the voice-click component, the new electronic device to the list of the plurality of electronic devices; and
      
      establishing, by the voice-click component, a new listener to monitor the new electronic device.
  - 9. The method of claim 1, the method further comprising:
    - obtaining, by the voice-click component, a constellation model that provides knowledge relating to content, services, applications, and/or intent determination capabilities of each of the plurality of electronic devices; and
      
      identifying, by the voice-click component, the target electronic device based on the constellation model and the request based on a determination that the target electronic device is able to handle the request as specified in the constellation model, wherein the request is transmitted to the target electronic device.
  - 10. The method of claim 1, the method further comprising:
    - obtaining, by the voice-click component, one or more words or phrases of the natural language utterance; and
      
      determining, by the voice-click component, the intent based on the one or more words or phrases, the first context, and the second context.
  - 11. The method of claim 1, wherein the voice-click component executes on the first electronic device, the second electronic device, or another one of the plurality of electronic devices.
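
Claims 6 through 8 describe a monitored device list, listeners for newly added devices, and the use of detection times to decide that a non-voice interaction and an utterance belong together. The following Python sketch shows one way these pieces could fit. The fixed pairing window (`window_seconds`), the `Event` fields, and the nearest-in-time rule are assumptions made for illustration; the claims only say that the determination is based on the first time and the second time.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    device_id: str
    kind: str         # "voice" or "nonvoice"
    timestamp: float  # seconds, as reported by the detecting device


@dataclass
class VoiceClickMonitor:
    window_seconds: float = 3.0  # assumed pairing threshold, not from the claims
    devices: list = field(default_factory=list)
    events: list = field(default_factory=list)

    def add_device(self, device_id: str) -> None:
        """Add a device to the monitored list; adding it twice has no effect."""
        if device_id not in self.devices:
            self.devices.append(device_id)

    def on_event(self, event: Event) -> None:
        """Listener callback: drop events from devices that are not on the list."""
        if event.device_id in self.devices:
            self.events.append(event)

    def pair(self) -> list:
        """Pair each non-voice event with the closest voice event in the window."""
        voice = [e for e in self.events if e.kind == "voice"]
        pairs = []
        for click in (e for e in self.events if e.kind == "nonvoice"):
            near = [v for v in voice
                    if abs(v.timestamp - click.timestamp) <= self.window_seconds]
            if near:
                best = min(near, key=lambda v: abs(v.timestamp - click.timestamp))
                pairs.append((click, best))
        return pairs


monitor = VoiceClickMonitor()
monitor.add_device("tablet")
monitor.add_device("headset")
monitor.on_event(Event("tablet", "nonvoice", 10.0))
monitor.on_event(Event("headset", "voice", 11.2))
monitor.on_event(Event("headset", "voice", 30.0))
monitor.on_event(Event("toaster", "voice", 10.1))  # not on the list, dropped
pairs = monitor.pair()
```

Here the click at 10.0 s is paired with the utterance at 11.2 s, while the utterance at 30.0 s falls outside the assumed window and is left to be interpreted on its own.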

12. A system for processing one or more multi-modal device interactions in a natural language voice services environment that includes a plurality of electronic devices each separate from one another, the plurality of electronic devices including a first electronic device having at least a non-voice input device and a second electronic device having at least a voice input device, the system comprising:
- a computer system comprising one or more physical processors implementing a voice-click component to:
  
  receive a non-voice interaction detected at the first electronic device and a natural language utterance detected at the second electronic device;
  
  determine first context information relating to the non-voice interaction, wherein the first context information includes context relating to the non-voice interaction;
  
  determine second context information relating to the natural language utterance, wherein the second context information includes context relating to the natural language utterance;
  
  determine an intent based on the first context relating to the non-voice interaction and the second context relating to the natural language utterance;
  
  generate a request based on the determined intent; and
  
  transmit the request to a target electronic device from among the plurality of electronic devices.
- - 13. The system of claim 12, wherein the first electronic device is programmed to:
    - generate a user interface, wherein receiving the non-voice interaction comprises:
      
      receive a point of focus on the user interface, and wherein the first context is based on the point of focus.
  - 14. The system of claim 13, wherein the user interface comprises a map display interface and wherein the point of focus includes a location on the map display that indicates a geolocation, which is used as the first context.
  - 15. The system of claim 14, wherein the voice-click component is further configured to:
    - generate a transaction lead based on the geolocation; and
      
      wherein the first electronic device is configured to:
      
      display the transaction lead as a selectable display option on the map display.
  - 16. The system of claim 12, wherein the voice-click component is further configured to:
    - store a plurality of types of nonvoice interactions that are recognized by the voice-click component;
      
      receive a new type of nonvoice interaction to be recognized; and
      
      store the new type of nonvoice interaction with the plurality of types of nonvoice interactions.
  - 17. The system of claim 12, wherein the voice-click component is further configured to:
    - store a list of the plurality of electronic devices from which the voice-click component is configured to receive nonvoice interactions and/or voice interactions;
      
      monitor the plurality of electronic devices based on the list.
  - 18. The system of claim 17, wherein the first electronic device is further configured to:
    - determine a first time at which the nonvoice interaction was detected at the first electronic device;
      
      determine a second time at which the natural language utterance was detected at the first electronic device;
      
      obtain the first time and the second time; and
      
      determine that the nonvoice interaction and the natural language utterance relate form a multi-modal interaction that is to be interpreted together based on the first time and the second time.
  - 19. The system of claim 17, wherein the voice-click component is further configured to:
    - receive an indication of a new electronic device to be added to the list of the plurality of electronic devices;
      
      add the new electronic device to the list of the plurality of electronic devices; and
      
      establish a new listener to monitor the new electronic device.
  - 20. The system of claim 12, wherein the voice-click component is further configured to:
    - obtain a constellation model that provides knowledge relating to content, services, applications, and/or intent determination capabilities of each of the plurality of electronic devices; and
      
      identify the target electronic device based on the constellation model and the request based on a determination that the target electronic device is able to handle the request as specified in the constellation model, wherein the request is transmitted to the target electronic device.
  - 21. The system of claim 12, wherein the voice-click component is further configured to:
    - obtain one or more words or phrases of the natural language utterance; and
      
      determine the intent based on the one or more words or phrases, the first context, and the second context.
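
Claims 9 and 20 identify the target device from a "constellation model" that records what each device is able to handle. A minimal Python sketch of that lookup follows. The dictionary layout, the device names, the `preferred` parameter, and the alphabetical tie-break are all hypothetical; the claims do not specify how the model is stored or how ties between capable devices are resolved.

```python
# Hypothetical constellation model: device id -> request types it can handle.
CONSTELLATION = {
    "car-nav": {"navigate", "traffic"},
    "media-player": {"play", "pause"},
    "phone": {"call", "navigate"},
}


def find_target(request_type, constellation, preferred=None):
    """Return a device able to handle request_type, or None if there is none."""
    capable = sorted(
        device for device, capabilities in constellation.items()
        if request_type in capabilities
    )
    if not capable:
        return None
    if preferred in capable:
        return preferred
    return capable[0]  # deterministic tie-break: alphabetical order


print(find_target("navigate", CONSTELLATION))
print(find_target("navigate", CONSTELLATION, preferred="phone"))
print(find_target("print", CONSTELLATION))
```

Returning `None` when no device is capable leaves the caller to decide what to do with an unroutable request, which the claims do not address.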


Current Assignee: Oracle International Corporation (Oracle Corporation)
Original Assignee: VoiceBox Technologies Corporation (Microsoft Corporation)
Inventors: Weider, Chris; Baldwin, Larry
Primary Examiner(s): Yo, Huyen
Application Number: US15/430,952
Publication Number: US 20170221482A1
Time in Patent Office: 435 Days
Field of Search: 704/1-10, 704/230, 704/233, 704/235, 704/250, 704/251, 704/255, 704/257, 704/270, 704/270.1
US Class Current
CPC Class Codes

G06Q 30/02   Marketing; Price estimation...

G06Q 30/0241   Advertisements

G06Q 30/0261   based on user location

G06Q 30/0273   Determination of fees for a...

G10L 15/18   using natural language mode...

G10L 15/1815   Semantic context, e.g. disa...

G10L 15/22   Procedures used during a sp...

G10L 15/24   Speech recognition using no...

G10L 17/22   Interactive procedures; Man...

G10L 2015/223   Execution procedure of a sp...

G10L 2015/227   of the speaker; Human-fact...
