Speech-centric multimodal user interface design in mobile technology

US 8,219,406 B2
Filed: 03/15/2007
Issued: 07/10/2012
Est. Priority Date: 03/15/2007
Status: Active Grant


Abstract

A multi-modal human computer interface (HCI) receives a plurality of available information inputs concurrently, or serially, and employs a subset of the inputs to determine or infer user intent with respect to a communication or information goal. Received inputs are respectively parsed, and the parsed inputs are analyzed and optionally synthesized with respect to one or more of each other. In the event sufficient information is not available to determine user intent or goal, feedback can be provided to the user in order to facilitate clarifying, confirming, or augmenting the information inputs.

19 Claims

1. A computer-implemented interface, comprising:
- a set of parsers configured to parse information received from a plurality of sources including a mixed modality of inputs;
- a discourse manager configured to:
  - identify correlations in the information;
  - interpret the mixed modality of inputs based on environmental data associated with at least one of the mixed modality of inputs;
  - based on the identified correlations and the interpreted mixed modality of inputs, at least one of determine or infer an intent associated with the information; and
  - generate a confidence level for the intent as a function of the environmental data; and
- a response manager configured to:
  - evaluate a first input of the mixed modality of inputs, the first input having a first modality initially employed as a primary modality;
  - based on the generated confidence level, provide feedback to request a second input having a second modality different from the first modality; and
  - substitute the second modality for the first modality as the primary modality until the environmental data changes.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11):
  - 2. The computer-implemented interface of claim 1, wherein the first modality is a speech modality, and the environmental data identifies environmental noise.
  - 3. The computer-implemented interface of claim 2, wherein the second modality is a tool-based modality.
  - 4. The computer-implemented interface of claim 2, wherein the response manager is further configured to prompt for re-engagement of the speech modality as the primary modality when the environmental data changes.
  - 5. The computer-implemented interface of claim 4, wherein the discourse manager is configured to utilize the environmental data to consider voice inflection and stress level in the speech modality to generate the confidence level.
  - 6. The computer-implemented interface of claim 1, further comprising an artificial intelligence (AI) component configured to employ a probabilistic-based analysis in connection with inferring the intent.
  - 7. The computer-implemented interface of claim 1, the environmental data comprising at least one of:
    - a user state, a device state, a context of a session of the computer-implemented interface, historical or current extrinsic information about one or both of the plurality of sources or the mixed modality of inputs, or a device capability.
  - 8. The computer-implemented interface of claim 1, the mixed modalities comprising at least three of the following modalities:
    - speech, text, mouse input, pen input, gesture, pattern recognition, gaze, symbol input, audio, expression, external device input, location, temperature, vibration, orientation, or movement.
  - 9. The computer-implemented interface of claim 1, wherein the set of parsers is further configured to utilize language model to parse the information into surface semantics represented by a common modality-independent semantic representation.
  - 10. The computer-implemented interface of claim 9, wherein the discourse manager is further configured to update the environmental data and utilize the updated environmental data to adapt the language model to enhance accuracy of at least one parser of the set of parsers by computing a conditional probability of a phrase of the information.
  - 11. The computer-implemented interface of claim 1, wherein the discourse manager is further configured to employ late modality fusion to integrate the information at a semantic level, wherein each of the first and second modalities has a respective semantic parser with an individual recognizer, the late modality fusion resulting in surface semantics represented by a common modality-independent semantic representation.
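
The modality substitution recited in claims 1 through 4 can be illustrated with a minimal Python sketch. It is not taken from the specification: the `Environment` class, the `speech_confidence` function, the linear noise-to-confidence mapping, the decibel bounds, and the 0.5 threshold are all assumptions chosen for illustration. Speech starts as the primary modality, a noisy environment lowers the confidence level and triggers feedback requesting a fallback modality, and the fallback stays primary until the environmental data changes.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Hypothetical environmental data: ambient noise in decibels."""
    noise_db: float


def speech_confidence(env: Environment, quiet_db: float = 40.0, loud_db: float = 80.0) -> float:
    """Map ambient noise to a confidence level in [0, 1].

    The linear falloff between quiet_db and loud_db is an assumption.
    """
    if env.noise_db <= quiet_db:
        return 1.0
    if env.noise_db >= loud_db:
        return 0.0
    return (loud_db - env.noise_db) / (loud_db - quiet_db)


class ResponseManager:
    """Tracks the primary modality and emits feedback when it changes."""

    def __init__(self, default: str = "speech", fallback: str = "pen", threshold: float = 0.5):
        self.default = default
        self.fallback = fallback
        self.threshold = threshold
        self.primary = default

    def update(self, env: Environment):
        """Re-evaluate the primary modality; return a feedback prompt or None."""
        confidence = speech_confidence(env)
        if confidence < self.threshold and self.primary == self.default:
            # Low confidence: request input in the second modality and make it primary.
            self.primary = self.fallback
            return f"Confidence {confidence:.2f} is low; please use {self.fallback} input."
        if confidence >= self.threshold and self.primary == self.fallback:
            # Environmental data changed: prompt to re-engage the first modality.
            self.primary = self.default
            return f"Conditions improved; {self.default} input is available again."
        return None


manager = ResponseManager()
for noise in (30.0, 75.0, 75.0, 35.0):
    prompt = manager.update(Environment(noise_db=noise))
    print(noise, manager.primary, prompt)
```

In this sketch the prompt is returned only on a transition, so repeated readings in the same noisy environment do not re-prompt the user.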

12. A computer-readable storage medium storing instructions, the instructions when executed by a computing device causing the computing device to perform operations comprising:
- receiving an input in a first modality as a primary modality;
- dynamically generating a first confidence level as a function of environmental data associated with the input, the environmental data comprising at least one of:
  - a user state, a device state, a context of a computer-implemented interface session, historical or current extrinsic information about the input or a source of the input, or a device capability;
- attributing a first weight to the input as a function of the first confidence level;
- based on the first weight, determining that the first modality is insufficient as an input and receiving at least one other input in a second modality different from the first modality as the primary modality;
- dynamically generating a second confidence level as a function of updated environmental data associated with the input;
- attributing a second weight to the input as a function of the second confidence level;
- based on the second weight, determining that the first modality has become sufficient and re-engaging the input in the first modality as the primary modality;
- analyzing the input and the at least one other input;
- at least one of determining or inferring an intent associated with the input and the at least one other input based on the analyzing; and
- performing late fusion on the input and the at least one other input to integrate the input and the at least one other input at a semantic level.
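
The weighting and late-fusion steps of claim 12 can be sketched as follows. This is an illustrative reading only: the `SemanticFrame` structure, the `weight_from_confidence` mapping (a simple clamp), and the "highest weight wins per slot" fusion rule are assumptions, not details from the patent. The point is that each modality is parsed separately into a common slot representation and combined only at the semantic level.

```python
from dataclasses import dataclass


@dataclass
class SemanticFrame:
    """Hypothetical modality-independent parse result."""
    modality: str
    slots: dict
    weight: float = 1.0


def weight_from_confidence(confidence: float) -> float:
    """Assumed mapping: clamp the confidence level into [0, 1] and use it as the weight."""
    return max(0.0, min(1.0, confidence))


def late_fusion(frames):
    """Fuse frames at the semantic level.

    Each slot takes its value from the highest-weight frame that fills it
    (an assumed rule; other combination schemes are possible).
    """
    fused = {}
    best_weight = {}
    for frame in frames:
        for slot, value in frame.slots.items():
            if frame.weight > best_weight.get(slot, -1.0):
                best_weight[slot] = frame.weight
                fused[slot] = value
    return fused


# Speech was recognized in a noisy setting (low weight); pen input is more reliable.
speech = SemanticFrame(
    "speech",
    {"action": "navigate", "destination": "Seattle"},
    weight_from_confidence(0.3),
)
pen = SemanticFrame("pen", {"destination": "SeaTac"}, weight_from_confidence(0.9))

intent = late_fusion([speech, pen])
print(intent)
```

Here the pen input overrides the low-weight speech value for `destination`, while the `action` slot, which only speech supplied, is retained.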

13. A method comprising:
- parsing inputs received from a plurality of sources into surface semantics represented in a semantic representation by utilizing a language model, each of the plurality of sources corresponding to a different modality;
- providing environmental data associated with at least one of the inputs, the data comprising one or both of current data or historical data;
- adapting the language model to enhance accuracy of the parsing by utilizing the environmental data to compute at least one environmentally-specific conditional probability of at least one phrase of the inputs received from the plurality of sources;
- utilizing the semantic representation to generate discourse semantics;
- utilizing the discourse semantics to synthesize one or more responses to the inputs received from the plurality of sources;
- further comprising:
  - generating, as a function of the environmental data, a confidence level for an intent associated with the inputs received from the plurality of sources;
  - evaluating a first input of the inputs, the first input having a first modality initially employed as a primary modality;
  - based on the generated confidence level, providing feedback to request a second input of the inputs having a second modality different from the first modality, and substituting the second modality for the first modality as the primary modality until the environmental data changes.
- Dependent claims (14, 15, 16, 17, 18, 19):
  - 14. The method of claim 13, further comprising utilizing one modality to complete or refine input associated with another modality.
  - 15. The method of claim 13, further comprising performing a mapping task associated with the inputs received from the plurality of sources, wherein performing the mapping task comprises:
    - computing a plurality of conditional probabilities of place names corresponding to the inputs based on heuristics, wherein the plurality of conditional probabilities comprise the at least one environmentally-specific conditional probability;
    - organizing the place names at a global level and local level for at least partly including in a recognition grammar;
    - pre-building and caching a local list of the place names corresponding to the local level; and
    - prefixing the recognition grammar with a single category of the place names.
  - 16. The method of claim 13, wherein the semantic representation comprises a common modality-independent semantic representation.
  - 17. The method of claim 13, wherein adapting the language model enhances speech recognition accuracy of at least one or more parsers utilized for parsing the inputs.
  - 18. The method of claim 13, wherein to generate the discourse semantics comprises:
    - identifying one or more correlations among the inputs;
    - interpreting the inputs based on the environmental data; and
    - based on the one or more identified correlations and the interpreted inputs, determining or inferring an intent associated with the inputs received from the plurality of sources.
  - 19. A computer-readable storage medium having stored thereon computer executable components for carrying out the method of claim 13.
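
The environmentally-specific conditional probabilities of claims 13 and 15 can be illustrated with a short sketch for place names. The claims say only that the probabilities are computed "based on heuristics"; the particular heuristic below (population scaled by an exponential decay in distance from the user), the `place_probabilities` and `local_list` names, the coordinates, and the 50 km scale are all assumptions made for illustration.

```python
import math


def place_probabilities(places, user_xy, scale_km=50.0):
    """Return a normalized P(place name | user location).

    places maps a name to (x_km, y_km, population). The score, population
    times exp(-distance / scale_km), is an assumed heuristic.
    """
    scores = {}
    for name, (x, y, population) in places.items():
        distance = math.hypot(x - user_xy[0], y - user_xy[1])
        scores[name] = population * math.exp(-distance / scale_km)
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


def local_list(probabilities, size):
    """Pick the most probable names, e.g. to pre-build a cached local grammar list."""
    return sorted(probabilities, key=probabilities.get, reverse=True)[:size]


# Hypothetical places on a flat map: (x_km, y_km, population).
places = {
    "Redmond": (0.0, 0.0, 50_000),
    "Seattle": (20.0, 0.0, 700_000),
    "Portland": (280.0, 0.0, 650_000),
}

near_redmond = place_probabilities(places, user_xy=(0.0, 0.0))
near_portland = place_probabilities(places, user_xy=(280.0, 0.0))
print(local_list(near_redmond, 2))
print(local_list(near_portland, 1))
```

The same place name receives a different probability depending on where the user is, which is the sense in which the language model is adapted to the environmental data in this sketch.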

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Yu, Dong; Deng, Li
Primary Examiner(s)
Smits, Talivaldis Ivars
Assistant Examiner(s)
Roberts, Shaun A

Application Number

US11/686,722
Publication Number

US 20080228496A1
Time in Patent Office

1,944 Days
Field of Search

704/275, 704/270
US Class Current

704/275
CPC Class Codes

G06F 2203/0381   Multimodal input, i.e. inte...

G06F 3/038   Control and interface arran...

G10L 15/24   Speech recognition using no...
