Techniques for disambiguating speech input using multimodal interfaces

US RE44,418 E1
Filed: 03/23/2012
Issued: 08/06/2013
Est. Priority Date: 12/10/2002
Status: Active Grant

Abstract

A technique is disclosed for disambiguating speech input for multimodal systems by using a combination of speech and visual I/O interfaces. When the user's speech input is not recognized with sufficiently high confidence, the user is presented with a set of possible matches using a visual display and/or speech output. The user then selects the intended input from the list of matches via one or more available input mechanisms (e.g., stylus, buttons, keyboard, mouse, or speech input). These techniques involve the combined use of speech and visual interfaces to correctly identify the user's speech input. The techniques disclosed herein may be utilized in computer devices such as PDAs, cellphones, desktop and laptop computers, tablet PCs, etc.
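As a rough illustration of the flow the abstract describes, the sketch below accepts the top recognition result when its confidence clears a threshold and otherwise hands the list of possible matches to the user. It is not taken from the patent: the `Hypothesis` type, the `resolve` function, the 0-to-1 confidence scale, and the `ask_user` callback (standing in for a stylus tap, key press, or spoken reply) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Hypothesis:
    text: str          # candidate transcription of the utterance
    confidence: float  # recognizer score; a 0-to-1 scale is assumed here


def resolve(hypotheses: Sequence[Hypothesis],
            accept_threshold: float,
            ask_user: Callable[[Sequence[str]], str]) -> str:
    """Return the top hypothesis if it is confident enough, else let the user pick."""
    if not hypotheses:
        raise ValueError("recognizer returned no hypotheses")
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    if ranked[0].confidence >= accept_threshold:
        return ranked[0].text
    # Not confident enough: offer the possible matches and use the user's choice.
    return ask_user([h.text for h in ranked])


hyps = [Hypothesis("Austin", 0.48), Hypothesis("Boston", 0.46)]
# Hypothetical stand-in for the user tapping the second entry in the list.
picked = resolve(hyps, accept_threshold=0.80, ask_user=lambda options: options[1])
print(picked)  # prints "Boston"
```

Passing the selection step in as a callback keeps the decision logic independent of whether the choice arrives through a visual or a voice interface.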

25 Claims

1. A system for disambiguating speech input using one of voice mode interaction, visual mode interaction, or a combination of voice mode interaction and visual mode interaction with an application comprising:
- a speech disambiguation mechanism resident on one of an end user device and a remote server, and accessed through said end user device possessing multimodal user interfaces, said speech disambiguation mechanism comprising:
  
  an options and parameters component for receiving and storing user parameters and receiving application parameters for controlling the speech disambiguation mechanism, wherein the speech disambiguation mechanism is controlled by parameters set by the user and parameters set by the application, and wherein the parameters include confidence thresholds governing unambiguous recognition and close matches;
  
  a speech recognition component that receives recorded audio, speech input or a combination of the recorded audio and the speech input through one of said multimodal user interfaces, and generates:
  
  a plurality of tokens corresponding to disambiguated words for presentation to the user; and
  
  for each of the one or more tokens, a confidence value indicative of the likelihood that a given token correctly represents the speech input;
  
  a selection component that identifies, according to a selection algorithm, two or more of the tokens to be presented to the user;
  
  one or more disambiguation components directing one or more of said multimodal user interfaces to present the alternatives to the user in one of voice mode, visual mode, or a combination of the voice mode and the visual mode, and directing the multimodal user interfaces to receive an alternative selected by the user in one of the voice mode, the visual mode, or a combination of the voice mode and the visual mode; and
  
  an output interface for communicating the selected alternative without translation of the speech input to the application as input.
- - 2. The system of claim 1, wherein the one or more disambiguation components perform said interaction by presenting the user with alternatives in a visual mode, and by receiving the user's selection in a visual mode.
  - 3. The system of claim 2, wherein the disambiguation components present the alternatives to the user in a visual form and allow the user to select from among the alternatives using a voice input.
  - 4. The system of claim 1, wherein the selection component filters the one or more tokens according to a set of parameters.
  - 5. The system of claim 4, wherein the set of parameters is user specified.
  - 10. The system of claim 1 further comprising a communication network, wherein the options and parameters component, the speech recognition component, the selection component, the one or more disambiguation components, and the output interface of the speech disambiguation mechanism are distributed on said communication network.
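One way to picture the options and parameters component and the selection component of claims 1, 4 and 5 is sketched below. The claims do not say how user and application parameters are combined or how the selection algorithm works, so the precedence rule (user settings override application settings), the `Params` fields, the default values, and the function names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Params:
    unambiguous_threshold: float = 0.85  # at or above: treated as unambiguous
    close_match_threshold: float = 0.30  # below: not offered as an alternative
    max_alternatives: int = 5            # cap on how many tokens are presented


def merge_params(app: Params, user_overrides: dict) -> Params:
    """Layer user settings over application settings (assumed precedence)."""
    return replace(app, **user_overrides)


def select_alternatives(tokens: list[tuple[str, float]], params: Params) -> list[str]:
    """Keep the close matches, best first, capped at max_alternatives."""
    close = [(tok, conf) for tok, conf in tokens
             if conf >= params.close_match_threshold]
    close.sort(key=lambda pair: pair[1], reverse=True)
    return [tok for tok, _ in close[:params.max_alternatives]]


app_params = Params(max_alternatives=3)                    # set by the application
params = merge_params(app_params, {"close_match_threshold": 0.40})  # set by the user
tokens = [("fair", 0.55), ("fare", 0.52), ("fear", 0.41), ("far", 0.12)]
alternatives = select_alternatives(tokens, params)
print(alternatives)  # prints ['fair', 'fare', 'fear']
```

Here "far" is dropped because its confidence falls below the user-specified close-match threshold, which corresponds to the filtering described in claims 4 and 5.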

6. A method of processing speech input using one of voice mode interaction, visual mode interaction, or a combination of voice mode and visual mode interaction with an application comprising:
- receiving and storing user parameters and receiving application parameters for controlling a speech disambiguation mechanism, wherein said speech disambiguation mechanism is resident on one of an end user device and a remote server, and accessed through said end user device possessing multimodal user interfaces;
  
  and receiving and storing user parameters and receiving application parameters for controlling the speech disambiguation mechanism, wherein both the user and the application can set the parameters to control said speech disambiguation mechanism, and wherein the parameters include confidence thresholds governing unambiguous recognition and close matches;
  
  receiving a speech input from the user through one of said multimodal user interfaces;
  
  determining whether the speech input is ambiguous;
  
  if the speech input is not ambiguous, communicating a token representative of the speech input to the application as input to the application; and
  
  if the speech input is ambiguous:
  
  selecting two or more tokens and presenting the tokens as alternatives to the user;
  
  directing the multimodal user interfaces to present the alternatives to the user in one of voice mode, visual mode, or a combination of the voice mode and the visual mode, and to present a selection of an alternative from the user from the plurality of alternatives presented to the user in one of the voice mode, the visual mode, or a combination of the voice mode and the visual mode; and
  
  communicating the selected alternative without translation of the speech input as input to the application.
- - 7. The method of claim 6, where the interaction comprises the concurrent use of said visual mode and said voice mode.
  - 8. The method of claim 7, wherein the interaction comprises the user selecting from among the plural alternatives using a combination of speech and visual-based input.
  - 9. The method of claim 6, wherein the interaction comprises the user selecting from among the plural alternatives using visual input.
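The presentation and selection steps of claim 6 and its dependent claims can be sketched as follows. This is only an illustration under stated assumptions: the `Mode` enum, the `present` and `receive_selection` functions, the prompt wording, and the `show`/`speak` callbacks (placeholders for a real display and speech synthesizer) are hypothetical and do not come from the patent.

```python
from enum import Enum
from typing import Callable, Optional, Sequence


class Mode(Enum):
    VOICE = "voice"
    VISUAL = "visual"
    BOTH = "both"


def present(alternatives: Sequence[str], mode: Mode,
            show: Callable[[str], None], speak: Callable[[str], None]) -> None:
    """Render the alternatives as a numbered list, a spoken prompt, or both."""
    if mode in (Mode.VISUAL, Mode.BOTH):
        show("\n".join(f"{i}. {alt}" for i, alt in enumerate(alternatives, start=1)))
    if mode in (Mode.VOICE, Mode.BOTH):
        speak("Did you mean " + ", or ".join(alternatives) + "?")


def receive_selection(alternatives: Sequence[str],
                      tapped_index: Optional[int] = None,
                      spoken: Optional[str] = None) -> Optional[str]:
    """Map a tap (visual input) or a spoken reply (voice input) to one alternative."""
    if tapped_index is not None and 0 <= tapped_index < len(alternatives):
        return alternatives[tapped_index]
    if spoken is not None:
        for alt in alternatives:
            if alt.lower() == spoken.strip().lower():
                return alt
    return None  # nothing usable; a caller might prompt again


shown, said = [], []  # capture output instead of driving real devices
options = ["Austin", "Boston"]
present(options, Mode.BOTH, show=shown.append, speak=said.append)
choice = receive_selection(options, spoken="boston")
print(choice)  # prints "Boston"
```

The value returned by `receive_selection` is one of the presented alternatives, unchanged, which is what would then be handed to the application as input.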

11. A method of processing speech input using one of voice mode interaction, visual mode interaction, or a combination of voice mode and visual mode interaction with an application comprising:
- receiving and storing user parameters and receiving application parameters for controlling a speech disambiguation mechanism, wherein said speech disambiguation mechanism is resident on a remote server, and accessed over a communication network using an end user device possessing multimodal user interfaces;
  
  and receiving and storing user parameters and receiving application parameters for controlling the speech disambiguation mechanism, wherein both the user and the application set the parameters to control said speech disambiguation mechanism, and wherein the parameters include confidence thresholds governing unambiguous recognition and close matches;
  
  receiving a speech input from the user through one of said multimodal user interfaces;
  
  determining whether the speech input is ambiguous;
  
  if the speech input is not ambiguous, communicating a token representative of the speech input to the application as input to the application; and
  
  if the speech input is ambiguous:
  
  selecting two or more tokens and presenting the tokens as alternatives to the user;
  
  directing the multimodal user interfaces to present the alternatives to the user in one of voice mode, visual mode, or a combination of the voice mode and the visual mode, and to present a selection of an alternative from the user from the plurality of alternatives presented to the user in one of the voice mode, the visual mode, or a combination of the voice mode and the visual mode; and
  
  communicating the selected alternative without translation of the speech input as input to the application.
- - 12. The method of claim 11, where the interaction comprises the concurrent use of said visual mode and said voice mode.
  - 13. The method of claim 12, wherein the interaction comprises the user selecting from among the plural alternatives using a combination of speech and visual-based input.
  - 14. The method of claim 11, wherein the interaction comprises the user selecting from among the plural alternatives using visual input.

15. A computing device configured to disambiguate speech data, the computing device comprising:
- a speech disambiguation component configured to be accessible via a user interface, said speech disambiguation component comprising:
  
  an options and parameters component configured to:
  
  receive and store user parameters; and
  
  receive application parameters for controlling the speech disambiguation component, wherein the speech disambiguation component is controlled based on in part parameters set by the user and parameters set by the application, wherein the application parameters include confidence thresholds governing unambiguous recognition and close matches;
  
  a speech recognition component configured to receive recorded audio, speech input or a combination of the recorded audio and the speech input through said user interface, the speech recognition component further configured to generate:
  
  a plurality of tokens corresponding to disambiguated words for presentation to the user; and
  
  confidence values indicative of a likelihood that a given token correctly represents the speech input;
  
  a selection component configured to identify two or more of the tokens to be presented to the user;
  
  a disambiguation component configured to cause said user interface to present one or more alternatives to the user and receive a selection of one of the alternatives in one of the voice mode, the visual mode, or a combination of the voice mode and the visual mode; and
  
  an output interface for communicating the selection without translation of the speech input to the application as input.
- - 16. The system of claim 15, wherein the disambiguation component is configured to perform interaction by presenting the user with alternatives in a visual mode and by receiving the selection in a visual mode.
  - 17. The system of claim 16, wherein the disambiguation component is configured to render the alternatives to the user in a visual form and allow selection of the alternatives using a voice input.
  - 18. The system of claim 15, wherein the selection component is configured to filter the tokens according to a set of selection parameters.
  - 19. The system of claim 18, wherein the set of selection parameters is user specified.

20. A method of processing speech input, the method comprising:
- receiving and storing, on a computing device, user parameters;
  
  receiving, on the computing device, application parameters for controlling a speech disambiguation function, wherein the user and application parameters are used in part to control said speech disambiguation function, and wherein the user and application parameters include confidence thresholds governing unambiguous recognition and close matches;
  
  receiving, on the computing device, a speech input;
  
  determining, on the computing device, whether the speech input is ambiguous;
  
  if the speech input is not ambiguous, communicating data representative of the speech input to an application as input to the application; and
  
  if the speech input is ambiguous:
  
  selecting, by the computing device, two or more alternatives representative of the speech input and presenting the alternatives to the user;
  
  presenting, by the computing device, the alternatives to the user in one of a voice mode, visual mode, or a combination of the voice mode and the visual mode and receiving a selection one of the alternatives; and
  
  communicating the received selection without translation of the speech input as input to the application.
- - 21. The method of claim 20, wherein the speech input is processed using the visual mode interaction.
  - 22. The method of claim 21, wherein said presenting is performed using a combination of speech and visual-based modes.

23. A method of processing speech input, the method comprising:
- receiving and storing speech disambiguation parameters, wherein the speech disambiguation parameters include user-defined and application-defined parameters and wherein the speech disambiguation parameters include confidence thresholds pertaining to speech recognition ambiguity;
  
  receiving a speech input;
  
  determining whether the speech input is ambiguous based on the speech disambiguation parameters;
  
  if the speech input is not ambiguous, communicating a token representative of the speech input to an application as input to the application; and
  
  if the speech input is ambiguous:
  
  selecting two or more tokens representative of the speech input and presenting the tokens as alternatives to the user in one of a voice mode, visual mode, or a combination of the voice mode and the visual mode and receiving a selection of an alternative from the user from the plurality of alternatives presented to the user; and
  
  communicating the selection without translation of the speech input as input to the application.
- - 24. The method of claim 23, wherein said presenting is performed using a combination of speech and visual-based modes.

25. A computer readable storage medium comprising computer readable instructions, the medium comprising:
- instructions for receiving and storing speech disambiguation parameters, wherein the speech disambiguation parameters include user-defined and application-defined parameters and wherein the speech disambiguation parameters include confidence thresholds pertaining to speech recognition ambiguity;
  
  instructions for receiving a speech input;
  
  instructions for determining whether the speech input is ambiguous based on the speech disambiguation parameters;
  
  instructions for, if the speech input is not ambiguous, communicating a token representative of the speech input to an application as input to the application; and
  
  instructions for, if the speech input is ambiguous:
  
  selecting two or more tokens representative of the speech input and presenting the tokens as alternatives to the user in one of a voice mode, visual mode, or a combination of the voice mode and the visual mode and receiving a selection of an alternative from the user from the plurality of alternatives presented to the user; and
  
  communicating the selection without translation of the speech input as input to the application.

Current Assignee: Gula Consulting Limited Liability Company (Intellectual Ventures LLC)
Original Assignee: Waloomba Tech Limited LLC (Intellectual Ventures LLC)
Inventors: Dominach, Richard F.; Sibal, Sandeep; Isukapalli, Sastry; Vaidya, Shirish
Primary Examiner(s): Smits, Talivaldis Ivars
Application Number: US13/429,187
Time in Patent Office: 501 Days
Field of Search: 704/9, 704/235, 704/251, 704/270
US Class Current: 704/235
CPC Class Codes: G10L 15/22 Procedures used during a sp...
