METHODS AND APPARATUS FOR IMPLEMENTING DISTRIBUTED MULTI-MODAL APPLICATIONS
First Claim
1. A method performed by a client device, the method comprising the steps of:
rendering a visual display that includes at least one multi-modal display element for which input data is receivable by the client device through a visual modality and a voice modality, wherein a visual view focus is set to a first multi-modal display element of the at least one multi-modal display element;
sending a first voice event request to an application server, wherein the first voice event request is an asynchronous hypertext transfer protocol (HTTP) request;
receiving an audio signal that may represent a user utterance via the voice modality;
sending uplink audio data representing the audio signal to a speech recognizer;
receiving a voice event response from the application server in response to the voice event request; and
sending a second voice event request to the application server in response to receiving the voice event response.
Abstract
Embodiments include methods and apparatus for synchronizing data and focus between visual and voice views associated with distributed multi-modal applications. An embodiment includes a client device adapted to render a visual display that includes at least one multi-modal display element for which input data is receivable through a visual modality and a voice modality. When the client detects a user utterance via the voice modality, the client sends uplink audio data representing the utterance to a speech recognizer. An application server receives a speech recognition result generated by the speech recognizer, and sends a voice event response to the client. The voice event response is sent as a response to an asynchronous HTTP voice event request previously sent to the application server by the client. The client may then send another voice event request to the application server in response to receiving the voice event response.
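The exchange described in the abstract is an HTTP long-polling pattern: the client keeps an asynchronous request open at the application server, the server answers it only when a speech recognition result arrives, and the response prompts the client to immediately open a new request. The following is a minimal single-process simulation of that cycle; the class and method names (`ApplicationServer`, `voice_event_request`, `deliver_recognition_result`) and the sample utterances are illustrative, not taken from the patent.

```python
# Minimal simulation of the asynchronous "voice event request" cycle.
# A blocking queue.get() stands in for an HTTP request held open by
# the server until a speech recognition result is available.
import queue
import threading

class ApplicationServer:
    """Holds each voice event request open until a speech
    recognition result arrives (simulated HTTP long polling)."""
    def __init__(self):
        self._pending = queue.Queue()   # open voice event requests

    def voice_event_request(self):
        """Client-side call: blocks until the server has a response,
        like an open asynchronous HTTP request."""
        reply = queue.Queue(maxsize=1)
        self._pending.put(reply)
        return reply.get()

    def deliver_recognition_result(self, result):
        """Called when the speech recognizer produces a result:
        answer the oldest open voice event request."""
        reply = self._pending.get()
        reply.put({"voice_event": result})

server = ApplicationServer()
results = []

def client():
    # Claimed cycle: each voice event response triggers a new request,
    # so the server always has an open channel to push the next event.
    for _ in range(2):
        response = server.voice_event_request()
        results.append(response["voice_event"])

t = threading.Thread(target=client)
t.start()
server.deliver_recognition_result("pizzas")  # hypothetical recognized words
server.deliver_recognition_result("two")
t.join()
print(results)  # ['pizzas', 'two']
```

Blocking on a per-request reply queue mirrors the key property of the claimed design: the server, not the client, decides when the open request is answered.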
21 Claims
1. A method performed by a client device, the method comprising the steps of:
rendering a visual display that includes at least one multi-modal display element for which input data is receivable by the client device through a visual modality and a voice modality, wherein a visual view focus is set to a first multi-modal display element of the at least one multi-modal display element;
sending a first voice event request to an application server, wherein the first voice event request is an asynchronous hypertext transfer protocol (HTTP) request;
receiving an audio signal that may represent a user utterance via the voice modality;
sending uplink audio data representing the audio signal to a speech recognizer;
receiving a voice event response from the application server in response to the voice event request; and
sending a second voice event request to the application server in response to receiving the voice event response.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10)
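The client-side steps of claim 1 can be walked through in order with the transport stubbed out. In this sketch the names (`client_steps`, `send_voice_event_request`, `capture_audio`, and the `{"recognized": ...}` payload) are hypothetical; a pending voice event request is modeled as a `concurrent.futures.Future` that the server side resolves.

```python
# Illustrative walk-through of the claimed client-side steps.
# Each parameter is a stub standing in for real rendering, HTTP,
# audio capture, and recognizer plumbing.
from concurrent.futures import Future

def client_steps(render, send_voice_event_request,
                 send_uplink_audio, capture_audio):
    render("display with multi-modal elements")  # visual view focus set
    pending = send_voice_event_request()         # first async HTTP request
    audio = capture_audio()                      # may be a user utterance
    send_uplink_audio(audio)                     # uplink to speech recognizer
    response = pending.result()                  # voice event response arrives
    next_pending = send_voice_event_request()    # second request re-arms
    return response, next_pending                # the open channel

# --- stub wiring for a dry run ---
fut1, fut2 = Future(), Future()
futures = iter([fut1, fut2])
fut1.set_result({"recognized": "cheese"})        # hypothetical result
response, nxt = client_steps(
    render=lambda ui: None,
    send_voice_event_request=lambda: next(futures),
    send_uplink_audio=lambda audio: None,
    capture_audio=lambda: b"\x00\x01",
)
print(response)  # {'recognized': 'cheese'}
```

The point the sketch makes is ordering: the first voice event request is sent before any utterance exists, so the server already holds an open request when the recognition result comes back.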
11. A method performed by an application server, the method comprising the steps of:
receiving a first voice event request from a client device that has rendered a visual display that includes at least one multi-modal display element for which input data is receivable by the client device through a visual modality and a voice modality, wherein the first voice event request is an asynchronous hypertext transfer protocol (HTTP) request;
receiving a speech recognition result from a voice server, wherein the speech recognition result represents a result of a speech recognition process performed on uplink audio data sent by the client device to a speech recognizer;
sending a voice event response to the client device in response to the first voice event request; and
receiving a second voice event request from the client device in response to sending the voice event response.
(Dependent claims: 12, 13, 14, 15)
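The application-server side of claim 11 can be sketched as a small event-driven endpoint: an incoming voice event request is parked unanswered, and a later recognition result from the voice server releases it. All names here (`VoiceEventEndpoint`, the request IDs, the sample result) are illustrative, not from the patent.

```python
# Server-side counterpart of claim 11, sketched as a tiny state
# machine. One voice event request at a time is held open; a
# recognition result from the voice server releases it.
class VoiceEventEndpoint:
    def __init__(self):
        self.parked_request = None   # the open (long-polled) request
        self.sent_responses = []     # (request, response) pairs sent

    def on_voice_event_request(self, request):
        # Do not answer yet: hold the asynchronous HTTP request open.
        self.parked_request = request

    def on_recognition_result(self, result):
        # A result arrived from the voice server: answer the parked
        # request, which prompts the client to send a new one.
        request, self.parked_request = self.parked_request, None
        self.sent_responses.append((request, {"voice_event": result}))

endpoint = VoiceEventEndpoint()
endpoint.on_voice_event_request("req-1")           # first voice event request
endpoint.on_recognition_result("large pepperoni")  # from the voice server
endpoint.on_voice_event_request("req-2")           # client re-arms the channel
print(endpoint.sent_responses)
```

After the run, `req-1` has been answered with the recognition result and `req-2` sits parked, matching the claimed request/response/request sequence.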
16. A system comprising:
a client device adapted to render a visual display that includes at least one multi-modal display element for which input data is receivable by the client device through a visual modality and a voice modality, wherein a visual view focus is set to a first multi-modal display element of the at least one multi-modal display element,
send a first voice event request to an application server, wherein the first voice event request is an asynchronous hypertext transfer protocol (HTTP) request,
receive an audio signal that may represent a user utterance via the voice modality,
send uplink audio data representing the audio signal to a speech recognizer,
receive a voice event response from the application server in response to the voice event request, and
send a second voice event request to the application server in response to receiving the voice event response.
(Dependent claims: 17, 18, 19, 20, 21)
Specification