Methods and apparatus for implementing distributed multi-modal applications
First Claim
1. A method performed by a client device, the method comprising the steps of:
rendering a visual display that includes at least one multi-modal display element for which input data is receivable by the client device through a visual modality and a voice modality, wherein the client device maintains knowledge of a visual view focus, which initially is set to a first multi-modal display element of the at least one multi-modal display element;
sending a first voice event request to an application server to establish a connection between the client device and the application server, wherein the first voice event request is an asynchronous hypertext transfer protocol (HTTP) request that will remain pending at the application server until a voice event occurs so that the connection remains established;
after sending the first voice event request, receiving an audio signal that may represent a user utterance via the voice modality;
sending uplink audio data representing the audio signal to a speech recognizer that interprets the uplink audio data based on a voice view focus, wherein the voice view focus initially is set to a portion of a speech dialog associated with the first multi-modal display element;
receiving a voice event response from the application server in response to the first voice event request and in response to the application server having received an indication that the voice event has occurred;
in response to receiving the voice event response, updating the visual view focus to a new visual view focus; and
sending a second voice event request to the application server in response to receiving the voice event response, wherein the second voice event request will remain pending at the application server until a second voice event occurs.
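The client-side method above amounts to a long-poll loop: keep an asynchronous request pending at the application server, block until a voice event resolves it, update the visual view focus from the response, and immediately re-issue the request. The sketch below models one iteration of that loop in-process; it is an illustrative assumption, not the patented implementation. A blocking queue stands in for the pending HTTP request, and the stub class, function names, and focus values (`name_field`, `city_field`) are invented for the example.

```python
import queue
import threading

class ApplicationServerStub:
    """Hypothetical stand-in for the application server. A blocking queue
    models the asynchronous HTTP voice event request that remains pending
    until a voice event occurs."""

    def __init__(self):
        self._events = queue.Queue()

    def voice_event_request(self):
        # Remains "pending" (blocks) until a voice event is posted,
        # keeping the logical connection established in the meantime.
        return self._events.get()

    def post_voice_event(self, new_focus):
        # Models the server learning that the speech recognizer produced
        # a result for the current voice view focus.
        self._events.put({"new_focus": new_focus})

def run_client_once(server):
    """One iteration of the claimed client loop (focus names assumed)."""
    visual_focus = "name_field"  # initial visual view focus
    # Simulate a recognition result arriving while the request pends:
    threading.Timer(0.01, server.post_voice_event, args=("city_field",)).start()
    response = server.voice_event_request()   # first voice event request
    visual_focus = response["new_focus"]      # update the visual view focus
    # The claim's final step, re-issuing a second voice event request so
    # that one is always pending, would follow here.
    return visual_focus
```

In a real deployment the blocking call would be an HTTP request held open by the server, and the timer would be replaced by the recognizer reporting a result for the user's utterance.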
Abstract
Embodiments include methods and apparatus for synchronizing data and focus between visual and voice views associated with distributed multi-modal applications. An embodiment includes a client device adapted to render a visual display that includes at least one multi-modal display element for which input data is receivable through a visual modality and a voice modality. When the client detects a user utterance via the voice modality, the client sends uplink audio data representing the utterance to a speech recognizer. An application server receives a speech recognition result generated by the speech recognizer, and sends a voice event response to the client. The voice event response is sent as a response to an asynchronous HTTP voice event request previously sent to the application server by the client. The client may then send another voice event request to the application server in response to receiving the voice event response.
21 Claims
1. A method performed by a client device, the method comprising the steps of:
rendering a visual display that includes at least one multi-modal display element for which input data is receivable by the client device through a visual modality and a voice modality, wherein the client device maintains knowledge of a visual view focus, which initially is set to a first multi-modal display element of the at least one multi-modal display element;
sending a first voice event request to an application server to establish a connection between the client device and the application server, wherein the first voice event request is an asynchronous hypertext transfer protocol (HTTP) request that will remain pending at the application server until a voice event occurs so that the connection remains established;
after sending the first voice event request, receiving an audio signal that may represent a user utterance via the voice modality;
sending uplink audio data representing the audio signal to a speech recognizer that interprets the uplink audio data based on a voice view focus, wherein the voice view focus initially is set to a portion of a speech dialog associated with the first multi-modal display element;
receiving a voice event response from the application server in response to the first voice event request and in response to the application server having received an indication that the voice event has occurred;
in response to receiving the voice event response, updating the visual view focus to a new visual view focus; and
sending a second voice event request to the application server in response to receiving the voice event response, wherein the second voice event request will remain pending at the application server until a second voice event occurs.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
11. A method performed by an application server, the method comprising the steps of:
receiving, from a client device that has rendered a visual display that includes at least one multi-modal display element for which input data is receivable by the client device through a visual modality and a voice modality, a first voice event request to establish a connection between the client device and the application server, wherein the first voice event request is an asynchronous hypertext transfer protocol (HTTP) request that will remain pending at the application server until a voice event occurs so that the connection remains established;
after the first voice event request is received, receiving a speech recognition result from a voice server, wherein the speech recognition result represents a result of a speech recognition process performed on uplink audio data sent by the client device to a speech recognizer that interprets the uplink audio data based on a voice view focus, wherein the voice view focus initially is set to a portion of a speech dialog associated with a first multi-modal display element of the at least one multi-modal display element;
sending a voice event response to the client device in response to the first voice event request and in response to the application server having received the speech recognition result, wherein the voice event response causes the client device to update a visual view focus; and
receiving a second voice event request from the client device in response to sending the voice event response, wherein the second voice event request will remain pending at the application server until a second voice event occurs.
- View Dependent Claims (12, 13, 14, 15)
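On the server side, the key behavior recited in claim 11 is withholding the HTTP response until the speech recognition result arrives from the voice server. The sketch below models that with a condition variable; it is an assumed illustration, not the claimed apparatus, and the class and method names are invented for the example.

```python
import threading

class VoiceEventChannel:
    """Hypothetical model of the application-server side: a condition
    variable holds the asynchronous voice event request open until a
    speech recognition result arrives from the voice server."""

    def __init__(self):
        self._cond = threading.Condition()
        self._result = None

    def handle_voice_event_request(self):
        # Called when the client's HTTP request arrives. The request is
        # not answered immediately; it pends here, so the connection
        # between client and application server remains established.
        with self._cond:
            while self._result is None:
                self._cond.wait()
            result, self._result = self._result, None
            # The response instructs the client to update its visual
            # view focus to match the recognized input.
            return {"recognition": result, "update_focus": True}

    def on_recognition_result(self, result):
        # Called when the voice server delivers the recognizer's output;
        # this is the "voice event" that resolves the pending request.
        with self._cond:
            self._result = result
            self._cond.notify()
```

After responding, the server would expect the client to issue a second voice event request, restoring a pending request for the next voice event.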
16. A system comprising:
a client device adapted to render a visual display that includes at least one multi-modal display element for which input data is receivable by the client device through a visual modality and a voice modality, wherein the client device maintains knowledge of a visual view focus, which initially is set to a first multi-modal display element of the at least one multi-modal display element,
send a first voice event request to an application server to establish a connection between the client device and the application server before an audio signal is received via the voice modality, wherein the first voice event request is an asynchronous hypertext transfer protocol (HTTP) request that will remain pending at the application server until a voice event occurs so that the connection remains established,
receive an audio signal that may represent a user utterance via the voice modality,
send uplink audio data representing the audio signal to a speech recognizer that interprets the uplink audio data based on a voice view focus, wherein the voice view focus initially is set to a portion of a speech dialog associated with the first multi-modal display element,
receive a voice event response from the application server in response to the first voice event request and in response to the application server having received an indication that the voice event has occurred,
in response to receiving the voice event response, update the visual view focus to a new visual view focus, and
send a second voice event request to the application server in response to receiving the voice event response, wherein the second voice event request will remain pending at the application server until a second voice event occurs.
- View Dependent Claims (17, 18, 19, 20, 21)
Specification