USE OF METADATA TO POST PROCESS SPEECH RECOGNITION OUTPUT

US 20090248415A1
Filed: 03/31/2009
Published: 10/01/2009
Est. Priority Date: 03/31/2008
Status: Active Grant


5 Assignments


0 Petitions


Abstract

A method of utilizing metadata stored in a computer-readable medium to assist in the conversion of an audio stream to a text stream. The method compares personally identifiable data, such as a user's electronic address book and/or Caller/Recipient ID information (in the case of processing voice mail to text), to the n-best results generated by a speech recognition engine for each word that the engine outputs. A goal of this comparison is to correct possible misrecognitions: a spoken proper noun, such as a person's or company's name, is replaced with its proper textual form, and a spoken phone number is replaced with a correctly formatted phone number in Arabic numerals. This improves the overall accuracy of the voice recognition system's output.
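The n-best comparison in the abstract can be illustrated with a minimal Python sketch. It assumes each recognized word position arrives as a best-first list of hypotheses and that the metadata has already been flattened into a list of terms, such as contact names. `WordSlot`, `build_lexicon` and `correct_with_metadata` are hypothetical names, and exact case-insensitive matching is an assumption, since the abstract does not specify a matching rule.

```python
from dataclasses import dataclass


@dataclass
class WordSlot:
    """One output position from the recognizer: hypotheses ordered best-first (hypothetical shape)."""
    nbest: list[str]


def build_lexicon(metadata_terms: list[str]) -> dict[str, str]:
    """Map a lower-cased form to the canonical spelling found in the metadata."""
    return {term.lower(): term for term in metadata_terms}


def correct_with_metadata(slots: list[WordSlot], metadata_terms: list[str]) -> list[str]:
    """For each slot, prefer the first n-best hypothesis that matches a metadata term.

    If no hypothesis matches, the recognizer's top choice is kept unchanged.
    """
    lexicon = build_lexicon(metadata_terms)
    output = []
    for slot in slots:
        chosen = slot.nbest[0]  # default: highest-confidence hypothesis
        for hypothesis in slot.nbest:
            canonical = lexicon.get(hypothesis.lower())
            if canonical is not None:
                chosen = canonical  # metadata spelling wins, e.g. "Juan"
                break
        output.append(chosen)
    return output
```

For example, if the slots are `["call"]`, `["jon", "john", "juan"]`, `["back"]` and the address book contains "Juan", the output is `["call", "Juan", "back"]`. Exact matching keeps the sketch simple. A practical system would also need a guard against contact names that are ordinary words (e.g. "Will").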

227 Citations

44 Claims

1. A method of utilizing metadata stored in a computer-readable medium to assist in the conversion of spoken audio input, received by a hand-held mobile communication device, into a textual representation for display on the hand-held mobile communication device, comprising the steps of:
- initializing a hand-held mobile communication device so that the hand-held mobile communication device is capable of communicating with a backend server via a data channel of the hand-held mobile communication device;
- upon receipt of an utterance by the hand-held mobile communication device, recording and storing an audio message, representative of the utterance, in the hand-held mobile communication device in the form of binary audio data;
- transmitting, via the data channel, the recorded and stored binary audio data, representing the utterance, from the hand-held mobile communication device to a backend server through a client-server communication protocol;
- in conjunction with the transmission of the recorded and stored binary audio data, transmitting metadata from the hand-held mobile communication device to the backend server through the client-server communication protocol;
- converting the transmitted binary audio data into a textual representation of the utterance in the backend server;
- comparing at least one portion of the textual representation to at least one portion of the metadata;
- replacing at least one portion of the textual representation with at least one portion of the metadata; and
- sending the converted textual representation of the utterance, with metadata replacement, from the server back to the hand-held mobile communication device.
- Dependent claims:
  - 2. The method of claim 1, further comprising a step of processing the converted textual representation of the utterance with a Digit Filter that substitutes Arabic numerals for words representing particular numbers or that are homophones of words representing particular numbers.
  - 3. The method of claim 2, wherein the step of processing the converted textual representation of the utterance with a Digit Filter occurs before the comparing and replacing steps.
  - 4. The method of claim 2, wherein the step of processing the converted textual representation of the utterance with a Digit Filter occurs after the comparing and replacing steps.
  - 5. The method of claim 1, further comprising a step of processing the converted textual representation of the utterance with a Telephone Filter that formats the Arabic numerals of a telephone number into a conventional format.
  - 6. The method of claim 5, wherein the step of processing the converted textual representation of the utterance with a Telephone Filter occurs before the comparing and replacing steps.
  - 7. The method of claim 5, wherein the step of processing the converted textual representation of the utterance with a Telephone Filter occurs after the comparing and replacing steps.
  - 8. The method of claim 1, further comprising a step of forwarding the converted textual representation, with metadata replacement, from the hand-held mobile communication device to one or more recipients.
  - 9. The method of claim 1, further comprising a step of displaying the converted textual representation, with metadata replacement, on the hand-held mobile communication device.
  - 10. The method of claim 1, wherein the client device comprises a mobile phone.
  - 11. The method of claim 1, wherein the client-server communication protocol is HTTP and/or HTTPS.
  - 12. The method of claim 1, wherein the client-server communication protocol is UDP.
  - 13. The method of claim 1, wherein the backend server comprises an automatic speech recognition (ASR) engine.
  - 14. The method of claim 13, wherein the ASR engine utilizes a speech recognition algorithm.
  - 15. The method of claim 14, wherein the speech recognition algorithm comprises a grammar algorithm and/or a transcription algorithm.
  - 16. The method of claim 13, wherein the backend server includes a text-to-speech engine (TTS) for generating an audio message from a text message.
  - 17. The method of claim 1, wherein the text stream comprises a highest-confidence string, and at least one alternative string.
  - 18. The method of claim 1, wherein the metadata is stored on a mobile phone.
  - 19. The method of claim 1, wherein the metadata is stored in an address book.
  - 20. The method of claim 1, wherein the metadata is an alphanumeric string.
  - 21. The method of claim 1, wherein the metadata is stored in a contact list.
  - 22. The method of claim 1, wherein the metadata is stored in a digital or electronic calendar.
  - 23. The method of claim 1, wherein the metadata is a collation of data stored in different locations.
  - 24. The method of claim 1, wherein the metadata is extracted from an incoming phone call.
  - 25. The method of claim 1, wherein the metadata is Caller ID data.
  - 26. The method of claim 1, wherein the metadata is Recipient ID data.
  - 27. The method of claim 1, wherein the metadata comprises Arabic numerals.
  - 28. The method of claim 1, wherein the binary audio data is a binary file.
  - 29. The method of claim 1, wherein the binary audio data is a .mp3 file.
  - 30. The method of claim 1, wherein the binary audio data is a .wav file.
  - 31. The method of claim 1, wherein the binary audio data and the metadata form a single data stream.
  - 32. The method of claim 31, wherein the data stream is compressed.
  - 33. The method of claim 31, wherein the data stream is encrypted.
  - 34. The method of claim 1, further comprising a step of displaying advertising messages and/or icons on the hand-held mobile communication device according to keywords contained in the textual representation of the utterance, wherein the keywords are associated with the advertising messages and/or icons.
  - 35. The method of claim 1, further comprising the additional step of locating a geospatial position of the hand-held mobile communication device through a global positioning system (GPS).
  - 36. The method of claim 35, further comprising the additional step of listing locations, proximate to the position of the client device, of a target of interest presented in the converted text stream.
  - 37. The method of claim 1, wherein the backend server comprises a plurality of applications.
  - 38. The method of claim 37, wherein the client device comprises a keypad having a plurality of buttons, configured such that each button is associated with one of the plurality of applications.
  - 39. The method of claim 37, wherein the client device comprises a user interface (UI) having a plurality of tabs configured so that each tab is associated with a plurality of user preferences.
  - 40. The method of claim 37, wherein the step of initializing the client device comprises the steps of:
    - (a) initializing a desired application from the client device; and
    - (b) logging into a client account in the backend server from the client device.
  - 41. The method of claim 37, wherein the backend server includes an ad filter, an SMS filter, an obscenity filter, a number filter, a date filter, and a currency filter.
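Claims 2 and 5 name a Digit Filter and a Telephone Filter but do not specify their rules. The following Python sketch shows one possible reading and is not the patented implementation. The word lists, the rule that a homophone such as "to" only becomes a digit when flanked by number words, and the North American output format are all assumptions. `digit_filter` and `telephone_filter` are hypothetical names.

```python
import re

# Hypothetical vocabularies: the claims do not enumerate which words count.
NUMBER_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
HOMOPHONES = {"to": "2", "too": "2", "for": "4", "ate": "8", "won": "1"}


def digit_filter(tokens: list[str]) -> list[str]:
    """Substitute Arabic numerals for number words and merge adjacent digits.

    A homophone ("to", "for", ...) is converted only when the token before it
    is already a digit and the token after it is a number word, so ordinary
    phrases such as "talk to me" are left alone.
    """
    lowered = [t.lower() for t in tokens]
    digits = [NUMBER_WORDS.get(t) for t in lowered]
    for k in range(1, len(lowered) - 1):
        if (lowered[k] in HOMOPHONES and digits[k - 1] is not None
                and lowered[k + 1] in NUMBER_WORDS):
            digits[k] = HOMOPHONES[lowered[k]]

    out: list[str] = []
    run = ""
    for token, digit in zip(tokens, digits):
        if digit is not None:
            run += digit          # extend the current numeral
        else:
            if run:
                out.append(run)   # close the numeral before a non-number word
                run = ""
            out.append(token)
    if run:
        out.append(run)
    return out


def telephone_filter(tokens: list[str]) -> list[str]:
    """Format 10- and 7-digit numerals in a North American style (an assumption)."""
    out: list[str] = []
    for token in tokens:
        if re.fullmatch(r"[0-9]{10}", token):
            out.append(f"({token[:3]}) {token[3:6]}-{token[6:]}")
        elif re.fullmatch(r"[0-9]{7}", token):
            out.append(f"{token[:3]}-{token[3:]}")
        else:
            out.append(token)
    return out
```

For instance, `telephone_filter(digit_filter("call five five five one two one two".split()))` yields `["call", "555-1212"]`. Both filters are plain token-list functions, so they can run before or after the metadata comparison and replacement steps, matching the alternatives in claims 3/4 and 6/7. The sketch always treats "oh" as zero, which is a simplification.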

42. A method of utilizing metadata stored in a computer-readable medium to assist in the conversion of spoken audio input, received by a hand-held mobile communication device, into a textual representation for display on the hand-held mobile communication device, comprising the steps of:
- initializing a hand-held mobile communication device so that the hand-held mobile communication device is capable of communicating with a backend server via a data channel of the hand-held mobile communication device;
- upon receipt of an utterance by the hand-held mobile communication device, recording and storing an audio message, representative of the utterance, in the hand-held mobile communication device in the form of binary audio data;
- transmitting, via the data channel, the recorded and stored binary audio data, representing the utterance, from the hand-held mobile communication device to a backend server through a client-server communication protocol;
- storing metadata, associated with the hand-held mobile communication device, at the backend server;
- converting the transmitted binary audio data into a textual representation of the utterance in the backend server;
- comparing at least one portion of the textual representation to at least one portion of the metadata;
- replacing at least one portion of the textual representation with at least one portion of the metadata; and
- sending the converted textual representation of the utterance, with metadata replacement, from the server back to the hand-held mobile communication device.
- Dependent claims:
  - 43. The method of claim 42, further comprising a step of forwarding the converted textual representation, with metadata replacement, from the hand-held mobile communication device to one or more recipients.
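Claim 42 differs from claim 1 in that the metadata is already stored at the backend server, keyed to the device, instead of being uploaded with each utterance. The Python sketch below shows that arrangement under stated assumptions. `MetadataStore`, `replace_from_store`, the device identifier string and the 0.8 similarity cutoff are all hypothetical. The comparison uses `difflib.get_close_matches` (edit-style similarity) as a stand-in technique, because the claim does not say how the comparison is performed.

```python
import difflib


class MetadataStore:
    """Hypothetical server-side store of per-device metadata (e.g. synced contact names)."""

    def __init__(self) -> None:
        self._by_device: dict[str, list[str]] = {}

    def put(self, device_id: str, terms: list[str]) -> None:
        self._by_device[device_id] = list(terms)

    def get(self, device_id: str) -> list[str]:
        return self._by_device.get(device_id, [])


def replace_from_store(text: str, device_id: str, store: MetadataStore,
                       cutoff: float = 0.8) -> str:
    """Replace each word that closely resembles a stored metadata term with that term."""
    canonical = {t.lower(): t for t in store.get(device_id)}
    out = []
    for word in text.split():
        match = difflib.get_close_matches(word.lower(), list(canonical), n=1, cutoff=cutoff)
        out.append(canonical[match[0]] if match else word)
    return " ".join(out)
```

With `store.put("device-123", ["Jablokov", "Strohofer"])`, the transcript "meeting with jablokof tomorrow" becomes "meeting with Jablokov tomorrow". A device with no stored metadata gets its text back unchanged. Punctuation attached to a word is not stripped in this sketch, so it can prevent a match.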

44. A system for utilizing metadata stored in a computer-readable medium to assist in the conversion of spoken audio input, received by a hand-held mobile communication device, into a textual representation for display on the hand-held mobile communication device, comprising:
- a hand-held mobile communication device;
- a backend server; and
- software in the hand-held mobile communication device and backend server for causing the hand-held mobile communication device and/or the backend server to perform functions comprising:
  - initializing the hand-held mobile communication device so that the hand-held mobile communication device is capable of communicating with the backend server via a data channel of the hand-held mobile communication device;
  - upon receipt of an utterance by the hand-held mobile communication device, recording and storing an audio message, representative of the utterance, in the hand-held mobile communication device in the form of binary audio data;
  - transmitting, via the data channel, the recorded and stored binary audio data, representing the utterance, from the hand-held mobile communication device to a backend server through a client-server communication protocol;
  - in conjunction with the transmission of the recorded and stored binary audio data, transmitting metadata from the hand-held mobile communication device to the backend server through the client-server communication protocol;
  - converting the transmitted binary audio data into a textual representation of the utterance in the backend server;
  - comparing at least one portion of the textual representation to at least one portion of the metadata;
  - replacing at least one portion of the textual representation with at least one portion of the metadata; and
  - sending the converted textual representation of the utterance, with metadata replacement, from the server back to the hand-held mobile communication device.
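Claim 44's system transmits the metadata "in conjunction with" the binary audio, and dependent claims 31 and 32 describe the two forming a single, optionally compressed, data stream. The following Python sketch illustrates one possible framing and is not a format defined by the patent. It assumes a 4-byte big-endian length prefix, UTF-8 JSON metadata, the raw audio bytes (e.g. a .wav file) and zlib compression over the whole frame. `pack_request` and `unpack_request` are hypothetical names, and the metadata keys in the usage example are illustrative only.

```python
import json
import struct
import zlib


def pack_request(audio: bytes, metadata: dict) -> bytes:
    """Client side: combine metadata and binary audio into one compressed payload.

    Hypothetical layout before compression:
      4-byte big-endian length N | N bytes of UTF-8 JSON metadata | audio bytes
    """
    header = json.dumps(metadata).encode("utf-8")
    frame = struct.pack(">I", len(header)) + header + audio
    return zlib.compress(frame)


def unpack_request(payload: bytes) -> tuple[bytes, dict]:
    """Server side: reverse pack_request and return (audio, metadata)."""
    frame = zlib.decompress(payload)
    if len(frame) < 4:
        raise ValueError("frame too short for length prefix")
    (header_len,) = struct.unpack(">I", frame[:4])
    if len(frame) < 4 + header_len:
        raise ValueError("frame shorter than declared metadata length")
    header = frame[4:4 + header_len]
    audio = frame[4 + header_len:]
    return audio, json.loads(header.decode("utf-8"))
```

A client could send `pack_request(wav_bytes, {"caller_id": "+15555551212", "contacts": ["Jablokov"]})` as the body of an HTTPS request (claim 11). Under that assumption, TLS would also cover the encryption described in claim 33, so the sketch does not encrypt the payload itself.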


Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
YAP INC. (Amazon.com, Inc.)
Inventors
Strohofer, Clifford J.; White, Marc; Jablokov, Igor Roditis; Jablokov, Victor Roditis

Granted Patent

US 8,676,577 B2
US Class Current

704/251
CPC Class Codes

G10L 15/30 Distributed recognition, e....

G10L 2015/228 of application context
