Combined speech recognition and text-to-speech generation

US 7,577,569 B2
Filed: 09/24/2004
Issued: 08/18/2009
Est. Priority Date: 09/05/2001
Status: Active Grant

Abstract

Text-to-speech (TTS) generation is used in conjunction with large vocabulary speech recognition to say words selected by the speech recognition. The software for performing the large vocabulary speech recognition can share speech modeling data with the TTS software. TTS or recorded audio can be used to automatically say both recognized text and the names of recognized commands after their recognition. The TTS can automatically repeat text recognized by the speech recognition after each of a succession of end of utterance detections. A user can move a cursor back or forward in recognized text, and the TTS can speak one or more words at the cursor location after each such move. The speech recognition can be used to produce a choice list of possible recognition candidates and the TTS can be used to provide spoken output of one or more of the candidates on the choice list.

33 Claims

1. A computing device for performing large vocabulary speech recognition comprising:
- processor readable memory;
  
  one or more processors capable of executing program instructions read from said memory;
  
  a microphone or audio input for providing an electronic signal representing an utterance to be recognized;
  
  a speaker or audio output for enabling an electronic representation of sound produced in said device to be transduced into a corresponding sound;
  
  programming recorded in the memory including:
  
  speech recognition programming for performing large vocabulary speech recognition that responds to the electronic representations of a sequence of one or more utterances received from the microphone or audio input by producing a text output corresponding to the one or more words recognized as corresponding to the utterances; and
  
  TTS programming for providing TTS output to said speaker or audio output saying one or more words of said text recognized by said speech recognition;
  
  shared speech modeling data stored in said memory that is used by said speech recognition programming to recognize words corresponding to spoken utterances and by said TTS programming to generate sounds corresponding to the speaking of a sequence of one or more words; and
  
  wherein the computing device is capable of responding to text navigation commands by moving a cursor backward and forward in the one or more words of said text output, and responding to each movement in response to one of said text navigation commands by providing a TTS output to said speaker or audio output saying one or more words either starting or ending with the location of the cursor after each of said movements.
- - 2. A computing device as in claim 1 wherein said shared speech modeling data includes letter to sound rules for use in deriving phonetic spellings from the textual spellings of words or names.
  - 3. A computing device as in claim 1 wherein said shared speech modeling data includes a textual spelling and one or more corresponding phonetic spellings for each of at least two thousand vocabulary words.
  - 4. A computing device as in claim 3 wherein said stored textual and phonetic spelling data includes data indicating which of different phonetic spellings stored in correspondence with the textual spelling of each of certain ones of said vocabulary words is most likely appropriate when such a word occurs in a given linguistic context.
  - 5. A computing device as in claim 4 wherein:
    - the data indicating which of different phonetic spellings is most likely appropriate when a given vocabulary word occurs in a given linguistic context provides such indication based, at least in part, on the more likely part of speech associated with the occurrence of the given word; and
      
      said shared speech modeling data includes language modeling information indicating which parts of speech for one or more words are more likely to occur in a given language context.
  - 6. A computing device as in claim 1 wherein the device is a handheld device.
  - 7. A computing device as in claim 6 wherein the device is a cell phone.
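
Claims 2 to 5 describe speech modeling data shared by the recognizer and the TTS programming: textual spellings stored with one or more phonetic spellings, plus data indicating which phonetic spelling is most likely given the part of speech expected in context. The Python sketch below illustrates one possible shape for such data. The class and function names, the phoneme strings, and the one-rule part-of-speech guess are all hypothetical stand-ins, not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    spelling: str
    # Maps a part-of-speech tag to a phonetic spelling (illustrative strings).
    pronunciations: dict = field(default_factory=dict)


def guess_part_of_speech(previous_word):
    # Toy stand-in for language modeling data: expect a verb after "to".
    return "verb" if previous_word == "to" else "noun"


class SharedLexicon:
    """One lexicon consulted by both the recognizer and the TTS front end."""

    def __init__(self, entries):
        self._by_spelling = {entry.spelling: entry for entry in entries}

    def all_pronunciations(self, word):
        """Recognizer side: every phonetic spelling that may match the audio."""
        return sorted(set(self._by_spelling[word].pronunciations.values()))

    def pronounce(self, word, previous_word=None):
        """TTS side: pick the phonetic spelling most likely in this context."""
        entry = self._by_spelling[word]
        pos = guess_part_of_speech(previous_word)
        if pos in entry.pronunciations:
            return entry.pronunciations[pos]
        # Fall back to the first stored phonetic spelling.
        return next(iter(entry.pronunciations.values()))


lexicon = SharedLexicon([
    LexiconEntry("record", {"noun": "R EH K ER D", "verb": "R IH K AO R D"}),
    LexiconEntry("phone", {"noun": "F OW N"}),
])
```

Here `lexicon.pronounce("record", "to")` selects the verb form while `lexicon.pronounce("record", "the")` selects the noun form, and the recognizer can still match either one through `all_pronunciations`.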

8. A computing device for performing large vocabulary speech recognition comprising:
- computer readable memory;
  
  one or more processors capable of executing program instructions read from said memory;
  
  a microphone or audio input for providing an electronic signal representing an utterance to be recognized;
  
  a speaker or audio output for enabling an electronic representation of sound produced in said device to be transduced into a corresponding sound; and
  
  programming recorded in the memory including instructions for:
  
  performing large vocabulary speech recognition upon electronic representations of utterances received from the microphone or audio input, including responding to certain utterances as text words which are supplied to a text output and responding to other utterances as recognized commands;
  
  providing TTS output to said speaker or audio output saying one or more words of said text output; and
  
  providing TTS or recorded audio output to said speaker or audio output saying the name of a recognized command.
- - 9. A computing device as in claim 8 wherein the device is a handheld device.
  - 10. A computing device as in claim 9 wherein the device is a cell phone.
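
Claim 8 distinguishes utterances recognized as text words from utterances recognized as commands, with audio output saying either the text or the name of the command. A minimal Python sketch of that routing follows. The command table and the `speak` callback are assumptions for illustration; a real device would hand the string to a TTS engine or play recorded audio.

```python
# Hypothetical command set: recognized phrase -> command name to be spoken.
COMMANDS = {"delete that": "Delete That", "new paragraph": "New Paragraph"}


def handle_recognition(recognized, text_output, speak):
    """Route one recognized utterance as a command or as dictated text."""
    key = recognized.lower()
    if key in COMMANDS:
        speak(COMMANDS[key])  # say the name of the recognized command
        return "command"
    text_output.append(recognized)  # supply the words to the text output
    speak(recognized)  # say the recognized text
    return "text"


spoken = []    # stands in for the speaker or audio output
document = []  # stands in for the text output
handle_recognition("hello world", document, spoken.append)
handle_recognition("New Paragraph", document, spoken.append)
```

After these two calls only "hello world" has been added to `document`, while both the text and the command name have been passed to `speak`.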

11. A computing device for performing large vocabulary speech recognition comprising:
- computer readable memory;
  
  one or more processors capable of executing program instructions read from said memory;
  
  a microphone or audio input for providing an electronic signal representing an utterance to be recognized;
  
  a speaker or audio output for enabling an electronic representation of sound produced in said device to be transduced into a corresponding sound; and
  
  programming recorded in the memory including instructions for:
  
  performing large vocabulary speech recognition that responds to the electronic representations of each of a sequence of one or more utterances received from the microphone or audio input by:
  
  selecting as a best scoring recognition candidate the one or more words recognized by the speech recognition as corresponding to the utterance;
  
  detecting the end of the utterance; and
  
  then responding to the detection of the end of utterance by providing TTS output to said speaker or audio output saying the one or more words of said best scoring recognition candidate for the utterance, whereby the device can generate audio feedback on the one or more words recognized for each of a succession of large vocabulary speech utterances at the end of each such utterance.
- - 12. A computing device as in claim 11 wherein said speech recognition is discrete speech recognition and said TTS output says the text word which is recognized in response to each utterance.
  - 13. A computing device as in claim 11 wherein said speech recognition is continuous speech recognition and said TTS output says the one or more text words recognized in response to each utterance after the end of the utterance.
  - 14. A computing device as in claim 11 wherein the device is a handheld device.
  - 15. A computing device as in claim 14 wherein the device is a cell phone.
  - 16. A computing device as in claim 11 wherein:
    - said device has a display;
      
      said recorded programming instructions include instructions for:
      
      causing said best scoring recognition candidates to be shown on said display as said utterances are recognized; and
      
      enabling a user to select whether or not to have said audio feedback generated at the end of each such utterance.
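
The sequence in claims 11 to 16 (select the best scoring candidate, detect the end of the utterance, then speak the result, with feedback optionally switched off by the user) can be sketched in Python as follows. The energy-threshold end-of-utterance detector, the parameter names, and the toy recognizer are assumptions made for this illustration; the claims do not specify how the end of an utterance is detected.

```python
def echo_utterances(frame_energies, recognize, speak,
                    threshold=0.1, silence_frames=3, audio_feedback=True):
    """Segment a frame stream into utterances and echo each best candidate."""
    current, quiet, results = [], 0, []
    for energy in frame_energies:
        if energy >= threshold:
            current.append(energy)  # speech frame: extend the utterance
            quiet = 0
        elif current:
            quiet += 1
            if quiet >= silence_frames:  # end of utterance detected
                candidates = recognize(current)
                best = max(candidates, key=lambda c: c[1])[0]
                results.append(best)
                if audio_feedback:  # user-selectable, as in claim 16
                    speak(best)
                current, quiet = [], 0
    # Speech not yet followed by enough silence is left unrecognized.
    return results


def toy_recognizer(frames):
    # Hypothetical scorer: short utterances favor "yes", long ones "good morning".
    return [("yes", 1.0 / len(frames)), ("good morning", len(frames) / 10.0)]


stream = [0.5, 0.6, 0, 0, 0, 0.4, 0.5, 0.6, 0.5, 0.4, 0, 0, 0]
spoken = []
recognized = echo_utterances(stream, toy_recognizer, spoken.append)
```

With this stream the two-frame utterance scores best as "yes" and the five-frame utterance as "good morning", and each is spoken only once its trailing silence has been observed.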

17. A computing device for performing large vocabulary speech recognition comprising:
- computer readable memory;
  
  one or more processors capable of executing program instructions read from said memory;
  
  a microphone or audio input for providing an electronic signal representing an utterance to be recognized;
  
  a speaker or audio output for enabling an electronic representation of sound produced in said device to be transduced into a corresponding sound; and
  
  programming recorded in the memory including instructions for:
  
  performing large vocabulary speech recognition upon an electronic representation of utterances received from the microphone or audio input to produce a text output;
  
  responding to text navigation commands by moving a cursor backward and forward in the one or more words of said text output; and
  
  responding to each movement in response to one of said navigational commands by providing a TTS output to said speaker or audio output saying one or more words either starting or ending with the location of the cursor after each of said movements.
- - 18. A computing device as in claim 17 wherein said programming further includes instructions for responding to a selection extension command by:
    - recording the cursor location at the time the command is received as a selection start;
      
      starting a selection at the selection start; and
      
      entering a selection extension mode in which the response to one of said navigational commands further includes causing the selection to extend from the selection start to the cursor location after the cursor movement made in response to said navigation command.
  - 19. A computing device as in claim 18 wherein said programming further includes instructions for responding to a play selection command by providing a TTS output to said speaker or audio output saying the one or more words that are currently in the selection.
  - 20. A computing device as in claim 17 wherein said saying of one or more words starts speaking words of said text starting at the current cursor position and continues speaking them until an end of a unit of text larger than a word is reached or until a user input is received to terminate such playback.
  - 21. A computing device as in claim 17 wherein the device is a handheld device.
  - 22. A computing device as in claim 21 wherein the device is a cell phone.
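
Claims 17 to 19 describe spoken feedback during cursor navigation, a selection that extends as the cursor moves, and playback of the selection. The Python sketch below is one simplified way to model this. It assumes the cursor sits on a word rather than between words and that the selection includes both end words; the class and method names are hypothetical.

```python
class SpokenTextBuffer:
    """Recognized text with a word-level cursor and spoken feedback."""

    def __init__(self, words, speak):
        self.words = list(words)
        self.cursor = 0              # index of the word at the cursor
        self.selection_start = None  # set by the selection extension command
        self.speak = speak

    def move(self, delta):
        """Navigation command: move by delta words, then say the word there."""
        if not self.words:
            return
        self.cursor = max(0, min(len(self.words) - 1, self.cursor + delta))
        self.speak(self.words[self.cursor])

    def start_selection(self):
        """Selection extension command: record the cursor as selection start."""
        self.selection_start = self.cursor

    def selection(self):
        """Words from the selection start to the current cursor, inclusive."""
        if self.selection_start is None:
            return []
        low, high = sorted((self.selection_start, self.cursor))
        return self.words[low:high + 1]

    def play_selection(self):
        """Play selection command: say the words currently selected."""
        self.speak(" ".join(self.selection()))


spoken = []
buffer = SpokenTextBuffer("the quick brown fox".split(), spoken.append)
buffer.move(1)            # cursor on "quick"
buffer.start_selection()
buffer.move(1)            # cursor on "brown"
buffer.move(1)            # cursor on "fox"
buffer.play_selection()
```

Each `move` produces one spoken word, and `play_selection` then speaks the span from the selection start to the cursor.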

23. A computing device for performing large vocabulary speech recognition comprising:
- computer readable memory;
  
  one or more processors capable of executing program instructions read from said memory;
  
  a microphone or audio input for providing an electronic signal representing an utterance to be recognized;
  
  a speaker or audio output for enabling an electronic representation of sound produced in said device to be transduced into a corresponding sound;
  
  programming recorded in the memory including instructions for:
  
  performing large vocabulary speech recognition upon electronic representations of uttered words received from the microphone or audio input to produce a choice list of recognition candidates, each comprised of a sequence of one or more words, selected by the recognition as scoring best against said uttered sound;
  
  using text-to-speech technology to provide spoken output to said speaker or audio output saying a plurality of the recognition candidates in the choice list;
  
  enabling the user to select one recognition candidate from among the plurality of such candidates said by said text-to-speech technology.
- - 24. A computing device as in claim 23 wherein said programming includes instructions for:
    - responding to choice navigation commands by changing which of the recognition candidates in the list of choices is currently selected; and
      
      responding to each change in the currently selected recognition candidate in response to one of said navigational commands by causing said text-to-speech technology to provide spoken output saying the one or more words in the recognition candidate that is currently selected after said change.
  - 25. A computing device as in claim 23 wherein:
    - said text-to-speech technology says the words of a plurality of recognition candidates in said list and contains a spoken indication of a choice input signal associated with each of said plurality of commands; and
      
      said programming further includes instructions for responding to receipt of one of said choice input signals by selecting the associated recognition candidate as the output for said uttered sound.
  - 26. A computing device as in claim 25 wherein:
    - said device has a telephone keypad; and
      
      said choice input signals include phone key numbers; and
      
      said responding to receipt of one of said choice input signals includes responding to the pressing of numbered phone keys as said choice input signals.
  - 27. A computing device as in claim 25 wherein said text-to-speech technology says the best scoring recognition candidate first.
  - 28. A computing device as in claim 23 wherein said programming includes instructions for responding to the receipt of filtering input by:
    - producing a filtered choice list of filtered recognition candidates, each comprised of a sequence of one or more words that agree with said filtering input and which have been selected by the recognition based on recognition scores against said uttered sound; and
      
      using said text-to-speech technology to provide spoken output to said speaker or audio output saying the one or more words of one of the recognition candidates in the filtered choice list.
  - 29. A computing device as in claim 28 wherein said programming further includes instructions for using said text-to-speech technology to provide spoken output saying the current value of the filter.
  - 30. A computing device as in claim 29 wherein the filtering input is a sequence of letters and said text-to-speech spoken output says the letters in the filter sequence.
  - 31. A computing device as in claim 23 wherein the text-to-speech spoken output includes the spelling of one or more choices.
  - 32. A computing device as in claim 23 wherein the device is a handheld device.
  - 33. A computing device as in claim 32 wherein the device is a cell phone.
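
Claims 23 to 30 describe a spoken choice list: the best scoring candidates are read aloud, each paired with a choice input such as a phone key number, and a letter filter narrows the list and is itself spoken. A minimal Python sketch under stated assumptions follows. The prompt wording, the prefix-match filter, and the function name are hypothetical; the claims only require that filtered candidates agree with the filtering input.

```python
def spoken_choice_list(candidates, filter_prefix="", max_choices=3):
    """Return (prompts, key_map) for the best candidates matching the filter."""
    prefix = filter_prefix.lower()
    matching = [c for c in candidates if c[0].lower().startswith(prefix)]
    # Best scoring candidate first, as in claim 27.
    ranked = sorted(matching, key=lambda c: c[1], reverse=True)[:max_choices]
    prompts, key_map = [], {}
    if filter_prefix:
        # Say the current filter letter by letter, as in claim 30.
        prompts.append("filter " + " ".join(filter_prefix))
    for number, (text, _score) in enumerate(ranked, start=1):
        prompts.append(f"press {number} for {text}")
        key_map[str(number)] = text  # phone key -> recognition candidate
    return prompts, key_map


candidates = [("write", 0.7), ("right", 0.9), ("rite", 0.4), ("light", 0.6)]
prompts, keys = spoken_choice_list(candidates)
filtered_prompts, filtered_keys = spoken_choice_list(candidates, "ri")
```

Each prompt string would be passed to the TTS engine, and a key press is resolved by looking it up in the returned map, for example `keys["2"]`.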

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Voice Signal Technologies Incorporated (Microsoft Corporation)
Inventors: Grabherr, Manfred G.; Roth, Daniel L.; Johnston, David F.; Cohen, Jordan R.; Porter, Edward W.
Primary Examiner(s): Chawan; Vijay B
Application Number: US10/949,991
Publication Number: US 20050038657A1
Time in Patent Office: 1,789 Days
Field of Search: 704/235, 704/260, 704/270, 704/251, 704/231, 704/256, 704/240, 704/255
US Class Current: 704/260
CPC Class Codes:
G10L 13/08   Text analysis or generation...
G10L 15/187   Phonemic context, e.g. pron...
G10L 15/19   Grammatical context, e.g. d...
