Determining text to speech pronunciation based on an utterance from a user

US 8,275,621 B2
Filed: 05/18/2011
Issued: 09/25/2012
Est. Priority Date: 03/31/2008
Status: Active Grant

Abstract

Systems and methods are provided for automatically building a native phonetic lexicon for a speech-based application trained to process a native (base) language, wherein the native phonetic lexicon includes native phonetic transcriptions (base forms) for non-native (foreign) words which are automatically derived from non-native phonetic transcriptions of the non-native words.
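As a rough illustration of the abstract, a native base form could be derived from a non-native transcription by mapping each foreign phoneme to one or more native phonemes. The sketch below is an assumption, not the patented method: the table `FR_TO_EN`, the phoneme symbols, and the function `to_native_baseform` are all hypothetical.

```python
# Hypothetical mapping from French phoneme symbols to approximate English ones.
# Symbols and pairings are illustrative assumptions, not taken from the patent.
FR_TO_EN = {
    "R": ["r"],          # uvular r -> English r
    "y": ["uw"],         # front rounded vowel -> nearest English back vowel
    "a~": ["aa", "n"],   # nasal vowel -> oral vowel plus nasal consonant
}


def to_native_baseform(foreign_phones, mapping=FR_TO_EN):
    """Map each non-native phoneme to native phonemes; pass unknown ones through."""
    native = []
    for phone in foreign_phones:
        native.extend(mapping.get(phone, [phone]))
    return native
```

A real system would likely need context-dependent rules rather than a one-to-one table; the table only shows the direction of the derivation.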

20 Claims

1. A speech-based system comprising:
- at least one storage device that stores:
  - an input text comprising a plurality of words of a first language;
  - information indicative of a first pronunciation of a first word of the plurality of words of the first language and information indicative of a first pronunciation of a second word of the plurality of words of the first language, wherein the first pronunciation of the first word and the first pronunciation of the second word both comprise a first type of pronunciation;
  - information indicative of a second pronunciation of the first word of the plurality of words of the first language and information indicative of a second pronunciation of the second word of the plurality of words of the first language, wherein the second pronunciation of the first word and the second pronunciation of the second word both comprise a second type of pronunciation that is different than the first type of pronunciation;
- an automatic speech recognition (ASR) system configured to:
  - receive at least one utterance from a user, the utterance comprising at least the first word of the plurality of words of the first language; and
  - determine a type of pronunciation the user used for the first word in the at least one utterance; and
- a text to speech (TTS) system configured to generate an audio speech output comprising the at least the second word of the plurality of words of the first language, and to determine a pronunciation of the second word in the audio speech output based, at least in part, on the type of pronunciation the ASR system determined the user used for the first word in the at least one utterance, wherein the second word is different from the first word.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The speech-based system of claim 1, wherein:
    - the first type of pronunciation is indicative of a speaker that natively speaks the first language; and
    - the second type of pronunciation is indicative of a speaker that does not natively speak the first language.
  - 3. The speech-based system of claim 1, wherein the TTS system uses a statistical model to process the input text and generate the audio speech output, the statistical model comprising a first weight associated with the first type of pronunciation and a second weight associated with the second type of pronunciation.
  - 4. The speech-based system of claim 3, wherein the first weight and/or the second weight are adjusted based on the at least one utterance from the user.
  - 5. The speech-based system of claim 4, wherein the at least one utterance comprises a plurality of utterances, wherein the first weight and/or the second weight is adjusted based on the plurality of utterances from the user such that at least the second word of the first language is pronounced by the TTS system using the type of pronunciation most commonly used by the user in the plurality of utterances.
  - 6. The speech-based system of claim 1, further comprising:
    - a text processing system configured to:
      - receive the input text; and
      - identify the plurality of words of the first language within the input text.
  - 7. The speech-based system of claim 6, wherein the text processing system identifies the plurality of words of the first language based, at least in part, on letter-sequences and/or accented characters.
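
The core of claim 1 can be sketched as a lexicon lookup: the type of pronunciation detected for one word the user spoke selects the pronunciation the TTS system uses for a different word. This is a minimal sketch under stated assumptions, not the claimed implementation. The lexicon contents, the phoneme strings, the type labels `native` and `non_native`, and the functions `detect_type` and `pronounce` are hypothetical, and a real ASR system would score acoustic evidence rather than compare strings exactly.

```python
# Hypothetical lexicon: every word stores one pronunciation per pronunciation type.
# Phoneme strings are illustrative placeholders, not data from the patent.
LEXICON = {
    "Paris": {"native": "p a R i", "non_native": "p ae r ih s"},
    "Lyon": {"native": "l j o~", "non_native": "l ay ax n"},
}


def detect_type(word, heard_phones, lexicon=LEXICON):
    """Return the pronunciation type whose stored form matches what was heard."""
    for ptype, phones in lexicon[word].items():
        if phones == heard_phones:
            return ptype
    return None  # no stored pronunciation matched


def pronounce(word, ptype, lexicon=LEXICON, default="native"):
    """Pick the TTS pronunciation of a word for the given pronunciation type."""
    forms = lexicon[word]
    return forms.get(ptype, forms[default])


# The user says "Paris" the non-native way; the output word "Lyon" follows suit.
user_type = detect_type("Paris", "p ae r ih s")
tts_form = pronounce("Lyon", user_type)
```

Falling back to a default type when nothing matches is a design choice of this sketch; the claims do not say what happens in that case.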

8. A method comprising acts, performed by at least one processor, of:
- storing information indicative of a first pronunciation of a first word of a first language and information indicative of a first pronunciation of a second word of the first language, wherein the first pronunciation of the first word and the first pronunciation of the second word both comprise a first type of pronunciation;
- storing information indicative of a second pronunciation of the first word of the first language and information indicative of a second pronunciation of the second word of the first language, wherein the second pronunciation of the first word and the second pronunciation of the second word both comprise a second type of pronunciation that is different than the first type of pronunciation;
- receiving, at an automatic speech recognition (ASR) system, at least one utterance from a user, the utterance comprising at least the first word of the first language;
- determining a type of pronunciation the user used for the first word in the at least one utterance; and
- generating, using a text to speech (TTS) system, an audio speech output that comprises at least the second word of the first language and that pronounces at least the second word using an audible pronunciation determined based, at least in part, on the type of pronunciation the user used for the first word in the at least one utterance, wherein the second word is different from the first word.
- Dependent claims (9, 10, 11, 12, 13, 14, 15):
  - 9. The method of claim 8, wherein:
    - the first type of pronunciation is indicative of a speaker that natively speaks the first language; and
    - the second type of pronunciation is indicative of a speaker that does not natively speak the first language.
  - 10. The method of claim 8, wherein the act of generating the audio output comprises using a statistical model to generate the audio output.
  - 11. The method of claim 10, further comprising acts of:
    - storing a first weight for use by the statistical model, the first weight associated with the first type of pronunciation; and
    - storing a second weight for use by the statistical model, the second weight associated with the second type of pronunciation.
  - 12. The method of claim 11, further comprising an act of:
    - adjusting the first and/or second weight based on the at least one utterance from the user.
  - 13. The method of claim 12, wherein the at least one utterance comprises a plurality of utterances, wherein adjusting the first weight and/or the second weight is based on the plurality of utterances from the user such that the audible pronunciation of the second word of the audio output is the type of pronunciation most commonly used by the user in the plurality of utterances.
  - 14. The method of claim 8, further comprising acts of:
    - receiving an input text comprising a plurality of words of the first language, wherein the plurality of words of the first language comprises the first word and the second word; and
    - identifying the plurality of words of the first language within the input text.
  - 15. The method of claim 14, wherein the act of identifying the plurality of words of the first language within the input text is based, at least in part, on letter-sequences and/or accented characters.
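
The weight adjustment in claims 11 to 13 can be pictured as counting the pronunciation types observed across several utterances and preferring the most common one. The following is a sketch under assumptions: the claims do not specify the statistical model, so normalized counts stand in for it here, and `update_weights`, `preferred_type`, the type labels, and the `smoothing` parameter are hypothetical.

```python
from collections import Counter

TYPES = ("native", "non_native")  # assumed labels for the two pronunciation types


def update_weights(observed_types, smoothing=1.0):
    """Turn the types observed in the user's utterances into normalized weights.

    `smoothing` is an assumed additive prior that keeps an unseen type above zero.
    """
    counts = Counter(observed_types)
    total = sum(counts[t] + smoothing for t in TYPES)
    return {t: (counts[t] + smoothing) / total for t in TYPES}


def preferred_type(weights):
    """Return the type with the largest weight (first listed type wins a tie)."""
    return max(weights, key=weights.get)


# Three utterances: two non-native pronunciations, one native.
weights = update_weights(["non_native", "native", "non_native"])
chosen = preferred_type(weights)
```

With the default smoothing, the three observations above give weights of 0.4 and 0.6, so the non-native type is chosen for subsequent output.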

16. At least one program storage device having encoded thereon executable program code that, when executed by at least one processor, performs a method comprising acts of:
- storing information indicative of a first pronunciation of a first word of a first language and information indicative of a first pronunciation of a second word of the first language, wherein the first pronunciation of the first word and the first pronunciation of the second word both comprise a first type of pronunciation;
- storing information indicative of a second pronunciation of the first word of the first language and information indicative of a second pronunciation of the second word of the first language, wherein the second pronunciation of the first word and the second pronunciation of the second word both comprise a second type of pronunciation that is different than the first type of pronunciation;
- receiving at least one utterance from a user, the utterance comprising at least the first word of the first language;
- determining a type of pronunciation the user used for the first word in the at least one utterance;
- determining a pronunciation of at least one second word of the first language based at least on the type of pronunciation the user used for the at least one first word, wherein the second word is different from the first word; and
- generating an audio speech output that comprises at least the second word of the first language and that pronounces at least the second word using an audible pronunciation determined based, at least in part, on the type of pronunciation the user used for the first word in the at least one utterance, wherein the second word is different from the first word.
- Dependent claims (17, 18, 19, 20):
  - 17. The at least one program storage device of claim 16, wherein:
    - the first type of pronunciation is indicative of a speaker that natively speaks the first language; and
    - the second type of pronunciation is indicative of a speaker that does not natively speak the first language.
  - 18. The at least one program storage device of claim 16, wherein the act of converting the at least one word of the first language uses a statistical model, and wherein the method further comprises acts of:
    - storing a first weight for use by the statistical model, the first weight associated with the first type of pronunciation;
    - storing a second weight for use by the statistical model, the second weight associated with the second type of pronunciation; and
    - adjusting the first and/or second weight based on the at least one utterance from the user.
  - 19. The at least one program storage device of claim 18, wherein the at least one utterance comprises a plurality of utterances, wherein adjusting the first weight and/or the second weight is based on the plurality of utterances from the user such that the audible pronunciation of the second word of the audio output is the type of pronunciation most commonly used by the user in the plurality of utterances.
  - 20. The at least one program storage device of claim 16, wherein the method further comprises acts of:
    - receiving an input text comprising a plurality of words of the first language, wherein the plurality of words of the first language comprises the first word and the second word; and
    - identifying the plurality of words of the first language within the input text, wherein identifying the plurality of words of the first language is based, at least in part, on letter-sequences and/or accented characters.
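
Identification based on letter-sequences and accented characters, as recited in claims 7, 15 and 20, might look like the heuristic below. It is only a sketch: the sequence list `FRENCH_SEQUENCES`, the choice of French as the example language, and the function names are assumptions, and the claims do not say which sequences or characters are used.

```python
import unicodedata

# Hypothetical letter sequences treated as hints that a word is French.
FRENCH_SEQUENCES = ("eau", "oux", "ille")


def has_accented_char(word):
    """True if any character decomposes into a base letter plus a combining mark."""
    decomposed = unicodedata.normalize("NFD", word)
    return any(unicodedata.combining(ch) for ch in decomposed)


def looks_foreign(word, sequences=FRENCH_SEQUENCES):
    """Flag a word by accented characters or by a known letter sequence."""
    lowered = word.lower()
    return has_accented_char(word) or any(seq in lowered for seq in sequences)


def identify_words(text):
    """Return the words of the input text flagged by either heuristic."""
    words = (w.strip(".,;:!?") for w in text.split())
    return [w for w in words if w and looks_foreign(w)]


found = identify_words("Navigate to Château Margaux via Bordeaux.")
```

Here `found` contains "Château" and "Bordeaux" but misses "Margaux", which shows how a short sequence list trades recall for simplicity.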

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Alewine, Neal J.; Janke, Eric William; Sharp, Paul; Sicconi, Robert
Primary Examiner(s): SKED, MATTHEW J
Application Number: US13/110,140
Publication Number: US 20110218806A1
Time in Patent Office: 496 Days
Field of Search: None
US Class Current: 704/260
CPC Class Codes:
G10L 13/08   Text analysis or generation...
G10L 15/063   Training
G10L 15/187   Phonemic context, e.g. pron...
