Statistical pronunciation model for text to speech

US 20030191645A1
Filed: 04/05/2002
Published: 10/09/2003
Est. Priority Date: 04/05/2002
Status: Abandoned Application

Abstract

An arrangement is provided for speech synthesis using statistical pronunciation models established based on annotated training data. When input text is received, pronunciations of words in the input text are determined based on the use of relevant statistical pronunciation models. The speech signal corresponding to the input text is then synthesized using the determined pronunciations.
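The flow described in the abstract (train on annotated data, determine pronunciations, synthesize) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the training tuples, the use of the previous word as the only contextual feature, the ARPAbet-style phoneme strings, and the `synthesize` stand-in, which returns phoneme strings instead of producing audio. The application does not prescribe any of these choices.

```python
from collections import Counter, defaultdict

# Hypothetical annotated training data: (previous word, word, pronunciation).
ANNOTATED = [
    ("will", "read", "R IY D"),
    ("to", "read", "R IY D"),
    ("have", "read", "R EH D"),
    ("has", "read", "R EH D"),
    ("have", "read", "R EH D"),
]

def train(annotated):
    """Count pronunciations per (word, previous-word) context."""
    counts = defaultdict(Counter)
    for prev, word, pron in annotated:
        counts[(word, prev)][pron] += 1
    return counts

def determine(model, prev, word):
    """Return the most frequent pronunciation seen in this context, else None."""
    seen = model.get((word, prev))
    return seen.most_common(1)[0][0] if seen else None

def synthesize(text, model):
    """Stand-in for synthesis: returns the phoneme string chosen for each word.

    Words without a relevant model fall back to their upper-cased spelling."""
    words = text.lower().split()
    out = []
    for i, word in enumerate(words):
        prev = words[i - 1] if i > 0 else "<s>"
        out.append(determine(model, prev, word) or word.upper())
    return out

model = train(ANNOTATED)
print(synthesize("I have read it", model))  # ['I', 'HAVE', 'R EH D', 'IT']
```

A real system would pass the selected phoneme sequences to a text-to-speech engine; the list returned here only marks where that hand-off would occur.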

174 Citations

29 Claims

1. A method, comprising:
- establishing at least one statistical pronunciation model based on annotated training data;
  
  receiving input text;
  
  determining a pronunciation of a word in the input text based on zero or more of the statistical pronunciation models; and
  
  synthesizing speech signal corresponding to the input text through synthesizing the acoustic signal of each word in the input text using the pronunciation determined in said determining.
- - 2. The method according to claim 1, wherein said establishing at least one statistical pronunciation model comprises:
    - retrieving the annotated training data wherein words are annotated in terms of their pronunciations taking into account the context of the words;
      
      performing statistical analysis of the annotated training data with respect to the context of words; and
      
      building a statistical pronunciation model for each pronunciation of annotated words in the annotated training data based on the statistical analysis.
  - 3. The method according to claim 1, wherein said determining the pronunciation of a word comprises:
    - analyzing the input text to determine context of the word in the input text;
      
      selecting a pronunciation of the word according to a statistical pronunciation model of the word that is relevant to the context.
  - 4. The method according to claim 3, wherein said selecting the pronunciation includes determining the pronunciation according to at least one pronunciation rule.
  - 5. The method according to claim 4, further comprising retrieving the selected pronunciation from a dictionary prior to said synthesizing.
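Claims 3 to 5 combine three sources of information: a statistical model relevant to the context, at least one pronunciation rule, and a dictionary from which the selected pronunciation is retrieved. The sketch below shows one possible arrangement, assuming that rules act as a fallback when no model matches the context. That ordering, the `MODELS` and `DICTIONARY` contents, the feature strings such as `prev=to`, and the `rule_default` helper are all hypothetical; the claims do not say how models and rules are combined.

```python
# Hypothetical dictionary: pronunciation label -> phoneme sequence.
DICTIONARY = {
    "lead/verb": ["L", "IY", "D"],
    "lead/noun": ["L", "EH", "D"],
    "cat": ["K", "AE", "T"],
}

# Hypothetical statistical models: (word, context feature) -> {label: probability}.
MODELS = {
    ("lead", "prev=to"): {"lead/verb": 0.9, "lead/noun": 0.1},
    ("lead", "prev=the"): {"lead/verb": 0.3, "lead/noun": 0.7},
}

def rule_default(word, feature):
    """Hypothetical rule: use the word itself as the label when the
    dictionary has an entry under exactly that spelling."""
    return word if word in DICTIONARY else None

RULES = [rule_default]

def select_label(word, feature):
    """Pick a label from a relevant model if one exists, otherwise from a rule."""
    model = MODELS.get((word, feature))
    if model:  # a model relevant to this context exists
        return max(model, key=model.get)
    for rule in RULES:  # zero models applied: fall back to the rules
        label = rule(word, feature)
        if label is not None:
            return label
    return None

def pronounce(word, feature):
    """Select a label, then retrieve its phonemes from the dictionary."""
    label = select_label(word, feature)
    return DICTIONARY.get(label)

print(pronounce("lead", "prev=to"))   # ['L', 'IY', 'D']
print(pronounce("lead", "prev=the"))  # ['L', 'EH', 'D']
print(pronounce("cat", "prev=a"))     # ['K', 'AE', 'T']
```

Because `select_label` may consult no model at all, the sketch also covers the "zero or more" wording of claim 1: a word with no relevant model is still pronounced if a rule and a dictionary entry apply.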

6. A method to establish a statistical pronunciation model, comprising:
- retrieving annotated training data wherein words are annotated in terms of their pronunciations taking into account the context of the words;
  
  performing statistical analysis of the annotated training data with respect to the context; and
  
  building a statistical pronunciation model for each pronunciation of the annotated words in the annotated training data based on the statistical analysis.
- - 7. The method according to claim 6, further comprising generating the annotated training data prior to said retrieving.
  - 8. The method according to claim 7, wherein said generating includes:
    - collecting training data;
      
      determining contextual features of words in the training data whose pronunciations are to be annotated;
      
      annotating the words in the training data in terms of their pronunciations with respect to their relevant contextual features to generate the annotated training data.
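Claim 6 calls for one statistical model per pronunciation, built from a statistical analysis of context. The claims do not name a particular model family, so the sketch below substitutes a naive Bayes classifier with add-one smoothing as one plausible reading. The training records, the feature names, and the functions `build_models`, `score`, and `best_pronunciation` are assumptions introduced for illustration only.

```python
import math
from collections import Counter, defaultdict

# Hypothetical annotated data: (word, pronunciation, contextual features).
DATA = [
    ("bass", "B EY S", ["next=guitar", "pos=noun"]),
    ("bass", "B EY S", ["next=player", "pos=noun"]),
    ("bass", "B AE S", ["prev=caught", "pos=noun"]),
    ("bass", "B AE S", ["next=fishing", "pos=noun"]),
    ("bass", "B AE S", ["prev=caught", "next=fishing"]),
]

def build_models(data):
    """Build one model per (word, pronunciation): how often the pronunciation
    occurs and how often each contextual feature accompanies it."""
    models = defaultdict(lambda: {"n": 0, "features": Counter()})
    for word, pron, feats in data:
        m = models[(word, pron)]
        m["n"] += 1
        m["features"].update(feats)
    return dict(models)

def score(model, feats, total, vocab_size):
    """Log-probability of a pronunciation given features (add-one smoothing)."""
    logp = math.log(model["n"] / total)
    denom = sum(model["features"].values()) + vocab_size
    for f in feats:
        logp += math.log((model["features"][f] + 1) / denom)
    return logp

def best_pronunciation(models, word, feats):
    """Return the highest-scoring pronunciation for `word`, or None if unseen."""
    candidates = {p: m for (w, p), m in models.items() if w == word}
    if not candidates:
        return None
    total = sum(m["n"] for m in candidates.values())
    vocab = {f for m in candidates.values() for f in m["features"]}
    return max(candidates,
               key=lambda p: score(candidates[p], feats, total, len(vocab)))

models = build_models(DATA)
print(best_pronunciation(models, "bass", ["next=guitar"]))  # B EY S
print(best_pronunciation(models, "bass", ["prev=caught"]))  # B AE S
```

Keeping a separate count table per pronunciation mirrors the claim language, and smoothing keeps an unseen feature from zeroing out a candidate.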

9. A method of synthesizing speech data, comprising:
- receiving input text;
  
  analyzing the input text to identify contextual features of words in the input text;
  
  determining a pronunciation of each word according to a statistical pronunciation model of the word relevant to the contextual features of the word; and
  
  synthesizing acoustic signal of the word based on the pronunciation.
- - 10. The method according to claim 9, wherein said selecting the pronunciation includes determining the pronunciation according to at least one pronunciation rule.
  - 11. The method according to claim 10, further comprising retrieving the determined pronunciation from a dictionary prior to said synthesizing.

12. A system, comprising:
- a statistical pronunciation modeling mechanism for establishing at least one statistical pronunciation model based on annotated training data; and
  
  a speech synthesis mechanism for synthesizing speech from input text based on the statistical pronunciation models.
- - 13. The system according to claim 12, wherein the statistical pronunciation modeling mechanism comprises:
    - a context sensitive pronunciation annotation mechanism for generating the annotated training data; and
      
      a statistical pronunciation model generation mechanism for creating the statistical pronunciation models based on the annotated training data.
  - 14. The system according to claim 12, further comprising:
    - at least one pronunciation rule for governing the determination of a pronunciation of a word in the input text; and
      
      a dictionary for storing a plurality of pronunciations.
  - 15. The system according to claim 14, wherein the speech synthesis mechanism comprises:
    - a text processing mechanism for processing the input text to identify contextual features;
      
      a pronunciation determiner for determining a pronunciation of each word in the input text according to a statistical pronunciation model relevant to the contextual features and the pronunciation rules; and
      
      a text to speech engine for producing acoustic signal for each word in the input text using the pronunciation of each word, retrieved from the dictionary, to generate the speech of the input text.

16. A statistical pronunciation modeling mechanism, comprising:
- a context sensitive pronunciation annotation mechanism for generating annotated training data in which words are annotated with their pronunciations; and
  
  a statistical pronunciation model generation mechanism for creating statistical pronunciation models based on the annotated training data.
- - 17. The mechanism according to claim 16, wherein the context sensitive pronunciation annotation mechanism comprises:
    - a context identifier for identifying relevant context in training data;
      
      a contextual feature identifier for identifying relevant contextual features related to the context; and
      
      a pronunciation annotation mechanism for annotating pronunciations of words in the training data with respect to the contextual features to generate the annotated training data.
  - 18. The mechanism according to claim 16, wherein the statistical pronunciation model generation mechanism comprises:
    - a statistical analysis mechanism for performing statistical analysis on the annotated training data; and
      
      a statistical pronunciation construction mechanism for generating statistical pronunciation models based on the statistical analysis results performed on the annotated training data.
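The three parts of the annotation mechanism in claim 17 (context identifier, contextual feature identifier, pronunciation annotation) can be pictured as three small functions. This is a sketch under stated assumptions: context is taken to be a fixed window of neighbouring tokens, features are `prev=`/`next=` strings, and the pronunciation is supplied by a human annotator through the `pronunciation` argument. None of these details come from the application, and the function names are hypothetical.

```python
def identify_context(tokens, index, window=1):
    """Context identifier: the tokens within `window` positions of the target."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left, right

def identify_features(left, right):
    """Contextual feature identifier: name the nearest neighbour on each side."""
    feats = [f"prev={w}" for w in left[-1:]]
    feats += [f"next={w}" for w in right[:1]]
    return feats

def annotate(sentence, target, pronunciation, window=1):
    """Annotate each occurrence of `target` with a pronunciation and features."""
    tokens = sentence.lower().split()
    records = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left, right = identify_context(tokens, i, window)
            records.append({
                "word": tok,
                "pronunciation": pronunciation,  # assumed to come from an annotator
                "features": identify_features(left, right),
            })
    return records

records = annotate("He caught a bass yesterday", "bass", "B AE S")
print(records[0]["features"])  # ['prev=a', 'next=yesterday']
```

Records of this shape are what the model generation mechanism of claim 18 would consume for its statistical analysis.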

19. A speech synthesis mechanism, comprising:
- a text processing mechanism for processing the input text to identify contextual features;
  
  a pronunciation determiner for determining a pronunciation of each word in the input text according to a statistical pronunciation model and the pronunciation rules relevant to the contextual features; and
  
  a text to speech engine for producing acoustic signal for each word in the input text using the pronunciation of each word, retrieved from a dictionary, to generate the speech of the input text.
- - 20. The mechanism according to claim 19, further comprising:
    - a statistical pronunciation model retrieval mechanism for retrieving a statistical pronunciation model based on the contextual features; and
      
      a pronunciation rule retrieval mechanism for retrieving the pronunciation rules relevant to the contextual features.

21. A machine-accessible medium encoded with data, the data, when accessed, causing:
- establishing at least one statistical pronunciation model based on annotated training data;
  
  receiving input text;
  
  determining a pronunciation of a word in the input text based on at least some of the statistical pronunciation models; and
  
  synthesizing speech signal corresponding to the input text through synthesizing the acoustic signal of each word in the input text using the pronunciation determined in said determining.
- - 22. The medium according to claim 21, wherein said establishing at least one statistical pronunciation model comprises:
    - retrieving the annotated training data wherein words are annotated in terms of their pronunciations taking into account the context of the words;
      
      performing statistical analysis of the annotated training data with respect to the context of words; and
      
      building a statistical pronunciation model for each pronunciation of annotated words in the annotated training data based on the statistical analysis.
  - 23. The medium according to claim 21, wherein said determining the pronunciation of a word comprises:
    - analyzing the input text to determine context of the word in the input text;
      
      selecting a pronunciation of the word according to zero or more statistical pronunciation models of the word and pronunciation rules that are relevant to the context.

24. A machine-accessible medium encoded with data for establishing a statistical pronunciation model, the data, when accessed, causing:
- retrieving annotated training data wherein words are annotated in terms of their pronunciations taking into account the context of the words;
  
  performing statistical analysis of the annotated training data with respect to the context; and
  
  building a statistical pronunciation model for each pronunciation of the annotated words in the annotated training data based on the statistical analysis.
- - 25. The medium according to claim 24, the data, when accessed, further causing generating the annotated training data prior to said retrieving.
  - 26. The medium according to claim 25, wherein said generating includes:
    - collecting training data;
      
      determining contextual features of words in the training data whose pronunciations are to be annotated;
      
      annotating the words in the training data in terms of their pronunciations with respect to their relevant contextual features to generate the annotated training data.

27. A machine-accessible medium encoded with data for synthesizing speech data, the data, when accessed, causing:
- receiving input text;
  
  analyzing the input text to identify contextual features of words in the input text;
  
  determining a pronunciation of each word according to a statistical pronunciation model of the word relevant to the contextual features of the word; and
  
  synthesizing acoustic signal of the word based on the pronunciation.
- - 28. The medium according to claim 27, wherein said selecting the pronunciation includes determining the pronunciation according to at least one pronunciation rule.
  - 29. The medium according to claim 28, the data, when accessed, further causing retrieving the determined pronunciation from a dictionary prior to said synthesizing.

Current Assignee: Intel Corporation
Original Assignee: Intel Corporation
Inventors: Zhou, Guojun
Application Number: US10/115,935
Publication Number: US 20030191645A1
US Class Current: 704/260
CPC Class Codes:
G10L 13/08 Text analysis or generation...
G10L 15/183 using context dependencies,...
