MULTI-LINGUAL TEXT-TO-SPEECH SYSTEM AND METHOD

US 20120173241A1
Filed: 08/25/2011
Published: 07/05/2012
Est. Priority Date: 12/30/2010
Status: Active Grant

First Claim

Patent Images

1. A multi-lingual text-to-speech system, comprising:

an acoustic-prosodic model selection module, for an inputted text to be synthesized and containing a second-language (L2) portion, and an L2 phonetic unit transcription corresponding to the L2 portion of the inputted text, sequentially finds a second acoustic-prosodic model corresponding to each phonetic unit of the L2 phonetic unit transcription in an L2 acoustic-prosodic model set, searches an L2-to-L1 phonetic unit transformation table, L1 being a first language, and uses at least a controllable accent weighting parameter to determine a transformation combination to select a corresponding L1 phonetic unit transcription and sequentially find a first acoustic-prosodic model corresponding to each phonetic unit of said L1 phonetic unit transcription in an L1 acoustic-prosodic model set;

an acoustic-prosodic model mergence module that merges said first and said second acoustic-prosodic models into a merged acoustic-prosodic model according to said at least a controllable accent weighting parameter, sequentially processes all the transformations in said transformation combination, then sequentially arranges each merged acoustic-prosodic model to generate a merged acoustic-prosodic model sequence; and

a speech synthesizer, wherein said merged acoustic-prosodic model sequence is applied to said speech synthesizer to synthesize said inputted text into an L2 speech with an L1 accent.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A multi-lingual text-to-speech system and method processes a text to be synthesized via an acoustic-prosodic model selection module and an acoustic-prosodic model mergence module, and obtains a phonetic unit transformation table. In an online phase, the acoustic-prosodic model selection module, according to the text and a phonetic unit transcription corresponding to the text, uses at least a set controllable accent weighting parameter to select a transformation combination and find a second and a first acoustic-prosodic models. The acoustic-prosodic model mergence module merges the two acoustic-prosodic models into a merged acoustic-prosodic model, according to the at least a controllable accent weighting parameter, processes all transformations in the transformation combination and generates a merged acoustic-prosodic model sequence. A speech synthesizer and the merged acoustic-prosodic model sequence are further applied to synthesize the text into an L1-accent L2 speech.

26 Citations

View as Search Results

14 Claims

1. A multi-lingual text-to-speech system, comprising:
- an acoustic-prosodic model selection module, for an inputted text to be synthesized and containing a second-language (L2) portion, and an L2 phonetic unit transcription corresponding to the L2 portion of the inputted text, sequentially finds a second acoustic-prosodic model corresponding to each phonetic unit of the L2 phonetic unit transcription in an L2 acoustic-prosodic model set, searches an L2-to-L1 phonetic unit transformation table, L1 being a first language, and uses at least a controllable accent weighting parameter to determine a transformation combination to select a corresponding L1 phonetic unit transcription and sequentially find a first acoustic-prosodic model corresponding to each phonetic unit of said L1 phonetic unit transcription in an L1 acoustic-prosodic model set;
  
  an acoustic-prosodic model mergence module that merges said first and said second acoustic-prosodic models into a merged acoustic-prosodic model according to said at least a controllable accent weighting parameter, sequentially processes all the transformations in said transformation combination, then sequentially arranges each merged acoustic-prosodic model to generate a merged acoustic-prosodic model sequence; and
  
  a speech synthesizer, wherein said merged acoustic-prosodic model sequence is applied to said speech synthesizer to synthesize said inputted text into an L2 speech with an L1 accent.
- View Dependent Claims (2, 3, 4, 5)
- - 2. The system as claimed in claim 1, wherein said L2-to-L1 phonetic unit transformation table is constructed in an offline phase via a phonetic unit transformation table construction module, according to an L1-accent L2 speech corpus and an L1 acoustic-prosodic model set.
  - 3. The system as claimed in claim 1, wherein said acoustic-prosodic model mergence module merges said second acoustic-prosodic model and said first acoustic-prosodic model into said merged acoustic-prosodic model by using a weight computation scheme.
  - 4. The system as claimed in claim 1, wherein said second acoustic-prosodic model and said first acoustic-prosodic model at least comprise an acoustic parameter.
  - 5. The system as claimed in claim 4, wherein said second acoustic-prosodic model and said first acoustic-prosodic model further comprise a duration parameter and a pitch parameter.

6. A multi-lingual text-to-speech system, executed on a computer system, said computer system having a memory device for storing at least a first and a second language acoustic-prosodic model sets, said multi-lingual text-to-speech system comprising:
- a processor having an acoustic-prosodic model selection module, an acoustic-prosodic model mergence module and a speech synthesizer, wherein for an inputted text to be synthesized and containing a second-language (L2) portion, and an L2 phonetic unit transcription corresponding to the L2 portion of the inputted text, said acoustic-prosodic model selection module sequentially finds a second acoustic-prosodic model corresponding to each phonetic unit of the L2 phonetic unit transcription in an L2 acoustic-prosodic model set, searches an L2-to-L1 phonetic unit transformation, L1 being a first language, and uses at least a controllable accent weighting parameter to determine a transformation combination to select a corresponding L1 phonetic unit transcription and sequentially find a first acoustic-prosodic model corresponding to each phonetic unit of said L1 phonetic unit transcription in an L1 acoustic-prosodic model set, said acoustic-prosodic model mergence module merges said first and said second acoustic-prosodic models into a merged acoustic-prosodic model according to said at least a controllable accent weighting parameter, sequentially processes all the transformations in said transformation combination, then sequentially arranges each merged acoustic-prosodic model to generate a merged acoustic-prosodic model sequence, and said merged acoustic-prosodic model sequence is further applied to said speech synthesizer to synthesize said inputted text into an L2 speech with an L1 accent.

7. A multi-lingual text-to-speech method, executed on a computer system, said computer system having a memory device for storing at least a first and a second language acoustic-prosodic model sets, said method comprising:
- for an inputted text with second-language (L2) and L2 phonetic sequence corresponding to said inputted text to be synthesized, finding a second acoustic-prosodic model corresponding to each phonetic unit of said L2 phonetic unit transcription in an L2 acoustic-prosodic model set, searching an L2-to-L1 phonetic unit transformation table, L1 being a first language, and using at least a controllable accent weighting parameter to determine a transformation combination to select a corresponding L1 phonetic unit transcription and find a first acoustic-prosodic model corresponding to each phonetic unit of said L1 phonetic unit transcription in an L1 acoustic-prosodic model set;
  
  merging said first and said second acoustic-prosodic models into a merged acoustic-prosodic model according to said at least a controllable accent weighting parameter, processing all transformations in said transformation combination, and generating a merged acoustic-prosodic model sequence; and
  
  applying said merged acoustic-prosodic model set to a speech synthesizer to synthesize said inputted text into an L1-accent L2 speech.
- View Dependent Claims (8, 9, 10, 11, 12, 13, 14)
- - 8. The method as claimed in claim 7, said method further comprising constructing said phonetic unit transformation table, said constructing phonetic unit transformation table further comprising:
    - selecting a plurality of audio files and a plurality of L2 phonetic unit transcriptions corresponding to said audio files from an L2 speech bank;
      
      for each selected audio file, said L1 acoustic-prosodic model performing a free syllable speech recognition to generate a recognition result and transform said recognition result into an L1 phonetic unit transcription, using a dynamic programming to perform phonetic unit alignment on said L2 phonetic unit transcription corresponding to said audio file and said L1 phonetic unit transcription, after finishing dynamic programming, a transformation combination being obtained; and
      
      accumulating statistics from the obtained plurality of transformation combinations in above step to generate said phonetic unit transformation table.
  - 9. The method as claimed in claim 8, wherein said dynamic programming further comprises using Bhattacharyya distance, used in statistics to compute distance between two discrete probability distributions, to compute local distance between two acoustic-prosodic models.
  - 10. The method as claimed in claim 7, wherein said phonetic unit transformation table comprises three types of transformation, and said three types of transformation are substitution, insertion and deletion.
  - 11. The method as claimed in claim 10, wherein substitution is a one-to-one transformation, insertion is a one-to-many transformation and deletion is a many-to-one transformation.
  - 12. The method as claimed in claim 10, said method uses said dynamic programming to find at least a corresponding phonetic unit and at least a transformation type for said inputted text to be synthesized.
  - 13. The method as claimed in claim 7, wherein said merged acoustic-prosodic model further comprises a Gaussian density function g_new(μ
    - _new, Σ
      
      _new), expressed as;
      
      μ
      
      _new=w*μ
      
      ₁+(1−
      
      w)*μ
      
      ₂
      Σ
      
      _new=w*(Σ
      
      ₁+(μ
      
      ₁−
      
      μ
      
      _new)²)+(1−
      
      w)*(Σ
      
      ₂+(μ
      
      ₂−
      
      μ
      
      _new)²)where said first acoustic-prosodic model is expressed by a Gaussian density function as g₁(μ
      
      ₁,Σ
      
      ₁), said first acoustic-prosodic model is expressed by another Gaussian density function as g₂(μ
      
      ₂,Σ
      
      ₂), μ
      
      is average vector and Σ
      
      is co-variance matrix, 0≦
      
      w≦
      
      1.
  - 14. The method as claimed in claim 8, wherein said generating said recognition result further comprises performing a free tone recognition.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Industrial Technology Research Institute
Original Assignee
Industrial Technology Research Institute
Inventors
Kuo, Chih-Chung, LI, Jen-Yu, Tu, Jia-Jang

Granted Patent

US 8,898,066 B2
Time in Patent Office

Days
Field of Search
US Class Current

704/260
CPC Class Codes

G10L 13/086 Detection of language

G10L 13/10 Prosody rules derived from ...

MULTI-LINGUAL TEXT-TO-SPEECH SYSTEM AND METHOD

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

26 Citations

14 Claims

Specification

Solutions

Use Cases

Quick Links

MULTI-LINGUAL TEXT-TO-SPEECH SYSTEM AND METHOD

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

26 Citations

14 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links