Text to speech method and system converting acoustic units to speech vectors using language dependent weights for a selected language

US 8,825,485 B2
Filed: 06/10/2009
Issued: 09/02/2014
Est. Priority Date: 06/10/2009
Status: Active Grant

Abstract

A text-to-speech method for use in a plurality of languages, including: inputting text in a selected language; dividing the inputted text into a sequence of acoustic units; converting the sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein the model has a plurality of model parameters describing probability distributions which relate an acoustic unit to a speech vector; and outputting the sequence of speech vectors as audio in the selected language. A parameter of a predetermined type of each probability distribution in the selected language is expressed as a weighted sum of language independent parameters of the same type. The weighting used is language dependent, such that converting the sequence of acoustic units to a sequence of speech vectors includes retrieving the language dependent weights for the selected language.
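
As a minimal illustration of the weighted sum described in the abstract, the sketch below builds the mean of one probability distribution for a selected language from shared, language independent means. The array values, the language codes, and the names `cluster_means`, `language_weights` and `language_mean` are hypothetical and do not come from the patent.

```python
import numpy as np

# Hypothetical toy values: three language independent means (one per row),
# each a 2-dimensional speech-vector mean.
cluster_means = np.array([[1.0, 0.0],
                          [0.0, 2.0],
                          [4.0, 4.0]])

# Hypothetical language dependent weights, one weight per shared mean.
language_weights = {
    "en": np.array([1.0, 0.5, 0.0]),
    "ja": np.array([0.0, 0.5, 0.25]),
}


def language_mean(language: str) -> np.ndarray:
    """Return one distribution's mean as a weighted sum of the shared means."""
    weights = language_weights[language]  # retrieve the weights for the selected language
    return weights @ cluster_means        # (P,) @ (P, D) -> (D,)


en_mean = language_mean("en")
ja_mean = language_mean("ja")
```

Only the small weight vector changes between languages here; the shared means are reused unchanged.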

18 Claims

1. A text-to-speech method for use in a plurality of languages, said method comprising:
- inputting text in a selected language;
  
  dividing said inputted text into a sequence of acoustic units;
  
  converting said sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein said model has a plurality of model parameters describing probability distributions which relate an acoustic unit to a speech vector; and
  
  outputting said sequence of speech vectors as audio in said selected language, wherein a parameter of a predetermined type of each probability distribution in said selected language is expressed as a weighted sum of language independent parameters of the same type, and wherein the weighting used is language dependent, such that converting said sequence of acoustic units to a sequence of speech vectors comprises retrieving the language dependent weights for said selected language.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. A text-to-speech method according to claim 1, wherein the parameter of a predetermined type is a mean.
  - 3. A text-to-speech system according to claim 1, wherein the probability distributions are selected from a Gaussian distribution, Poisson distribution, Gamma distribution, Student-t distribution or Laplacian distribution.
  - 4. A text-to-speech method according to claim 1, further comprising:
    - selecting a voice for the audio output, obtaining transform parameters for said voice and transforming the speech vectors and/or model parameters for the selected language to the selected voice using said transform parameters.
  - 5. A text-to-speech method according to claim 1, wherein said acoustic units are phonemes, graphemes, context dependent phonemes or graphemes, diphones, triphones or syllables.
  - 6. A text-to-speech method according to claim 1, wherein said acoustic model is a hidden Markov model or a hidden semi-Markov model.
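
The conversion step of claim 1 together with the voice transform of claim 4 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patented implementation: each acoustic unit is mapped directly to the mean of a single distribution (a real hidden Markov model would also model durations and variances), and the voice transform is assumed to be an affine map. The function `synthesize_vectors`, its parameters, and all numeric values are hypothetical.

```python
import numpy as np


def synthesize_vectors(units, unit_cluster_means, weights, voice_A=None, voice_b=None):
    """Map a sequence of acoustic units to a sequence of speech vectors.

    unit_cluster_means[unit] is a (P, D) array of language independent means;
    weights is the (P,) language dependent weight vector for the selected language.
    voice_A and voice_b are an assumed affine voice transform (hypothetical).
    """
    vectors = []
    for unit in units:
        mean = weights @ unit_cluster_means[unit]  # weighted sum of shared means
        if voice_A is not None:
            mean = voice_A @ mean + voice_b        # transform to the selected voice
        vectors.append(mean)
    return np.stack(vectors)


# Hypothetical toy model with two acoustic units and two shared means per unit.
unit_cluster_means = {
    "a": np.array([[1.0, 0.0], [0.0, 1.0]]),
    "b": np.array([[2.0, 2.0], [0.0, 0.0]]),
}
weights = np.array([0.5, 0.5])

plain = synthesize_vectors(["a", "b"], unit_cluster_means, weights)
voiced = synthesize_vectors(["a", "b"], unit_cluster_means, weights,
                            voice_A=2.0 * np.eye(2), voice_b=np.array([1.0, -1.0]))
```

Claim 4 allows the transform to act on the speech vectors, the model parameters, or both; in this sketch the two coincide because each speech vector is a model mean.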

7. A method of training a text to speech system, said text-to-speech system comprising an acoustic model used to convert a sequence of acoustic units to a sequence of speech vectors, said model having a plurality of model parameters comprising language and speaker independent parameters, language dependent parameters and speaker dependent parameters describing probability distributions relating acoustic units to speech vectors, the method comprising:
- expressing the auxiliary function of an Expectation Maximisation algorithm in terms of language and speaker independent parameters, language dependent parameters and speaker dependent parameters, said auxiliary function involving summations of data from different languages and different speakers;
  
  receiving a plurality of inputs of audio speech each of which correspond to known text from a speaker in a known language, wherein at least two inputs have different languages;
  
  deriving an initial estimate of the language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters;
  
  maximising said auxiliary function with respect to language and speaker independent parameters, language dependent parameters and speaker dependent parameters to obtain a better estimate of said language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters;
  
  repeating said maximisation step until said language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters converge, wherein a predetermined type of parameter in each probability distribution for a language is expressed as a weighted sum of language independent parameters of the same type and said language dependent parameters are said weightings which are language dependent.
- Dependent claims (8, 9, 10, 11, 12, 13):
  - 8. A method according to claim 7, wherein said predetermined type of parameter is the mean of a probability distribution.
  - 9. A method according to claim 8, wherein said means are clustered and a language dependent weighting is applied for each cluster for each language.
  - 10. A method according to claim 9, wherein each cluster is a decision tree, the decisions represented by said trees being related to linguistic, phonetic or prosodic variations.
  - 11. A method according to claim 10, wherein constructing the decision tree is performed after a cycle of maximising the auxiliary function with respect to said language and speaker independent parameters, language dependent parameters and speaker dependent parameters.
  - 12. A method according to claim 7, wherein said speaker dependent parameters comprise transform parameters applied to a speech vector and/or speaker independent model parameters.
  - 13. A non-transitory carrier medium carrying computer readable instructions for controlling the computer to carry out the method of claim 7.
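
One maximisation step for the language dependent weights can be sketched as a weighted least-squares solve. This is a heavily simplified illustration, not the training procedure of the patent: it assumes a single Gaussian distribution with a diagonal variance, holds the language independent means and variances fixed, and takes the frame occupancies from the expectation step as given. The function `update_language_weights` and every value below are hypothetical.

```python
import numpy as np


def update_language_weights(observations, occupancies, cluster_means, variances):
    """Re-estimate one language's weights with means and variances held fixed.

    observations:  (T, D) speech vectors for the language.
    occupancies:   (T,) posterior occupancy of each frame (from the E-step).
    cluster_means: (P, D) language independent means, one per cluster.
    variances:     (D,) diagonal variance of the distribution.

    Maximising the Gaussian auxiliary function over the weights w, where the
    mean is w @ cluster_means, gives the linear system G w = k solved below.
    """
    scaled = cluster_means / variances                        # rows hold M^T Sigma^-1
    G = occupancies.sum() * scaled @ cluster_means.T          # (P, P)
    k = scaled @ (occupancies @ observations)                 # (P,)
    return np.linalg.solve(G, k)


# Hypothetical noise-free data generated from known weights.
cluster_means = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])
true_weights = np.array([0.7, 0.3])
observations = np.tile(true_weights @ cluster_means, (4, 1))
occupancies = np.array([1.0, 0.5, 0.5, 1.0])
variances = np.array([1.0, 2.0, 4.0])

estimated = update_language_weights(observations, occupancies, cluster_means, variances)
```

In the full method of claim 7 this step would alternate with updates of the language and speaker independent parameters and the speaker dependent parameters until all of them converge.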

14. A method of training a text to speech system, said text-to-speech system comprising an acoustic model used to convert a sequence of acoustic units to a sequence of speech vectors, said model having a plurality of model parameters comprising language and speaker independent parameters, language dependent parameters and speaker dependent parameters describing probability distributions relating acoustic units to speech vectors, the method comprising:
- expressing the auxiliary function of an Expectation Maximisation algorithm in terms of language and speaker independent parameters, language dependent parameters and speaker dependent parameters, said auxiliary function involving summations of data from different languages and different speakers;
  
  receiving a plurality of inputs of audio speech each of which correspond to known text from a speaker in a known language, wherein at least two inputs have different languages;
  
  deriving an initial estimate of the language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters;
  
  maximising said auxiliary function with respect to language and speaker independent parameters, language dependent parameters and speaker dependent parameters to obtain a better estimate of said language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters;
  
  repeating said maximisation step until said language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters converge, wherein said speaker and language independent parameters comprise variances of said probability distributions, wherein said variances are clustered and a decision tree is formed for each cluster.

15. A method of adapting a polyglot text-to-speech system to operate in a new language, said polyglot text-to-speech system comprising:
- an acoustic model used to convert a sequence of acoustic units to a sequence of speech vectors, said model having a plurality of model parameters comprising language and speaker independent parameters describing probability distributions relating acoustic units to speech vectors, language dependent parameters and speaker independent parameters and speaker dependent parameters, wherein a predetermined type of parameter in each probability distribution for a language is expressed as a weighted sum of language independent parameters of the same type and said language dependent parameters are said weightings which are language dependent, said method comprising:
  
  expressing the auxiliary function of an Expectation Maximisation algorithm in terms of language and speaker independent parameters, language dependent parameters and speaker dependent parameters, said auxiliary function involving summations of data from different languages and different speakers;
  
  receiving a plurality of inputs of audio speech each of which correspond to known text from at least two speakers in said new language;
  
  obtaining an initial estimate of the speaker dependent parameters used for the speakers of the new language;
  
  obtaining an initial estimate of the language dependent parameters for said new language;
  
  maximising said auxiliary function with respect to language dependent parameters and speaker dependent parameters to obtain a better estimate of said language dependent parameters and speaker dependent parameters for all speakers and languages;
  
  repeating said maximisation step until said language dependent parameters and the speaker dependent parameters converge.
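
The adaptation loop of claim 15 can be sketched as an alternating update in which the language independent means stay fixed and only the new language's weights and the speaker dependent parameters are re-estimated. This is an illustrative toy under strong assumptions: unit variances, precomputed per-state average speech vectors in place of expectation-step statistics, a squared-error objective standing in for the auxiliary function, and a speaker dependent parameter modelled as a simple additive bias. The function `adapt_new_language`, the speaker names, and all values are hypothetical.

```python
import numpy as np


def adapt_new_language(state_cluster_means, frame_means, frame_counts,
                       max_iters=200, tol=1e-10):
    """Alternately re-estimate language weights and per-speaker biases.

    state_cluster_means: list of (P, D) arrays, fixed language independent means.
    frame_means[s][k]:   average speech vector of speaker s in state k.
    frame_counts[s][k]:  number of frames behind that average.
    """
    num_clusters, dim = state_cluster_means[0].shape
    weights = np.full(num_clusters, 1.0 / num_clusters)  # initial language estimate
    biases = {s: np.zeros(dim) for s in frame_means}      # initial speaker estimates

    def error():
        # Count-weighted squared error between the data and the current model.
        return sum(frame_counts[s][k]
                   * np.sum((frame_means[s][k] - weights @ means - biases[s]) ** 2)
                   for s in frame_means
                   for k, means in enumerate(state_cluster_means))

    history = [error()]
    for _ in range(max_iters):
        # Update the language weights with the speaker biases held fixed.
        G = np.zeros((num_clusters, num_clusters))
        k_vec = np.zeros(num_clusters)
        for s in frame_means:
            for k, means in enumerate(state_cluster_means):
                n = frame_counts[s][k]
                G += n * means @ means.T
                k_vec += n * means @ (frame_means[s][k] - biases[s])
        weights = np.linalg.solve(G, k_vec)
        # Update each speaker bias with the language weights held fixed.
        for s in frame_means:
            total = sum(frame_counts[s])
            biases[s] = sum(frame_counts[s][k] * (frame_means[s][k] - weights @ means)
                            for k, means in enumerate(state_cluster_means)) / total
        history.append(error())
        if history[-2] - history[-1] < tol:  # stop once the objective has converged
            break
    return weights, biases, history


# Hypothetical new-language data from two speakers and two states.
state_cluster_means = [np.array([[1.0, 0.0], [0.0, 1.0]]),
                       np.array([[2.0, 0.0], [0.0, 3.0]])]
frame_means = {
    "speaker_a": [np.array([0.7, 0.3]), np.array([1.3, 1.1])],
    "speaker_b": [np.array([0.4, 0.6]), np.array([1.0, 1.4])],
}
frame_counts = {"speaker_a": [10, 10], "speaker_b": [10, 10]}

weights, biases, history = adapt_new_language(state_cluster_means, frame_means, frame_counts)
```

Each update minimises the objective exactly over one block of parameters, so the recorded error cannot increase from one iteration to the next.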

16. A text-to-speech processing system for use in a plurality of languages, said system comprising:
- a text input configured to accept inputted text;
  
  a processor configured to:
  
  divide said inputted text into a sequence of acoustic units;
  
  convert said sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein said model has a plurality of model parameters describing probability distributions which relate an acoustic unit to a speech vector; and
  
  output said sequence of speech vectors as audio in said selected language, wherein a parameter of a predetermined type of each probability distribution in said selected language is expressed as a weighted sum of language independent parameters of the same type, and wherein the weighting used is language dependent, such that converting said sequence of acoustic units to a sequence of speech vectors comprises retrieving the language dependent weights for said selected language.

17. A trainable text-to-speech system, the system comprising a processor configured to run an acoustic model which converts a sequence of acoustic units to a sequence of speech vectors, said model having a plurality of model parameters comprising language and speaker independent parameters, language dependent parameters and speaker dependent parameters describing probability distributions relating acoustic units to speech vectors, wherein a predetermined type of parameter in each probability distribution for a language is expressed as a weighted sum of language independent parameters of the same type and said language dependent parameters are said weightings which are language dependent, the processor being configured to:
- express the auxiliary function of an Expectation Maximisation algorithm in terms of language and speaker independent parameters, language dependent parameters and speaker dependent parameters, said auxiliary function involving summations of data from different languages and different speakers;
  
  receive a plurality of inputs of audio speech each of which correspond to known text from a speaker in a known language, wherein at least two inputs have different languages;
  
  derive an initial estimate of the language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters;
  
  maximise said auxiliary function with respect to language and speaker independent parameters, language dependent parameters and speaker dependent parameters to obtain a better estimate of said language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters; and
  
  repeat said maximisation until said language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters converge.

18. A polyglot text-to-speech system which is adaptable to a new language, said polyglot text-to-speech system comprising a processor configured to run an acoustic model used to convert a sequence of acoustic units to a sequence of speech vectors, said model having a plurality of model parameters comprising language and speaker independent parameters describing probability distributions relating acoustic units to speech vectors, language dependent parameters and speaker independent parameters and speaker dependent parameters, wherein a predetermined type of parameter in each probability distribution for a language is expressed as a weighted sum of language independent parameters of the same type and said language dependent parameters are said weightings which are language dependent, said processor being further configured to:
- express the auxiliary function of an Expectation Maximisation algorithm in terms of language and speaker independent parameters, language dependent parameters and speaker dependent parameters, said auxiliary function involving summations of data from different languages and different speakers;
  
  receive a plurality of inputs of audio speech each of which correspond to known text from at least two speakers in said new language;
  
  obtain an initial estimate of the speaker dependent parameters used for the speakers of the new language;
  
  obtain an initial estimate of the language dependent parameters for said new language;
  
  maximise said auxiliary function with respect to language dependent parameters and speaker dependent parameters to obtain a better estimate of said language dependent parameters and speaker dependent parameters for all speakers and languages; and
  
  repeat said maximisation step until said language dependent parameters and the speaker dependent parameters converge.

Current Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation), Toshiba Digital Solutions Corporation (Toshiba Corporation)
Original Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Inventors: Chun, Byung Ha; Krstulovic, Sacha
Primary Examiner(s): ROBERTS, SHAUN A
Application Number: US13/377,706
Publication Number: US 20120278081A1
Time in Patent Office: 1,910 Days
Field of Search: 704/258, 704/260
US Class Current: 704/260
CPC Class Codes:
G10L 13/02   Methods for producing synth...
G10L 13/033   Voice editing, e.g. manipul...
G10L 13/04   Details of speech synthesis...
G10L 15/142   Hidden Markov Models [HMMs]
