Text to speech method and system converting acoustic units to speech vectors using language dependent weights for a selected language
Abstract
A text-to-speech method for use in a plurality of languages, including: inputting text in a selected language; dividing the inputted text into a sequence of acoustic units; converting the sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein the model has a plurality of model parameters describing probability distributions which relate an acoustic unit to a speech vector; and outputting the sequence of speech vectors as audio in the selected language. A parameter of a predetermined type of each probability distribution in the selected language is expressed as a weighted sum of language independent parameters of the same type. The weighting used is language dependent, such that converting the sequence of acoustic units to a sequence of speech vectors includes retrieving the language dependent weights for the selected language.
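The weighted-sum construction described above can be illustrated with a minimal sketch. It assumes that the "parameter of a predetermined type" is the mean vector of one probability distribution; the names `CLUSTER_MEANS`, `LANGUAGE_WEIGHTS`, `language_mean` and `mean_for`, the language codes, and all numeric values are hypothetical and do not come from the patent.

```python
from typing import Dict, List

# Hypothetical language-independent means (one vector per cluster) for a
# single probability distribution. Values are illustrative only.
CLUSTER_MEANS: List[List[float]] = [
    [1.0, 0.0],
    [0.0, 2.0],
    [1.0, 1.0],
]

# Hypothetical language-dependent weights: one weight per cluster.
LANGUAGE_WEIGHTS: Dict[str, List[float]] = {
    "en": [1.0, 0.5, 0.0],
    "fr": [1.0, 0.0, 0.5],
}


def language_mean(cluster_means: List[List[float]],
                  weights: List[float]) -> List[float]:
    """Return sum_i weights[i] * cluster_means[i], computed per dimension."""
    if len(cluster_means) != len(weights):
        raise ValueError("need exactly one weight per cluster")
    dim = len(cluster_means[0])
    return [
        sum(w * mean[d] for w, mean in zip(weights, cluster_means))
        for d in range(dim)
    ]


def mean_for(language: str) -> List[float]:
    """Retrieve the stored weights for the selected language and apply them."""
    return language_mean(CLUSTER_MEANS, LANGUAGE_WEIGHTS[language])
```

In this sketch, switching language changes only the weight vector that is looked up; the cluster means are shared across all languages.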
18 Claims
1. A text-to-speech method for use in a plurality of languages, said method comprising:
inputting text in a selected language;
dividing said inputted text into a sequence of acoustic units;
converting said sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein said model has a plurality of model parameters describing probability distributions which relate an acoustic unit to a speech vector; and
outputting said sequence of speech vectors as audio in said selected language,
wherein a parameter of a predetermined type of each probability distribution in said selected language is expressed as a weighted sum of language independent parameters of the same type, and wherein the weighting used is language dependent, such that converting said sequence of acoustic units to a sequence of speech vectors comprises retrieving the language dependent weights for said selected language.
Dependent claims: 2, 3, 4, 5, 6
7. A method of training a text to speech system, said text-to-speech system comprising an acoustic model used to convert a sequence of acoustic units to a sequence of speech vectors, said model having a plurality of model parameters comprising language and speaker independent parameters, language dependent parameters and speaker dependent parameters describing probability distributions relating acoustic units to speech vectors, the method comprising:
expressing the auxiliary function of an Expectation Maximisation algorithm in terms of language and speaker independent parameters, language dependent parameters and speaker dependent parameters, said auxiliary function involving summations of data from different languages and different speakers;
receiving a plurality of inputs of audio speech each of which correspond to known text from a speaker in a known language, wherein at least two inputs have different languages;
deriving an initial estimate of the language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters;
maximising said auxiliary function with respect to language and speaker independent parameters, language dependent parameters and speaker dependent parameters to obtain a better estimate of said language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters;
repeating said maximisation step until said language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters converge,
wherein a predetermined type of parameter in each probability distribution for a language is expressed as a weighted sum of language independent parameters of the same type and said language dependent parameters are said weightings which are language dependent.
Dependent claims: 8, 9, 10, 11, 12, 13
14. A method of training a text to speech system, said text-to-speech system comprising an acoustic model used to convert a sequence of acoustic units to a sequence of speech vectors, said model having a plurality of model parameters comprising language and speaker independent parameters, language dependent parameters and speaker dependent parameters describing probability distributions relating acoustic units to speech vectors, the method comprising:
expressing the auxiliary function of an Expectation Maximisation algorithm in terms of language and speaker independent parameters, language dependent parameters and speaker dependent parameters, said auxiliary function involving summations of data from different languages and different speakers;
receiving a plurality of inputs of audio speech each of which correspond to known text from a speaker in a known language, wherein at least two inputs have different languages;
deriving an initial estimate of the language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters;
maximising said auxiliary function with respect to language and speaker independent parameters, language dependent parameters and speaker dependent parameters to obtain a better estimate of said language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters;
repeating said maximisation step until said language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters converge,
wherein said speaker and language independent parameters comprise variances of said probability distributions, wherein said variances are clustered and a decision tree is formed for each cluster.
15. A method of adapting a polyglot text-to-speech system to operate in a new language, said polyglot text-to-speech system comprising:
an acoustic model used to convert a sequence of acoustic units to a sequence of speech vectors, said model having a plurality of model parameters comprising language and speaker independent parameters describing probability distributions relating acoustic units to speech vectors, language dependent parameters and speaker independent parameters and speaker dependent parameters, wherein a predetermined type of parameter in each probability distribution for a language is expressed as a weighted sum of language independent parameters of the same type and said language dependent parameters are said weightings which are language dependent,
said method comprising:
expressing the auxiliary function of an Expectation Maximisation algorithm in terms of language and speaker independent parameters, language dependent parameters and speaker dependent parameters, said auxiliary function involving summations of data from different languages and different speakers;
receiving a plurality of inputs of audio speech each of which correspond to known text from at least two speakers in said new language;
obtaining an initial estimate of the speaker dependent parameters used for the speakers of the new language;
obtaining an initial estimate of the language dependent parameters for said new language;
maximising said auxiliary function with respect to language dependent parameters and speaker dependent parameters to obtain a better estimate of said language dependent parameters and speaker dependent parameters for all speakers and languages;
repeating said maximisation step until said language dependent parameters and the speaker dependent parameters converge.
16. A text-to-speech processing system for use in a plurality of languages, said system comprising:
a text input configured to accept inputted text;
a processor configured to:
divide said inputted text into a sequence of acoustic units;
convert said sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein said model has a plurality of model parameters describing probability distributions which relate an acoustic unit to a speech vector; and
output said sequence of speech vectors as audio in said selected language,
wherein a parameter of a predetermined type of each probability distribution in said selected language is expressed as a weighted sum of language independent parameters of the same type, and wherein the weighting used is language dependent, such that converting said sequence of acoustic units to a sequence of speech vectors comprises retrieving the language dependent weights for said selected language.
17. A trainable text-to-speech system, the system comprising a processor configured to run an acoustic model which converts a sequence of acoustic units to a sequence of speech vectors, said model having a plurality of model parameters comprising language and speaker independent parameters, language dependent parameters and speaker dependent parameters describing probability distributions relating acoustic units to speech vectors, wherein a predetermined type of parameter in each probability distribution for a language is expressed as a weighted sum of language independent parameters of the same type and said language dependent parameters are said weightings which are language dependent, the processor being configured to:
express the auxiliary function of an Expectation Maximisation algorithm in terms of language and speaker independent parameters, language dependent parameters and speaker dependent parameters, said auxiliary function involving summations of data from different languages and different speakers;
receive a plurality of inputs of audio speech each of which correspond to known text from a speaker in a known language, wherein at least two inputs have different languages;
derive an initial estimate of the language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters;
maximise said auxiliary function with respect to language and speaker independent parameters, language dependent parameters and speaker dependent parameters to obtain a better estimate of said language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters; and
repeat said maximisation until said language and speaker independent parameters, the language dependent parameters and the speaker dependent parameters converge.
18. A polyglot text-to-speech system which is adaptable to a new language, said polyglot text-to-speech system comprising a processor configured to run an acoustic model used to convert a sequence of acoustic units to a sequence of speech vectors, said model having a plurality of model parameters comprising language and speaker independent parameters describing probability distributions relating acoustic units to speech vectors, language dependent parameters and speaker independent parameters and speaker dependent parameters, wherein a predetermined type of parameter in each probability distribution for a language is expressed as a weighted sum of language independent parameters of the same type and said language dependent parameters are said weightings which are language dependent, said processor being further configured to:
express the auxiliary function of an Expectation Maximisation algorithm in terms of language and speaker independent parameters, language dependent parameters and speaker dependent parameters, said auxiliary function involving summations of data from different languages and different speakers;
receive a plurality of inputs of audio speech each of which correspond to known text from at least two speakers in said new language;
obtain an initial estimate of the speaker dependent parameters used for the speakers of the new language;
obtain an initial estimate of the language dependent parameters for said new language;
maximise said auxiliary function with respect to language dependent parameters and speaker dependent parameters to obtain a better estimate of said language dependent parameters and speaker dependent parameters for all speakers and languages; and
repeat said maximisation step until said language dependent parameters and the speaker dependent parameters converge.
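The adaptation claims keep the language independent parameters fixed and re-estimate only the language dependent weights (and the speaker dependent parameters). The following is a heavily reduced sketch of the weight update alone, not the claimed Expectation Maximisation procedure. It assumes a single probability distribution, exactly two shared cluster means, identity covariance, every frame already assigned to that distribution, and no speaker dependent parameters. Under those assumptions, maximising the Gaussian log-likelihood with respect to the weights reduces to an ordinary least-squares fit. The function name `estimate_language_weights` and all values are hypothetical.

```python
from typing import List, Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def estimate_language_weights(cluster_means: List[List[float]],
                              frames: List[List[float]]) -> List[float]:
    """Least-squares weights for a new language, cluster means held fixed.

    Assumptions of this sketch (not taken from the patent): exactly two
    cluster means, identity covariance, and all frames aligned to the
    same distribution.
    """
    if len(cluster_means) != 2:
        raise ValueError("this sketch handles exactly two clusters")
    if not frames:
        raise ValueError("need at least one frame")
    m0, m1 = cluster_means
    dim = len(m0)

    # Sample mean of the new-language speech vectors.
    ybar = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

    # Normal equations of the least-squares problem:
    #   [m0.m0  m0.m1] [w0]   [m0.ybar]
    #   [m0.m1  m1.m1] [w1] = [m1.ybar]
    a00, a01, a11 = _dot(m0, m0), _dot(m0, m1), _dot(m1, m1)
    b0, b1 = _dot(m0, ybar), _dot(m1, ybar)

    det = a00 * a11 - a01 * a01
    if abs(det) < 1e-12:
        raise ValueError("cluster means are linearly dependent")

    # Solve the 2x2 system with Cramer's rule.
    w0 = (a11 * b0 - a01 * b1) / det
    w1 = (a00 * b1 - a01 * b0) / det
    return [w0, w1]
```

In this reduced setting the weights have a closed form, so no repetition is needed. The claims repeat the maximisation because language dependent and speaker dependent parameters are estimated together, which this sketch deliberately omits.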
Specification