Text to speech processing system and method, and an acoustic model training system and method
Abstract
A method of training an acoustic model for a text-to-speech system, the method comprising: receiving speech data, said speech data comprising data corresponding to different values of a first speech factor, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said first speech factor is unknown; clustering said speech data according to the value of said first speech factor into a first set of clusters; and estimating a first set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor, wherein said clustering and said first parameter estimation are jointly performed according to a common maximum likelihood criterion.
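The abstract's central step, jointly clustering unlabeled speech data and estimating per-cluster parameters under a single maximum likelihood criterion, can be illustrated with a minimal expectation-maximisation (EM) sketch. This is not the patent's implementation: it assumes one-dimensional features and two Gaussian clusters, and all names are hypothetical.

```python
import math

def em_cluster(data, iters=50):
    """Jointly cluster unlabeled 1-D data into two clusters and estimate
    per-cluster Gaussian parameters, with both the (soft) clustering and
    the parameter estimates driven by one maximum likelihood criterion.
    Illustrative sketch only."""
    # Deterministic initialisation: one mean at each data extreme.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each cluster for each point
        # (this is the "clustering" of the unlabeled data).
        resp = []
        for x in data:
            p = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                 for w, m, v in zip(weights, means, variances)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: re-estimate parameters to increase the same likelihood.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(1e-4, sum(r[j] * (x - means[j]) ** 2
                                         for r, x in zip(resp, data)) / nj)
            weights[j] = nj / len(data)
    return means, variances, weights
```

In the patent's setting the clusters would range over values of a speech factor (e.g. speaker or expression) and the estimated quantities would be cluster-dependent acoustic model parameters, but the joint maximum-likelihood structure is the same.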
9 Claims
1. A text-to-speech method configured to output speech having a target value of a speech factor, said method comprising:
inputting audio data with said target value of a speech factor;
adapting an acoustic model to said target value of a speech factor;
inputting text;
dividing said inputted text into a sequence of acoustic units;
converting said sequence of acoustic units into a sequence of speech vectors using said acoustic model; and
outputting said sequence of speech vectors as audio with said target value of a speech factor,
wherein said acoustic model comprises a set of speech factor parameters that enable the acoustic model to accommodate speech for the different values of a speech factor,
and wherein said set of speech factor parameters are unlabeled, such that for a given one or more parameters, the value of said speech factor to which they relate is unknown,
and wherein adapting the acoustic model comprises adjusting the speech factor parameters to substantially match the target value of a speech factor,
wherein said text-to-speech method includes training said acoustic model using a method comprising:
receiving speech data, said speech data comprising data corresponding to different values of the speech factor, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said speech factor is unknown;
clustering said speech data according to the value of said speech factor into a first set of clusters; and
estimating a first set of parameters to enable the acoustic model to accommodate speech for the different values of the speech factor, wherein said clustering and said first parameter estimation are jointly performed according to a common maximum likelihood criterion.
Dependent claims: 2 and 8.
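The pipeline claimed above (adapt the model to a target, divide text into acoustic units, convert units to speech vectors, output the result) can be sketched as a toy. Everything here is a hypothetical stand-in: acoustic units are letters, the "acoustic model" shifts each unit by a weighted combination of per-cluster offsets in the spirit of cluster-adaptive training, and adaptation simply sets those (unlabeled) cluster weights.

```python
def divide_into_acoustic_units(text):
    """Divide the inputted text into a sequence of acoustic units
    (toy stand-in: lower-case letters)."""
    return [c for c in text.lower() if c.isalpha()]

def adapt_weights(target_weights):
    """Adapt the acoustic model to the target value of the speech factor
    by adjusting its cluster interpolation weights (hypothetical)."""
    return list(target_weights)

def units_to_speech_vectors(units, cluster_offsets, weights):
    """Convert acoustic units into 'speech vectors': each unit's base
    value is shifted by a weighted combination of per-cluster offsets."""
    vectors = []
    for u in units:
        base = float(ord(u))
        vectors.append(base + sum(w * o for w, o in zip(weights, cluster_offsets)))
    return vectors

# End-to-end: adapt, divide, convert.
weights = adapt_weights([0.25, 0.75])
units = divide_into_acoustic_units("Hi!")
vectors = units_to_speech_vectors(units, cluster_offsets=[0.0, 2.0],
                                  weights=weights)
```

A real system would use phonemes or context-dependent states as acoustic units and a statistical acoustic model (e.g. HMM-based) to produce the speech vectors; the sketch only mirrors the claimed sequence of steps.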
3. A text-to-speech method, the method comprising:
receiving input text;
dividing said inputted text into a sequence of acoustic units;
converting said sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein said acoustic model comprises a set of speaker parameters and a set of speaker clusters relating to speaker voice and a set of expression parameters and a set of expression clusters relating to expression, and wherein the sets of speaker and expression parameters and the sets of speaker and expression clusters do not overlap; and
outputting said sequence of speech vectors as audio,
the method further comprising determining at least some of said parameters relating to expression by:
extracting expressive features from said input text to form an expressive linguistic feature vector constructed in a first space; and
mapping said expressive linguistic feature vector to an expressive synthesis feature vector which is constructed in a second space,
wherein said text-to-speech method includes training said acoustic model using a method comprising:
receiving speech data, said speech data further comprising speech data from one or more speakers speaking with neutral speech, said speech data comprising data corresponding to different values of a first speech factor, wherein the first speech factor is speaker, and speech data corresponding to different values of a second speech factor, wherein the second speech factor is expression, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said first speech factor is unknown;
clustering said speech data according to the value of said first speech factor into a first set of clusters and clustering said speech data according to the value of said second speech factor into a second set of clusters; and
estimating a first set and a second set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor and the second speech factor respectively,
wherein said clustering and the parameter estimation are jointly performed according to a common maximum likelihood criterion which is common to both parameter estimation and said clustering.
Dependent claims: 4 and 9.
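Claim 3 determines expression parameters by extracting expressive features from the input text into a linguistic feature vector (first space) and mapping it into a synthesis feature vector (second space). A toy sketch of that two-stage step is below; the particular features and the linear map are assumptions for illustration, not the patent's method.

```python
def extract_expressive_features(text):
    """Form an expressive linguistic feature vector from input text
    (hypothetical features: '!' rate, '?' rate, upper-case rate)."""
    n = max(1, len(text))
    return [text.count("!") / n,
            text.count("?") / n,
            sum(c.isupper() for c in text) / n]

def map_to_synthesis_space(linguistic_vec, transform):
    """Map the expressive linguistic feature vector (first space) into an
    expressive synthesis feature vector (second space); here the mapping
    is a plain linear transform."""
    return [sum(row[i] * linguistic_vec[i] for i in range(len(linguistic_vec)))
            for row in transform]

# A 2-row transform takes the 3-D linguistic space to a 2-D synthesis space.
feats = extract_expressive_features("Stop!")
synth = map_to_synthesis_space(feats, transform=[[1.0, 1.0, 0.0],
                                                 [0.0, 0.0, 1.0]])
```

The two spaces need not have the same dimensionality, which is why the mapping, rather than the raw linguistic features, is what feeds the synthesis side.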
5. A text-to-speech method, the method comprising:
receiving input text;
dividing said inputted text into a sequence of acoustic units;
converting said sequence of acoustic units to a sequence of speech vectors using an acoustic model; and
outputting said sequence of speech vectors as audio,
the method further comprising determining at least some of said second set of parameters by:
extracting expressive features from said input text to form an expressive linguistic feature vector constructed in a first space; and
mapping said expressive linguistic feature vector to an expressive synthesis feature vector which is constructed in a second space,
wherein said text-to-speech method includes training said acoustic model using a method comprising:
receiving speech data, said speech data further comprising speech data from one or more speakers speaking with neutral speech, said speech data comprising data corresponding to different values of a first speech factor, wherein the first speech factor is speaker, and speech data corresponding to different values of a second speech factor, wherein the second speech factor is expression, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said first speech factor is unknown;
clustering said speech data according to the value of said first speech factor into a first set of clusters and clustering said speech data according to the value of said second speech factor into a second set of clusters; and
estimating a first set and a second set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor and the second speech factor respectively,
wherein said clustering and the parameter estimation are jointly performed according to a common maximum likelihood criterion which is common to both parameter estimation and said clustering, and wherein said first and second set of parameters and said first and second set of clusters do not overlap.
6. A system configured to output speech having a target value of a speech factor, said system comprising:
an audio input for receiving audio data with said target value of a speech factor;
a text input for receiving text; and
a processor configured to:
adapt an acoustic model to said target value of a speech factor;
divide said inputted text into a sequence of acoustic units;
convert said sequence of acoustic units into a sequence of speech vectors using said acoustic model; and
output said sequence of speech vectors as audio with said target value of a speech factor,
wherein said acoustic model comprises a first set of parameters that enable the acoustic model to accommodate speech for the different values of the speech factor, and wherein said first set of parameters are unlabeled, such that for a given one or more parameters, the value of said speech factor is unknown,
and wherein adapting the acoustic model comprises adjusting the speech factor parameters to substantially match the target value of a speech factor,
wherein said acoustic model is trained using a method comprising:
receiving speech data, said speech data comprising data corresponding to different values of the speech factor, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said speech factor is unknown;
clustering said speech data according to the value of said speech factor into a first set of clusters; and
estimating a first set of parameters to enable the acoustic model to accommodate speech for the different values of the speech factor, wherein said clustering and said first parameter estimation are jointly performed according to a common maximum likelihood criterion.
7. A text-to-speech system, the system comprising:
an input for receiving input text; and
a processor configured to:
divide said inputted text into a sequence of acoustic units;
convert said sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein said acoustic model comprises a first set of parameters and a first set of clusters relating to speaker voice and a second set of parameters and a second set of clusters relating to expression, and wherein the first and second set of parameters and the first and second set of clusters do not overlap;
output said sequence of speech vectors as audio; and
determine at least some of said parameters relating to expression by:
extracting expressive features from said input text to form an expressive linguistic feature vector constructed in a first space; and
mapping said expressive linguistic feature vector to an expressive synthesis feature vector which is constructed in a second space,
wherein said acoustic model is trained using a method comprising:
receiving speech data, said speech data further comprising speech data from one or more speakers speaking with neutral speech, said speech data comprising data corresponding to different values of a first speech factor, wherein the first speech factor is speaker, and speech data corresponding to different values of a second speech factor, wherein the second speech factor is expression, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said first speech factor is unknown;
clustering said speech data according to the value of said first speech factor into a first set of clusters and clustering said speech data according to the value of said second speech factor into a second set of clusters; and
estimating a first set and a second set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor and the second speech factor respectively,
wherein said clustering and the parameter estimation are jointly performed according to a common maximum likelihood criterion which is common to both parameter estimation and said clustering.
Specification