SPEECH PROCESSING SYSTEM AND METHOD
Abstract
A method of training an acoustic model for a text-to-speech system, the method comprising: receiving speech data, said speech data comprising data corresponding to different values of a first speech factor, and wherein said speech data is unlabelled, such that for a given item of speech data, the value of said first speech factor is unknown; clustering said speech data according to the value of said first speech factor into a first set of clusters; and estimating a first set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor, wherein said clustering and said first parameter estimation are jointly performed according to a common maximum likelihood criterion.
24 Claims
1. A method of training an acoustic model for a text-to-speech system, the method comprising:
receiving speech data, said speech data comprising data corresponding to different values of a first speech factor, and wherein said speech data is unlabelled, such that for a given item of speech data, the value of said first speech factor is unknown;
clustering said speech data according to the value of said first speech factor into a first set of clusters; and
estimating a first set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor,
wherein said clustering and said first parameter estimation are jointly performed according to a common maximum likelihood criterion.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 18, 22)
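The joint clustering and parameter estimation of claim 1 can be illustrated with expectation-maximisation: the E-step soft-clusters the unlabelled data and the M-step re-estimates the parameters, and both steps increase one shared log-likelihood. The following is a minimal one-dimensional Gaussian-mixture sketch; the Gaussian form, the function name, and the quantile initialisation are illustrative assumptions, not the patent's model.

```python
import numpy as np

def train_joint(data, n_clusters, n_iters=50):
    # Hypothetical sketch: clustering (E-step) and parameter estimation
    # (M-step) jointly maximise one common log-likelihood, as in claim 1.
    means = np.quantile(data, np.linspace(0.2, 0.8, n_clusters))
    var = np.full(n_clusters, np.var(data)) + 1e-6
    weights = np.full(n_clusters, 1.0 / n_clusters)
    for _ in range(n_iters):
        # E-step: soft-cluster the unlabelled data over the latent factor
        log_p = (-0.5 * (data[:, None] - means) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(weights))
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)
        # M-step: re-estimate parameters under the same ML criterion
        nk = resp.sum(axis=0)
        means = (resp * data[:, None]).sum(axis=0) / nk
        var = (resp * (data[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
        weights = nk / len(data)
    return means, var, weights, float(log_norm.sum())
```

Because both steps optimise the same criterion, the returned log-likelihood is non-decreasing over iterations, which is the sense in which clustering and estimation are "jointly performed".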
14. A text-to-speech method configured to output speech having a target value of a speech factor, said method comprising:
inputting audio data with said target value of a speech factor;
adapting an acoustic model to said target value of a speech factor;
inputting text;
dividing said inputted text into a sequence of acoustic units;
converting said sequence of acoustic units into a sequence of speech vectors using said acoustic model; and
outputting said sequence of speech vectors as audio with said target value of a speech factor,
wherein said acoustic model comprises a set of speech factor parameters relating to said speech factor, and a set of speech factor clusters relating to said speech factor, and wherein said set of speech factor parameters and said set of speech factor clusters relating to said speech factor are unlabelled, such that for a given one or more clusters and a given one or more parameters, the value of said speech factor to which they relate is unknown.
(Dependent claims: 15, 23)
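The adapt-then-synthesise flow of claim 14 can be sketched with a toy cluster-adaptive model: the unlabelled clusters carry per-unit mean vectors, adaptation fits interpolation weights over those clusters from the target-value audio, and synthesis reuses the fitted weights for new text. All names (`ClusterAdaptiveModel`, `adapt`, `synthesize`), the character-level acoustic units, and the least-squares weight fit are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

class ClusterAdaptiveModel:
    # Toy cluster-adaptive acoustic model (illustrative). Each acoustic
    # unit maps to an array of shape (n_clusters, dim); the model output
    # is a weighted interpolation over the unlabelled clusters.
    def __init__(self, cluster_means):
        self.cluster_means = cluster_means
        n = next(iter(cluster_means.values())).shape[0]
        self.weights = np.full(n, 1.0 / n)

    def adapt(self, adaptation_data):
        # Fit cluster weights to (unit, observed vector) pairs from the
        # target-value audio by least squares.
        A = np.vstack([self.cluster_means[u].T for u, _ in adaptation_data])
        b = np.concatenate([v for _, v in adaptation_data])
        self.weights, *_ = np.linalg.lstsq(A, b, rcond=None)

    def synthesize(self, text):
        units = list(text)  # divide inputted text into acoustic units
        # convert units to speech vectors using the adapted weights
        return [self.cluster_means[u].T @ self.weights for u in units]
```

A usage sketch: `model.adapt(pairs)` with target-value audio, then `model.synthesize("ab")` yields speech vectors carrying that target value; a real system would vocode these into audio.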
16. A text-to-speech method, the method comprising:
receiving input text;
dividing said inputted text into a sequence of acoustic units;
converting said sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein said acoustic model comprises a set of speaker parameters and a set of speaker clusters relating to speaker voice and a set of expression parameters and a set of expression clusters relating to expression, and wherein the sets of speaker and expression parameters and the sets of speaker and expression clusters do not overlap; and
outputting said sequence of speech vectors as audio,
the method further comprising determining at least some of said parameters relating to expression by:
extracting expressive features from said input text to form an expressive linguistic feature vector constructed in a first space; and
mapping said expressive linguistic feature vector to an expressive synthesis feature vector which is constructed in a second space.
(Dependent claims: 17, 24)
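The final two steps of claim 16, a linguistic feature vector constructed in a first space mapped to a synthesis feature vector constructed in a second space, can be sketched under the assumption of an affine map fitted by least squares; the patent does not fix the mapping's form, and the function names are illustrative.

```python
import numpy as np

def fit_expression_mapping(linguistic_feats, synthesis_feats):
    # Least-squares fit of an affine map from the first (linguistic)
    # space to the second (synthesis) space; an illustrative choice.
    X = np.hstack([linguistic_feats, np.ones((len(linguistic_feats), 1))])
    W, *_ = np.linalg.lstsq(X, synthesis_feats, rcond=None)
    return W

def map_expression(W, linguistic_vec):
    # Map one expressive linguistic feature vector into the second space.
    return np.append(linguistic_vec, 1.0) @ W
```

The two spaces may have different dimensionality; here `W` has shape `(d_linguistic + 1, d_synthesis)`, so the map both projects and translates between the spaces.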
19. A system for training an acoustic model for a text-to-speech system, said system comprising:
an input for receiving speech data corresponding to different values of a first speech factor, wherein said speech data is unlabelled, such that, for a given item of data, the value of said first speech factor is unknown; and
a processor configured to:
cluster said speech data according to the value of said first speech factor into a first set of clusters; and
estimate a first set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor,
wherein said clustering and said first parameter estimation are jointly performed according to a single maximum likelihood criterion which is common to both said first parameter estimation and said clustering into said first set of clusters.
20. A system configured to output speech having a target value of a speech factor, said system comprising:
an input for receiving adaptation data with said target value of a speech factor;
an input for receiving text; and
a processor configured to:
adapt an acoustic model to said target value of a speech factor;
divide said inputted text into a sequence of acoustic units;
convert said sequence of acoustic units into a sequence of speech vectors using said acoustic model; and
output said sequence of speech vectors as audio with said target value of a speech factor,
wherein said acoustic model comprises a first set of parameters relating to said speech factor, and a first set of clusters relating to said speech factor, and wherein said first set of parameters and said first set of clusters relating to said speech factor are unlabelled, such that for a given one or more clusters and a given one or more parameters, the value of said first speech factor is unknown.
21. A text-to-speech system, the system comprising:
an input for receiving input text; and
a processor configured to:
divide said inputted text into a sequence of acoustic units;
convert said sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein said acoustic model comprises a first set of parameters and a first set of clusters relating to speaker voice and a second set of parameters and a second set of clusters relating to expression, and wherein the first and second sets of parameters and the first and second sets of clusters do not overlap;
output said sequence of speech vectors as audio; and
determine at least some of said parameters relating to expression by:
extracting expressive features from said input text to form an expressive linguistic feature vector constructed in a first space; and
mapping said expressive linguistic feature vector to an expressive synthesis feature vector which is constructed in a second space.