Training of text-to-speech systems
Abstract
Building a data-driven text-to-speech system involves collecting a database of natural speech from which to train models or select segments for concatenation. Typically the speech in that database is produced by a single speaker. In this invention we include in our database speech from a multiplicity of speakers.
18 Claims
1. A method of constructing a model for use in a text-to-speech synthesis system, said method comprising the steps of:
providing a first input of speech from a first training speaker, the first input of speech including at least one sentence;
providing a second input of speech from a second training speaker, the second input of speech including at least one sentence;
obtaining a first set of features and a first corresponding observation value from the first input of speech;
said step of obtaining a first set of features and a first corresponding observation value including tracking pitch over each sentence;
obtaining a second set of features and a second corresponding observation value from the second input of speech;
said step of obtaining a second set of features and a second corresponding observation value including tracking pitch over each sentence; and
pooling said first and second corresponding observation values to obtain the model.
2. A method of constructing a model for use in a text-to-speech synthesis system, said method comprising the steps of:
providing a first input of speech from a first training speaker, the first input of speech including at least one sentence;
providing additional inputs of speech from a plurality of additional training speakers, the additional inputs of speech each including at least one sentence;
obtaining a set of features and a corresponding observation value from the first input of speech;
said step of obtaining a first set of features and a first corresponding observation value including tracking pitch over each sentence;
repeating said step of obtaining a set of features and a corresponding observation value, including tracking pitch over each sentence, for each of the plurality of additional inputs of speech;
pooling said corresponding observation values, from said first speaker and said additional speakers, to obtain the model.
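The pooling step recited in claims 1 and 2 can be illustrated with a minimal sketch. The claims do not say what form the features, the observation values, or the resulting model take, so everything named here is an assumption made for illustration: features are tuples of labels, observation values are pitch readings in Hz, and the "model" is simply the mean observation per feature tuple. The function names `pool_observations` and `fit_mean_model` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def pool_observations(per_speaker):
    """Merge (features, observation) pairs from every speaker into one table.

    per_speaker maps a speaker id to a list of (features, observation) pairs.
    The result maps each feature tuple to all observations seen for it,
    regardless of which speaker produced them.
    """
    pooled = defaultdict(list)
    for pairs in per_speaker.values():
        for features, observation in pairs:
            pooled[features].append(observation)
    return dict(pooled)


def fit_mean_model(pooled):
    """Hypothetical stand-in for model training: mean observation per feature tuple."""
    return {features: mean(values) for features, values in pooled.items()}


# Illustrative data only (not taken from the patent): two training speakers,
# one (features, observation) pair per context.
per_speaker = {
    "speaker_a": [
        (("stressed", "phrase_final"), 110.0),
        (("unstressed", "phrase_medial"), 95.0),
    ],
    "speaker_b": [
        (("stressed", "phrase_final"), 210.0),
        (("unstressed", "phrase_medial"), 190.0),
    ],
}

pooled = pool_observations(per_speaker)
model = fit_mean_model(pooled)
```

With the illustrative data above, each feature tuple ends up with one observation from each speaker, which is the sense in which the model is obtained from pooled rather than single-speaker data.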
3. A method for enrolling training data for a text-to-speech synthesis system, said method comprising the steps of:
collecting speech data from at least two speakers, the speech data from each speaker including at least one sentence;
ascertaining at least one characteristic relating to the speech data of each speaker;
said ascertaining step comprising tracking pitch over each sentence; and
creating a target range of speech data via transforming the at least one characteristic relating to the speech data of each speaker.
Dependent claims: 4, 5, 6, 7, 8
the observation values comprise pitch values; and
said normalizing step comprises calculating average pitch over a predetermined quantity of speech data and thence obtaining normalized pitch values via dividing each pitch value within the predetermined quantity of speech data by said average.
8. The method according to claim 7, wherein said transforming step comprises multiplying each normalized pitch value by a target pitch value, the target pitch value being the average pitch of a target speaker.
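The normalizing and transforming steps described above reduce to simple arithmetic: divide each pitch value by the average pitch of the chosen quantity of speech, then multiply each normalized value by the target speaker's average pitch. The following is a minimal sketch under stated assumptions: pitch values are given in Hz for voiced frames only, the "predetermined quantity of speech data" is taken to be one sentence, and the function names `normalize_pitch` and `transform_to_target` are hypothetical.

```python
from statistics import mean


def normalize_pitch(pitch_hz):
    """Divide each pitch value by the average pitch of the same stretch of speech."""
    if not pitch_hz:
        raise ValueError("need at least one pitch value")
    average = mean(pitch_hz)
    return [p / average for p in pitch_hz]


def transform_to_target(normalized, target_average_hz):
    """Multiply each normalized pitch value by the target speaker's average pitch."""
    return [n * target_average_hz for n in normalized]


# Illustrative values only: one sentence from a training speaker whose
# average pitch is 100 Hz, mapped to a target speaker averaging 200 Hz.
sentence_pitch = [100.0, 120.0, 80.0]
normalized = normalize_pitch(sentence_pitch)
mapped = transform_to_target(normalized, 200.0)
```

In this example the normalized contour averages 1.0, so the mapped contour averages the target speaker's 200 Hz while keeping the relative rises and falls of the original sentence.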
9. An apparatus for constructing a model for use in a text-to-speech synthesis system, said apparatus comprising:
an input arrangement which provides:
a first input of speech from a first training speaker, the first input of speech including at least one sentence; and
a second input of speech from a second training speaker, the second input of speech including at least one sentence;
an extracting arrangement which obtains a first set of features and a first corresponding observation value from the first input of speech;
said extracting arrangement being adapted to further obtain a second set of features and a second corresponding observation value from the second input of speech;
said extracting arrangement being adapted to track pitch over each sentence; and
a pooling arrangement which pools said first and second corresponding observation values to obtain the model.
10. An apparatus for constructing a model for use in a text-to-speech synthesis system, said apparatus comprising:
an input arrangement which provides:
a first input of speech from a first training speaker, the first input of speech including at least one sentence; and
additional inputs of speech from a plurality of additional training speakers, the additional inputs of speech each including at least one sentence;
an extracting arrangement which obtains a set of features and a corresponding observation value from the first input of speech;
said extracting arrangement being adapted to further obtain a set of features and a corresponding observation value for each of the plurality of additional inputs of speech;
said extracting arrangement being adapted to track pitch over each sentence; and
a pooling arrangement which pools said corresponding observation values, from said first speaker and said additional speakers, to obtain the model.
11. An apparatus for enrolling training data for a text-to-speech synthesis system, said apparatus comprising:
an input arrangement which collects speech data from at least two speakers, the speech data from each speaker including at least one sentence;
an ascertaining arrangement which ascertains at least one characteristic relating to the speech data of each speaker;
said ascertaining arrangement being adapted to track pitch over each sentence; and
a target range creator which creates a target range of speech data via transforming the at least one characteristic relating to the speech data of each speaker.
Dependent claims: 12, 13, 14, 15, 16
the observation values comprise pitch values; and
said normalizer is adapted to calculate average pitch over a predetermined quantity of speech data and thence obtain normalized pitch values via dividing each pitch value within the predetermined quantity of speech data by said average.
16. The apparatus according to claim 15, wherein said target range creator is adapted to multiply each normalized pitch value by a target pitch value, the target pitch value being the average pitch of a target speaker.
17. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for constructing a model for use in a text-to-speech synthesis system, said method comprising the steps of:
providing a first input of speech from a first training speaker, the first input of speech including at least one sentence;
providing a second input of speech from a second training speaker, the second input of speech including at least one sentence;
obtaining a first set of features and a first corresponding observation value from the first input of speech;
said step of obtaining a first set of features and a first corresponding observation value including tracking pitch over each sentence;
obtaining a second set of features and a second corresponding observation value from the second input of speech;
said step of obtaining a second set of features and a second corresponding observation value including tracking pitch over each sentence; and
pooling said first and second corresponding observation values to obtain the model.
18. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for enrolling training data for a text-to-speech synthesis system, said method comprising the steps of:
collecting speech data from at least two speakers, the speech data from each speaker including at least one sentence;
ascertaining at least one characteristic relating to the speech data of each speaker;
said ascertaining step comprising tracking pitch over each sentence; and
creating a target range of speech data via transforming the at least one characteristic relating to the speech data of each speaker.
Specification