Method and System for Building Text-to-Speech Voice from Diverse Recordings
First Claim
1. A method comprising:
extracting speech features from a plurality of recorded reference speech utterances of a reference speaker to generate a reference set of reference-speaker vectors;
for each respective plurality of recorded colloquial speech utterances of a respective colloquial speaker of multiple colloquial speakers, extracting speech features from the recorded colloquial speech utterances of the respective colloquial speaker to generate a respective set of colloquial-speaker vectors;
for each respective set of colloquial-speaker vectors, replacing each colloquial-speaker vector of the respective set of colloquial-speaker vectors with a respective, optimally-matched reference-speaker vector from among the reference set of reference-speaker vectors, the respective, optimally-matched reference-speaker vector being identified by matching under a transform that compensates for differences in speech between the reference speaker and the respective colloquial speaker;
aggregating the replaced colloquial-speaker vectors of all the respective sets of colloquial-speaker vectors into an aggregate set of conditioned speaker vectors;
providing the aggregate set of conditioned speaker vectors to a text-to-speech (TTS) system implemented on one or more computing devices; and
training the TTS system using the provided aggregate set of conditioned speaker vectors.
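The feature-extraction steps of the claim can be illustrated with a minimal sketch. The claim does not say which speech features are extracted, so the log-magnitude spectrum used here is an assumption, as are the function names `extract_feature_vectors` and `speaker_vector_set` and the frame parameters `frame_len` and `hop`. The sketch only shows how a set of recorded utterances of one speaker might become one set of per-frame vectors.

```python
import numpy as np

def extract_feature_vectors(waveform, frame_len=400, hop=160):
    """Return one log-magnitude spectrum vector per frame of a mono waveform.

    The feature type and frame sizes are assumptions made for illustration;
    the claim only requires that speech features be extracted.
    """
    waveform = np.asarray(waveform, dtype=np.float64)
    n_bins = frame_len // 2 + 1
    if len(waveform) < frame_len:
        # Too short to form a single frame: return an empty set of vectors.
        return np.empty((0, n_bins))
    n_frames = 1 + (len(waveform) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        waveform[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    # Small constant avoids log(0) on silent frames.
    return np.log(spectrum + 1e-8)

def speaker_vector_set(utterances):
    """Pool the frame vectors of all utterances of one speaker into one set."""
    return np.concatenate(
        [extract_feature_vectors(u) for u in utterances], axis=0
    )
```

Under these assumptions, calling `speaker_vector_set` once on the reference speaker's utterances would yield the reference set, and calling it once per colloquial speaker would yield each respective colloquial set.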
Abstract
A method and system are disclosed for building a speech database for a text-to-speech (TTS) synthesis system from multiple speakers recorded under diverse conditions. For a plurality of utterances of a reference speaker, a set of reference-speaker vectors may be extracted, and for each of a plurality of utterances of a colloquial speaker, a respective set of colloquial-speaker vectors may be extracted. A matching procedure, carried out under a transform that compensates for speaker differences, may be used to match each colloquial-speaker vector to a reference-speaker vector. The colloquial-speaker vector may be replaced with the matched reference-speaker vector. The matching-and-replacing can be carried out separately for each set of colloquial-speaker vectors. A conditioned set of speaker vectors can then be constructed by aggregating all the replaced speaker vectors. The conditioned set of speaker vectors can be used to train the TTS system.
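The matching, replacing, and aggregating steps described in the abstract can be sketched as follows. The abstract does not specify the compensating transform, so this sketch substitutes a simple per-dimension mean and variance alignment followed by a nearest-neighbour search; that choice, and the names `fit_affine_compensation`, `match_and_replace`, and `build_conditioned_set`, are assumptions for illustration and not the disclosed procedure.

```python
import numpy as np

def fit_affine_compensation(colloquial, reference):
    """Per-dimension scale and offset that align the colloquial statistics
    with the reference statistics (an assumed stand-in for the transform)."""
    scale = reference.std(axis=0) / (colloquial.std(axis=0) + 1e-8)
    offset = reference.mean(axis=0) - scale * colloquial.mean(axis=0)
    return scale, offset

def match_and_replace(colloquial, reference):
    """Replace each colloquial vector with its closest reference vector,
    where closeness is measured after applying the compensation."""
    scale, offset = fit_affine_compensation(colloquial, reference)
    compensated = colloquial * scale + offset
    # Pairwise squared Euclidean distances, shape (n_colloquial, n_reference).
    diffs = compensated[:, None, :] - reference[None, :, :]
    dist2 = (diffs ** 2).sum(axis=2)
    nearest = dist2.argmin(axis=1)
    return reference[nearest]

def build_conditioned_set(reference, colloquial_sets):
    """Match and replace each speaker's set separately, then aggregate."""
    return np.concatenate(
        [match_and_replace(c, reference) for c in colloquial_sets], axis=0
    )

# Usage with synthetic data: two colloquial speakers, one reference speaker.
rng = np.random.default_rng(0)
reference_set = rng.normal(size=(50, 4))
speaker_sets = [rng.normal(loc=2.0, size=(30, 4)),
                rng.normal(scale=3.0, size=(20, 4))]
conditioned = build_conditioned_set(reference_set, speaker_sets)
```

Every row of `conditioned` is a reference-speaker vector, so the aggregate keeps the reference voice quality while its sequence follows the colloquial recordings. The dense distance matrix is chosen for clarity; a large reference set would likely need an indexed or batched search.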
33 Claims
1. A method comprising:
extracting speech features from a plurality of recorded reference speech utterances of a reference speaker to generate a reference set of reference-speaker vectors;
for each respective plurality of recorded colloquial speech utterances of a respective colloquial speaker of multiple colloquial speakers, extracting speech features from the recorded colloquial speech utterances of the respective colloquial speaker to generate a respective set of colloquial-speaker vectors;
for each respective set of colloquial-speaker vectors, replacing each colloquial-speaker vector of the respective set of colloquial-speaker vectors with a respective, optimally-matched reference-speaker vector from among the reference set of reference-speaker vectors, the respective, optimally-matched reference-speaker vector being identified by matching under a transform that compensates for differences in speech between the reference speaker and the respective colloquial speaker;
aggregating the replaced colloquial-speaker vectors of all the respective sets of colloquial-speaker vectors into an aggregate set of conditioned speaker vectors;
providing the aggregate set of conditioned speaker vectors to a text-to-speech (TTS) system implemented on one or more computing devices; and
training the TTS system using the provided aggregate set of conditioned speaker vectors.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A system comprising:
one or more processors;
memory; and
machine-readable instructions stored in the memory, that upon execution by the one or more processors cause the system to carry out operations including:
extracting speech features from a plurality of recorded reference speech utterances of a reference speaker to generate a reference set of reference-speaker vectors,
for each respective plurality of recorded colloquial speech utterances of a respective colloquial speaker of multiple colloquial speakers, extracting speech features from the recorded colloquial speech utterances of the respective colloquial speaker to generate a respective set of colloquial-speaker vectors,
for each respective set of colloquial-speaker vectors, replacing each colloquial-speaker vector of the respective set of colloquial-speaker vectors with a respective, optimally-matched reference-speaker vector from among the reference set of reference-speaker vectors, wherein the respective, optimally-matched reference-speaker vector is identified by matching under a transform that compensates for differences in speech between the reference speaker and the respective colloquial speaker,
aggregating the replaced colloquial-speaker vectors of all the respective sets of colloquial-speaker vectors into an aggregate set of conditioned speaker vectors,
providing the aggregate set of conditioned speaker vectors to a text-to-speech (TTS) system, and
training the TTS system using the provided aggregate set of conditioned speaker vectors.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
23. An article of manufacture including a computer-readable storage medium having stored thereon program instructions that, upon execution by one or more processors of a system, cause the system to perform operations comprising:
for each respective plurality of recorded colloquial speech utterances of a respective colloquial speaker of multiple colloquial speakers, extracting speech features from the recorded colloquial speech utterances of the respective colloquial speaker to generate a respective set of colloquial-speaker vectors;
for each respective set of colloquial-speaker vectors, replacing each colloquial-speaker vector of the respective set of colloquial-speaker vectors with a respective, optimally-matched reference-speaker vector from among the reference set of reference-speaker vectors, wherein the respective, optimally-matched reference-speaker vector is identified by matching under a transform that compensates for differences in speech between the reference speaker and the respective colloquial speaker;
aggregating the replaced colloquial-speaker vectors of all the respective sets of colloquial-speaker vectors into an aggregate set of conditioned speaker vectors;
providing the aggregate set of conditioned speaker vectors to a text-to-speech (TTS) system implemented on one or more computing devices; and
training the TTS system using the provided aggregate set of conditioned speaker vectors.
Dependent claims: 24, 25, 26, 27, 28, 29, 30, 31, 32, 33
Specification