Method and system for building text-to-speech voice from diverse recordings

US 9,542,927 B2
Filed: 11/13/2014
Issued: 01/10/2017
Est. Priority Date: 11/13/2014
Status: Active Grant

Abstract

A method and system are disclosed for building a speech database for a text-to-speech (TTS) synthesis system from multiple speakers recorded under diverse conditions. For a plurality of utterances of a reference speaker, a set of reference-speaker vectors may be extracted, and for each of a plurality of utterances of a colloquial speaker, a respective set of colloquial-speaker vectors may be extracted. A matching procedure, carried out under a transform that compensates for speaker differences, may be used to match each colloquial-speaker vector to a reference-speaker vector. The colloquial-speaker vector may be replaced with the matched reference-speaker vector. The matching-and-replacing can be carried out separately for each set of colloquial-speaker vectors. A conditioned set of speaker vectors can then be constructed by aggregating all the replaced speaker vectors. The conditioned set of speaker vectors can be used to train the TTS system.
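The match-replace-aggregate flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented procedure: the compensating transform is assumed here to be a single additive offset per colloquial speaker, re-estimated after each nearest-neighbour matching pass, and the names `match_and_replace` and `build_conditioned_set` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_and_replace(colloquial: List[Vector], reference: List[Vector],
                      iterations: int = 5) -> List[Vector]:
    """Return, for each colloquial vector, its best-matching reference vector.

    ASSUMPTION: the speaker-compensating transform is one additive offset
    per speaker. The source text does not prescribe this transform.
    """
    dim = len(reference[0])
    offset = [0.0] * dim
    matches: List[Vector] = []
    for _ in range(iterations):
        # Matching step: nearest reference vector to each shifted vector.
        matches = []
        for vec in colloquial:
            shifted = [c + o for c, o in zip(vec, offset)]
            matches.append(min(reference, key=lambda r: sq_dist(shifted, r)))
        # Transform step: offset becomes the mean (match - original) difference.
        offset = [
            sum(m[d] - v[d] for m, v in zip(matches, colloquial)) / len(colloquial)
            for d in range(dim)
        ]
    return matches


def build_conditioned_set(speaker_sets: List[List[Vector]],
                          reference: List[Vector]) -> List[Vector]:
    """Match each speaker's set separately, then aggregate the replacements."""
    conditioned: List[Vector] = []
    for vectors in speaker_sets:  # one colloquial speaker at a time
        conditioned.extend(match_and_replace(vectors, reference))
    return conditioned
```

Each speaker gets its own offset because the sets are processed separately, which mirrors the statement that matching-and-replacing is carried out per set. An alternating scheme like this can settle on a poor local match, so it should be read only as an illustration of the data flow.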

33 Claims

1. A method comprising:
- extracting speech features from a plurality of recorded reference speech utterances of a reference speaker to generate a reference set of reference-speaker vectors;
  
  for each respective plurality of recorded colloquial speech utterances of a respective colloquial speaker of multiple colloquial speakers, extracting speech features from the recorded colloquial speech utterances of the respective colloquial speaker to generate a respective set of colloquial-speaker vectors;
  
  for each respective set of colloquial-speaker vectors, replacing each colloquial-speaker vector of the respective set of colloquial-speaker vectors with a respective, optimally-matched reference-speaker vector from among the reference set of reference-speaker vectors, the respective, optimally-matched reference-speaker vector being identified by matching under a transform that compensates for differences in speech between the reference speaker and the respective colloquial speaker;
  
  aggregating the replaced colloquial-speaker vectors of all the respective sets of colloquial-speaker vectors into an aggregate set of conditioned speaker vectors;
  
  providing the aggregate set of conditioned speaker vectors to a text-to-speech (TTS) system implemented on one or more computing devices; and
  
  training the TTS system using the provided aggregate set of conditioned speaker vectors.
  - 2. The method of claim 1, wherein each given colloquial-speaker vector of each respective set of colloquial-speaker vectors has an associated enriched transcription derived from a respective text string associated with a particular recorded colloquial speech utterance from which the given colloquial-speaker vector was extracted, and wherein replacing each colloquial-speaker vector of the respective set of colloquial-speaker vectors with the respective, optimally-matched reference-speaker vector comprises:
    - for each given colloquial-speaker vector of the respective set of colloquial-speaker vectors that is replaced, retaining its associated enriched transcription.
  - 3. The method of claim 2, wherein aggregating the replaced colloquial-speaker vectors of all the respective sets of colloquial-speaker vectors into the aggregate set of conditioned speaker vectors comprises constructing a TTS system speech corpus that includes the replaced colloquial-speaker vectors of all the respective sets of colloquial-speaker vectors and the retained enriched transcriptions associated with each given colloquial-speaker vector that was replaced.
  - 4. The method of claim 1, wherein, for each respective set of colloquial-speaker vectors, replacing each colloquial-speaker vector of the respective set of colloquial-speaker vectors with the respective, optimally-matched reference-speaker vector from among the reference set of reference-speaker vectors comprises:
    - individually matching all of the colloquial-speaker vectors of each respective set with their respective, optimally-matched reference-speaker vectors, one respective set at a time.
  - 5. The method of claim 1, wherein extracting speech features from the plurality of recorded reference speech utterances of the reference speaker comprises decomposing the recorded reference speech utterances of the reference speaker into reference temporal frames of parameterized reference speech units, each reference temporal frame corresponding to a respective reference-speaker vector of speech features that include at least one of spectral envelope parameters, aperiodicity envelope parameters, fundamental frequencies, or voicing, of a respective reference speech unit, and wherein extracting speech features from the recorded colloquial speech utterances of the respective colloquial speaker comprises decomposing the recorded colloquial speech utterances of the respective colloquial speaker into colloquial temporal frames of parameterized colloquial speech units, each colloquial temporal frame corresponding to a respective colloquial-speaker vector of speech features that include at least one of spectral envelope parameters, aperiodicity envelope parameters, fundamental frequencies, or voicing, of a respective colloquial speech unit.
  - 6. The method of claim 5, wherein replacing each colloquial-speaker vector of the respective set of colloquial-speaker vectors with the respective, optimally-matched reference-speaker vector from among the reference set of reference-speaker vectors comprises:
    - for each respective colloquial-speaker vector, determining an optimal match between the speech features of the respective colloquial-speaker vector and the speech features of a particular one of the reference-speaker vectors, the optimal match being determined under a transform that compensates for differences in speech between the reference speaker and the respective colloquial speaker; and
      
      for each respective colloquial-speaker vector, replacing the speech features of the respective colloquial-speaker vector with the speech features of the determined particular one of the reference-speaker vectors.
  - 7. The method of claim 5, wherein the spectral envelope parameters of each vector of reference speech features are Mel Cepstral coefficients, Line Spectral Pairs, Linear Predictive coefficients, or Mel-Generalized Cepstral Coefficients, and further include indicia of first and second time derivatives of the spectral envelope parameters, and wherein the spectral envelope parameters of each vector of colloquial speech features are Mel Cepstral coefficients, Line Spectral Pairs, Linear Predictive coefficients, or Mel-Generalized Cepstral Coefficients, and further include indicia of first and second time derivatives of the spectral envelope parameters.
  - 8. The method of claim 5, wherein the reference speech units each correspond to one of a phoneme or a triphone, and wherein the colloquial speech units each correspond to one of a phoneme or a triphone.
  - 9. The method of claim 1, wherein the recorded reference speech utterances of the reference speaker are in a reference language and the colloquial speech utterances of all the respective colloquial speakers are all in a colloquial language, and wherein the colloquial language is lexically related to the reference language.
  - 10. The method of claim 9, wherein the colloquial language differs from the reference language.
  - 11. The method of claim 9, wherein training the TTS system using the provided aggregate set of conditioned speaker vectors comprises training the TTS system to synthesize speech in the colloquial language and in a voice of the reference speaker.
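Claims 5 and 7 describe per-frame feature vectors that carry spectral envelope parameters together with indicia of their first and second time derivatives. The sketch below shows one common way such dynamic features are appended to a frame sequence. The central-difference formula with repeated edge frames is an assumption chosen for illustration, not something the claims specify, and the function names are hypothetical.

```python
from typing import List

Frame = List[float]


def deltas(frames: List[Frame]) -> List[Frame]:
    """Central-difference time derivative; edge frames are repeated as padding."""
    n = len(frames)
    out: List[Frame] = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def with_dynamic_features(frames: List[Frame]) -> List[Frame]:
    """Append first and second derivatives to each frame's static parameters."""
    d1 = deltas(frames)   # first time derivative
    d2 = deltas(d1)       # second time derivative, taken over the first
    return [s + a + b for s, a, b in zip(frames, d1, d2)]
```

For a frame with N static parameters, the resulting vector has 3N entries: static values, then first derivatives, then second derivatives.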

12. A system comprising:
- one or more processors;
  
  memory; and
  
  machine-readable instructions stored in the memory, that upon execution by the one or more processors cause the system to carry out operations including:
  
  extracting speech features from a plurality of recorded reference speech utterances of a reference speaker to generate a reference set of reference-speaker vectors,
  
  for each respective plurality of recorded colloquial speech utterances of a respective colloquial speaker of multiple colloquial speakers, extracting speech features from the recorded colloquial speech utterances of the respective colloquial speaker to generate a respective set of colloquial-speaker vectors,
  
  for each respective set of colloquial-speaker vectors, replacing each colloquial-speaker vector of the respective set of colloquial-speaker vectors with a respective, optimally-matched reference-speaker vector from among the reference set of reference-speaker vectors, wherein the respective, optimally-matched reference-speaker vector is identified by matching under a transform that compensates for differences in speech between the reference speaker and the respective colloquial speaker,
  
  aggregating the replaced colloquial-speaker vectors of all the respective sets of colloquial-speaker vectors into an aggregate set of conditioned speaker vectors,
  
  providing the aggregate set of conditioned speaker vectors to a text-to-speech (TTS) system, and
  
  training the TTS system using the provided aggregate set of conditioned speaker vectors.
  - 13. The system of claim 12, wherein each given colloquial-speaker vector of each respective set of colloquial-speaker vectors has an associated enriched transcription derived from a respective text string associated with a particular recorded colloquial speech utterance from which the given colloquial-speaker vector was extracted, and wherein replacing each colloquial-speaker vector of the respective set of colloquial-speaker vectors with the respective, optimally-matched reference-speaker vector comprises:
    - for each given colloquial-speaker vector of the respective set of colloquial-speaker vectors that is replaced, retaining its associated enriched transcription.
  - 14. The system of claim 13, wherein aggregating the replaced colloquial-speaker vectors of all the respective sets of colloquial-speaker vectors into the aggregate set of conditioned speaker vectors comprises constructing a TTS system speech corpus that includes the replaced colloquial-speaker vectors of all the respective sets of colloquial-speaker vectors and the retained enriched transcriptions associated with each given colloquial-speaker vector that was replaced.
  - 15. The system of claim 12, wherein, for each respective set of colloquial-speaker vectors, replacing each colloquial-speaker vector of the respective set of colloquial-speaker vectors with the respective, optimally-matched reference-speaker vector from among the reference set of reference-speaker vectors comprises:
    - individually matching all of the colloquial-speaker vectors of each respective set with their respective, optimally-matched reference-speaker vectors, one respective set at a time.
  - 16. The system of claim 12, wherein extracting speech features from the plurality of recorded reference speech utterances of the reference speaker comprises decomposing the recorded reference speech utterances of the reference speaker into reference temporal frames of parameterized reference speech units, wherein each reference temporal frame corresponds to a respective reference-speaker vector of speech features that include at least one of spectral envelope parameters, aperiodicity envelope parameters, fundamental frequencies, or voicing, of a respective reference speech unit, and wherein extracting speech features from the recorded colloquial speech utterances of the respective colloquial speaker comprises decomposing the recorded colloquial speech utterances of the respective colloquial speaker into colloquial temporal frames of parameterized colloquial speech units, wherein each colloquial temporal frame corresponds to a respective colloquial-speaker vector of speech features that include at least one of spectral envelope parameters, aperiodicity envelope parameters, fundamental frequencies, or voicing, of a respective colloquial speech unit.
  - 17. The system of claim 16, wherein replacing each colloquial-speaker vector of the respective set of colloquial-speaker vectors with the respective, optimally-matched reference-speaker vector from among the reference set of reference-speaker vectors comprises:
    - for each respective colloquial-speaker vector, determining an optimal match between the speech features of the respective colloquial-speaker vector and the speech features of a particular one of the reference-speaker vectors, wherein the optimal match is determined under a transform that compensates for differences in speech between the reference speaker and the respective colloquial speaker; and
      
      for each respective colloquial-speaker vector, replacing the speech features of the respective colloquial-speaker vector with the speech features of the determined particular one of the reference-speaker vectors.
  - 18. The system of claim 16, wherein the spectral envelope parameters of each vector of reference speech features are Mel Cepstral coefficients, Line Spectral Pairs, Linear Predictive coefficients, or Mel-Generalized Cepstral Coefficients, and further include indicia of first and second time derivatives of the spectral envelope parameters, and wherein the spectral envelope parameters of each vector of colloquial speech features are Mel Cepstral coefficients, Line Spectral Pairs, Linear Predictive coefficients, or Mel-Generalized Cepstral Coefficients, and further include indicia of first and second time derivatives of the spectral envelope parameters.
  - 19. The system of claim 16, wherein the reference speech units each correspond to one of a phoneme or a triphone, and wherein the colloquial speech units each correspond to one of a phoneme or a triphone.
  - 20. The system of claim 12, wherein the recorded reference speech utterances of the reference speaker are in a reference language and the colloquial speech utterances of all the respective colloquial speakers are all in a colloquial language, and wherein the colloquial language is lexically related to the reference language.
  - 21. The system of claim 20, wherein the colloquial language differs from the reference language.
  - 22. The system of claim 20, wherein training the TTS system using the provided aggregate set of conditioned speaker vectors comprises training the TTS system to synthesize speech in the colloquial language and in a voice of the reference speaker.
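Claims 13 and 14 require that each replaced colloquial-speaker vector keep its enriched transcription, so that the resulting corpus pairs reference-speaker features with colloquial-speaker labels. A minimal sketch of that bookkeeping follows. `LabeledVector`, `build_corpus`, and the `match_set` callable (a stand-in for whatever matching-under-transform routine is used) are hypothetical names, and the string form of the transcription is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple

Features = Tuple[float, ...]


@dataclass(frozen=True)
class LabeledVector:
    features: Features
    transcription: str  # enriched transcription; string format is assumed


def build_corpus(
    speaker_sets: List[List[LabeledVector]],
    match_set: Callable[[List[Features]], List[Features]],
) -> List[LabeledVector]:
    """Swap in matched reference features while retaining each transcription.

    match_set receives one speaker's feature vectors and returns the matched
    reference-speaker features in the same order (hypothetical interface).
    """
    corpus: List[LabeledVector] = []
    for vectors in speaker_sets:  # one respective set at a time
        matched = match_set([v.features for v in vectors])
        # dataclasses.replace copies the record, changing only the features.
        corpus.extend(replace(v, features=m) for v, m in zip(vectors, matched))
    return corpus
```

Using an immutable record makes the intent explicit: only the speech features change during replacement, and the label travels with the vector into the aggregated corpus.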

23. An article of manufacture including a non-transitory computer-readable storage medium having stored thereon program instructions that, upon execution by one or more processors of a system, cause the system to perform operations comprising:
- extracting speech features from a plurality of recorded reference speech utterances of a reference speaker to generate a reference set of reference-speaker vectors;
  
  for each respective plurality of recorded colloquial speech utterances of a respective colloquial speaker of multiple colloquial speakers, extracting speech features from the recorded colloquial speech utterances of the respective colloquial speaker to generate a respective set of colloquial-speaker vectors;
  
  for each respective set of colloquial-speaker vectors, replacing each colloquial-speaker vector of the respective set of colloquial-speaker vectors with a respective, optimally-matched reference-speaker vector from among the reference set of reference-speaker vectors, wherein the respective, optimally-matched reference-speaker vector is identified by matching under a transform that compensates for differences in speech between the reference speaker and the respective colloquial speaker;
  
  aggregating the replaced colloquial-speaker vectors of all the respective sets of colloquial-speaker vectors into an aggregate set of conditioned speaker vectors;
  
  providing the aggregate set of conditioned speaker vectors to a text-to-speech (TTS) system implemented on one or more computing devices; and
  
  training the TTS system using the provided aggregate set of conditioned speaker vectors.
  - 24. The article of manufacture of claim 23, wherein each given colloquial-speaker vector of each respective set of colloquial-speaker vectors has an associated enriched transcription derived from a respective text string associated with a particular recorded colloquial speech utterance from which the given colloquial-speaker vector was extracted, and wherein replacing each colloquial-speaker vector of the respective set of colloquial-speaker vectors with the respective, optimally-matched reference-speaker vector comprises:
    - for each given colloquial-speaker vector of the respective set of colloquial-speaker vectors that is replaced, retaining its associated enriched transcription.
  - 25. The article of manufacture of claim 24, wherein aggregating the replaced colloquial-speaker vectors of all the respective sets of colloquial-speaker vectors into the aggregate set of conditioned speaker vectors comprises constructing a TTS system speech corpus that includes the replaced colloquial-speaker vectors of all the respective sets of colloquial-speaker vectors and the retained enriched transcriptions associated with each given colloquial-speaker vector that was replaced.
  - 26. The article of manufacture of claim 23, wherein, for each respective set of colloquial-speaker vectors, replacing each colloquial-speaker vector of the respective set of colloquial-speaker vectors with the respective, optimally-matched reference-speaker vector from among the reference set of reference-speaker vectors comprises:
    - individually matching all of the colloquial-speaker vectors of each respective set with their respective, optimally-matched reference-speaker vectors, one respective set at a time.
  - 27. The article of manufacture of claim 23, wherein extracting speech features from the plurality of recorded reference speech utterances of the reference speaker comprises decomposing the recorded reference speech utterances of the reference speaker into reference temporal frames of parameterized reference speech units, wherein each reference temporal frame corresponds to a respective reference-speaker vector of speech features that include at least one of spectral envelope parameters, aperiodicity envelope parameters, fundamental frequencies, or voicing, of a respective reference speech unit, and wherein extracting speech features from the recorded colloquial speech utterances of the respective colloquial speaker comprises decomposing the recorded colloquial speech utterances of the respective colloquial speaker into colloquial temporal frames of parameterized colloquial speech units, wherein each colloquial temporal frame corresponds to a respective colloquial-speaker vector of speech features that include at least one of spectral envelope parameters, aperiodicity envelope parameters, fundamental frequencies, or voicing, of a respective colloquial speech unit.
  - 28. The article of manufacture of claim 27, wherein replacing each colloquial-speaker vector of the respective set of colloquial-speaker vectors with the respective, optimally-matched reference-speaker vector from among the reference set of reference-speaker vectors comprises:
    - for each respective colloquial-speaker vector, determining an optimal match between the speech features of the respective colloquial-speaker vector and the speech features of a particular one of the reference-speaker vectors, wherein the optimal match is determined under a transform that compensates for differences in speech between the reference speaker and the respective colloquial speaker; and
      
      for each respective colloquial-speaker vector, replacing the speech features of the respective colloquial-speaker vector with the speech features of the determined particular one of the reference-speaker vectors.
  - 29. The article of manufacture of claim 27, wherein the spectral envelope parameters of each vector of reference speech features are Mel Cepstral coefficients, Line Spectral Pairs, Linear Predictive coefficients, or Mel-Generalized Cepstral Coefficients, and further include indicia of first and second time derivatives of the spectral envelope parameters, and wherein the spectral envelope parameters of each vector of colloquial speech features are Mel Cepstral coefficients, Line Spectral Pairs, Linear Predictive coefficients, or Mel-Generalized Cepstral Coefficients, and further include indicia of first and second time derivatives of the spectral envelope parameters.
  - 30. The article of manufacture of claim 27, wherein the reference speech units each correspond to one of a phoneme or a triphone, and wherein the colloquial speech units each correspond to one of a phoneme or a triphone.
  - 31. The article of manufacture of claim 23, wherein the recorded reference speech utterances of the reference speaker are in a reference language and the colloquial speech utterances of all the respective colloquial speakers are all in a colloquial language, and wherein the colloquial language is lexically related to the reference language.
  - 32. The article of manufacture of claim 31, wherein the colloquial language differs from the reference language.
  - 33. The article of manufacture of claim 31, wherein training the TTS system using the provided aggregate set of conditioned speaker vectors comprises training the TTS system to synthesize speech in the colloquial language and in a voice of the reference speaker.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Agiomyrgiannakis, Ioannis; Gutkin, Alexander
Primary Examiner(s): JACKSON, JAKIEDA R
Application Number: US14/540,088
Publication Number: US 20160140951A1
Time in Patent Office: 789 Days
Field of Search: 704/2, 704/260
US Class Current: 1/1
CPC Class Codes:
G10L 13/02   Methods for producing synth...
G10L 13/06   Elementary speech units use...
G10L 25/03   characterised by the type o...
