Text-to-speech synthesis using an autoencoder

US 10,249,289 B2
Filed: 07/13/2017
Issued: 04/02/2019
Est. Priority Date: 03/14/2017
Status: Active Grant

Abstract

Methods, systems, and computer-readable media for text-to-speech synthesis using an autoencoder. In some implementations, data indicating a text for text-to-speech synthesis is obtained. Data indicating a linguistic unit of the text is provided as input to an encoder. The encoder is configured to output speech unit representations indicative of acoustic characteristics based on linguistic information. A speech unit representation that the encoder outputs is received. A speech unit is selected to represent the linguistic unit, the speech unit being selected from among a collection of speech units based on the speech unit representation output by the encoder. Audio data for a synthesized utterance of the text that includes the selected speech unit is provided.
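As an illustration only, the flow described in the abstract can be sketched in Python. The lookup-table encoder, the two-dimensional vectors, the unit ids, and the Euclidean nearest-match rule are all assumptions made for this sketch. The encoder in the patent is a trained neural network, and the abstract does not fix a distance measure.

```python
import math

# Hypothetical stand-in for the trained encoder: maps a linguistic unit
# (here, a phone label) to a fixed-length speech unit representation.
TOY_ENCODER = {
    "h": [0.9, 0.1],
    "ay": [0.2, 0.8],
}

# Hypothetical collection of recorded speech units. Each entry carries a
# stored representation and a few audio samples (values are made up).
UNIT_COLLECTION = [
    {"id": "h_001", "embedding": [1.0, 0.0], "audio": [0.1, 0.2]},
    {"id": "h_002", "embedding": [0.5, 0.5], "audio": [0.3, 0.3]},
    {"id": "ay_001", "embedding": [0.1, 0.9], "audio": [0.4, 0.5]},
]


def encode(linguistic_unit):
    """Return the speech unit representation for one linguistic unit."""
    return TOY_ENCODER[linguistic_unit]


def select_unit(representation, collection):
    """Pick the stored unit whose embedding is closest (Euclidean, assumed)."""
    return min(collection, key=lambda u: math.dist(representation, u["embedding"]))


def synthesize(linguistic_units):
    """Select one unit per linguistic unit and concatenate their audio."""
    selected_ids = []
    audio = []
    for unit in linguistic_units:
        chosen = select_unit(encode(unit), UNIT_COLLECTION)
        selected_ids.append(chosen["id"])
        audio.extend(chosen["audio"])
    return selected_ids, audio
```

With this toy data, `synthesize(["h", "ay"])` selects `h_001` and `ay_001` and returns their samples joined in order.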

17 Claims

1. A method performed by one or more computers of a text-to-speech system, the method comprising:
- obtaining, by the one or more computers, data indicating a text for text-to-speech synthesis;
- providing, by the one or more computers, data indicating a linguistic unit of the text as input to an encoder, the encoder being configured to output speech unit representations indicative of acoustic characteristics based on linguistic information, wherein the encoder is configured to provide speech unit representations learned through machine learning training, wherein the encoder comprises a neural network that was trained as part of an autoencoder network that includes the encoder, a second encoder, and a decoder, wherein:
  - the encoder is arranged to produce speech unit representations in response to receiving data indicating linguistic units;
  - the second encoder is arranged to produce speech unit representations in response to receiving data indicating acoustic features of speech units; and
  - the decoder is arranged to generate output indicating acoustic features of speech units in response to receiving speech unit representations for the speech units from either of the encoder and the second encoder;
- receiving, by the one or more computers, a speech unit representation that the encoder outputs in response to receiving the data indicating the linguistic unit as input to the encoder;
- selecting, by the one or more computers, a speech unit to represent the linguistic unit, the speech unit being selected from among a collection of speech units based on the speech unit representation output by the encoder; and
- providing, by the one or more computers and as output of the text-to-speech system, audio data for a synthesized utterance of the text that includes the selected speech unit.

2. The method of claim 1, wherein the encoder is configured to provide speech unit representations of a same size to represent speech units having different durations.

3. The method of claim 1, wherein the encoder is trained to infer speech unit representations from linguistic unit identifiers, wherein the speech unit representations output by the encoder are vectors that have a same fixed length.

4. The method of claim 1, wherein the encoder comprises a trained neural network having one or more long-short-term memory layers.

5. The method of claim 1, wherein the encoder, the second encoder, and the decoder are trained jointly; and wherein the encoder, the second encoder, and the decoder each include one or more long short-term memory layers.

6. The method of claim 1, wherein the encoder, the second encoder, and the decoder are trained jointly using a cost function configured to minimize:
- differences between acoustic features input to the second encoder and acoustic features generated by the decoder; and
- differences between the speech unit representations of the encoder and the speech unit representations of the second encoder.

7. The method of claim 1, further comprising:
- selecting a set of candidate speech units for the linguistic unit based on vector distances between (i) a first vector that includes the speech unit representation output by the encoder and (ii) second vectors corresponding to speech units in the collection of speech units; and
- generating a lattice that includes nodes corresponding to the candidate speech units in the selected set of candidate speech units.

8. The method of claim 7, wherein selecting the set of candidate speech units comprises:
- identifying a predetermined quantity of second vectors that are nearest neighbors for the first vector; and
- selecting, as the set of candidate speech units, a set of speech units corresponding to the identified predetermined quantity of second vectors that are nearest neighbors for the first vector.

9. The method of claim 1, wherein the speech unit representation for the linguistic unit is a first speech unit representation for a first linguistic unit, wherein selecting the speech unit comprises:
- obtaining a second speech unit representation for a second linguistic unit that occurs immediately before or after the first linguistic unit in a phonetic representation of the text;
- generating a diphone unit representation by concatenating the first speech unit representation with the second speech unit representation; and
- selecting, to represent the first linguistic unit, a diphone speech unit identified based on the diphone speech unit representation.
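The candidate selection of claims 7 and 8 and the concatenation step of claim 9 might look roughly like the following Python sketch. It assumes Euclidean distance, a dictionary of stored unit vectors keyed by hypothetical unit ids, and a lattice represented simply as one list of candidate ids per linguistic unit. None of these choices is specified by the claims, and the function names are illustrative.

```python
import math


def nearest_candidates(first_vector, second_vectors, k):
    """Return ids of the k stored units nearest to first_vector.

    second_vectors maps a (hypothetical) unit id to its stored vector;
    k plays the role of the "predetermined quantity" in claim 8.
    """
    ranked = sorted(
        second_vectors,
        key=lambda unit_id: math.dist(first_vector, second_vectors[unit_id]),
    )
    return ranked[:k]


def build_lattice(target_vectors, second_vectors, k):
    """Build one column of candidate nodes per linguistic unit."""
    return [nearest_candidates(v, second_vectors, k) for v in target_vectors]


def diphone_representation(first_repr, second_repr):
    """Concatenate two adjacent unit representations into one vector."""
    return list(first_repr) + list(second_repr)


# Toy data, invented for this example.
stored = {
    "u1": [0.0, 0.0],
    "u2": [1.0, 0.0],
    "u3": [0.0, 2.0],
    "u4": [3.0, 3.0],
}
targets = [[0.9, 0.1], [0.0, 1.8]]
lattice = build_lattice(targets, stored, k=2)
```

On the toy data, each lattice column holds the two stored units closest to the corresponding target vector. A later search over the lattice (not shown) would pick one node per column.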

10. A system comprising:
- one or more computers; and
- one or more data storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising:
  - obtaining, by the one or more computers, data indicating a text for text-to-speech synthesis;
  - providing, by the one or more computers, data indicating a linguistic unit of the text as input to an encoder, the encoder being configured to output speech unit representations indicative of acoustic characteristics based on linguistic information, wherein the encoder is configured to provide speech unit representations learned through machine learning training, wherein the encoder comprises a neural network that was trained as part of an autoencoder network that includes the encoder, a second encoder, and a decoder, wherein:
    - the encoder is arranged to produce speech unit representations in response to receiving data indicating linguistic units;
    - the second encoder is arranged to produce speech unit representations in response to receiving data indicating acoustic features of speech units; and
    - the decoder is arranged to generate output indicating acoustic features of speech units in response to receiving speech unit representations for the speech units from either of the encoder and the second encoder;
  - receiving, by the one or more computers, a speech unit representation that the encoder outputs in response to receiving the data indicating the linguistic unit as input to the encoder;
  - selecting, by the one or more computers, a speech unit to represent the linguistic unit, the speech unit being selected from among a collection of speech units based on the speech unit representation output by the encoder; and
  - providing, by the one or more computers and as output of the text-to-speech system, audio data for a synthesized utterance of the text that includes the selected speech unit.

11. The system of claim 10, wherein the encoder is configured to provide speech unit representations of a same size to represent speech units having different durations.

12. The system of claim 10, wherein the encoder is trained to infer speech unit representations from linguistic unit identifiers, wherein the speech unit representations output by the encoder are vectors that have a same fixed length.

13. The system of claim 10, wherein the encoder comprises a trained neural network having one or more long-short-term memory layers.

14. One or more non-transitory computer-readable media storing instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising:
- obtaining, by the one or more computers, data indicating a text for text-to-speech synthesis;
- providing, by the one or more computers, data indicating a linguistic unit of the text as input to an encoder, the encoder being configured to output speech unit representations indicative of acoustic characteristics based on linguistic information, wherein the encoder is configured to provide speech unit representations learned through machine learning training, wherein the encoder comprises a neural network that was trained as part of an autoencoder network that includes the encoder, a second encoder, and a decoder, wherein:
  - the encoder is arranged to produce speech unit representations in response to receiving data indicating linguistic units;
  - the second encoder is arranged to produce speech unit representations in response to receiving data indicating acoustic features of speech units; and
  - the decoder is arranged to generate output indicating acoustic features of speech units in response to receiving speech unit representations for the speech units from either of the encoder and the second encoder;
- receiving, by the one or more computers, a speech unit representation that the encoder outputs in response to receiving the data indicating the linguistic unit as input to the encoder;
- selecting, by the one or more computers, a speech unit to represent the linguistic unit, the speech unit being selected from among a collection of speech units based on the speech unit representation output by the encoder; and
- providing, by the one or more computers and as output of the text-to-speech system, audio data for a synthesized utterance of the text that includes the selected speech unit.

15. The one or more non-transitory computer-readable media of claim 14, wherein the encoder is configured to provide speech unit representations of a same size to represent speech units having different durations.

16. The one or more non-transitory computer-readable media of claim 14, wherein the encoder is trained to infer speech unit representations from linguistic unit identifiers, wherein the speech unit representations output by the encoder are vectors that have a same fixed length.

17. The one or more non-transitory computer-readable media of claim 14, wherein the encoder comprises a trained neural network having one or more long-short-term memory layers.
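Each independent claim recites an autoencoder in which an encoder and a second encoder feed a shared decoder, and claim 6 describes a joint training cost with two terms. A minimal Python sketch of such a cost follows. It assumes both terms are plain sums of squared differences with equal weight, which the claims do not state. All names are hypothetical, and the sketch only evaluates the cost for given vectors rather than training any network.

```python
def squared_error(a, b):
    """Sum of squared element-wise differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def joint_cost(acoustic_input, decoder_output, linguistic_repr, acoustic_repr):
    """Toy joint cost with the two terms named in claim 6.

    acoustic_input / decoder_output: lists of per-frame feature vectors
        (the number of frames may vary with unit duration).
    linguistic_repr: fixed-length output of the encoder.
    acoustic_repr: fixed-length output of the second encoder.
    """
    # Term 1: how far the decoder's acoustic features are from the input.
    reconstruction = sum(
        squared_error(f_in, f_out)
        for f_in, f_out in zip(acoustic_input, decoder_output)
    )
    # Term 2: how far apart the two encoders' representations are.
    embedding_match = squared_error(linguistic_repr, acoustic_repr)
    return reconstruction + embedding_match


# Made-up numbers: two acoustic frames and two-dimensional representations.
cost = joint_cost(
    acoustic_input=[[1.0, 2.0], [3.0, 4.0]],
    decoder_output=[[1.0, 2.0], [3.0, 5.0]],
    linguistic_repr=[0.5, 0.5],
    acoustic_repr=[0.5, 1.5],
)
```

Driving the second term toward zero is what would let the encoder, which sees only linguistic input at synthesis time, produce representations comparable to those the second encoder derives from audio.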

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google LLC (Alphabet Inc.)
Inventors: Chun, Byung Ha; Gonzalvo, Javier; Chan, Chun-an; Agiomyrgiannakis, Ioannis; Leung Wan, Vincent Ping; Clark, Robert Andrew James; Vit, Jakub
Primary Examiner(s): Albertalli, Brian L
Application Number: US15/649,311
Publication Number: US 20180268806A1
Time in Patent Office: 628 Days
CPC Class Codes

G10L 13/027   Concept to speech synthesis...

G10L 13/047   Architecture of speech synt...

G10L 13/06   Elementary speech units use...

G10L 13/08   Text analysis or generation...

G10L 19/00   Speech or audio signals ana...

G10L 25/30   using neural networks
