Speech synthesis using deep neural networks

  • US 8,527,276 B1
  • Filed: 10/25/2012
  • Issued: 09/03/2013
  • Est. Priority Date: 10/25/2012
  • Status: Active Grant
First Claim

1. A method comprising:

  • training a neural network implemented by one or more processors of a system to map one or more training-time sequences of phonetic-context descriptors received by the neural network into training-time predicted feature vectors that correspond to acoustic properties of predefined speech waveforms, wherein the one or more training-time sequences of phonetic-context descriptors correspond to phonetic transcriptions of training-time text strings, and the training-time text strings correspond to written transcriptions of speech carried in the predefined speech waveforms;

    receiving, by a text analysis module implemented by the one or more processors, a run-time text string;

    processing the received run-time text string with the text analysis module to generate a run-time sequence of phonetic-context descriptors that corresponds to a phonetic transcription of the run-time text string, wherein each phonetic-context descriptor of the run-time sequence includes a respective label identifying a phonetic speech unit of a plurality of phonetic speech units, data indicating phonetic context of the identified phonetic speech unit, and data indicating time duration of the identified phonetic speech unit;

    processing the run-time sequence of the phonetic-context descriptors with the trained neural network in a corresponding sequence of neural network time steps to generate one or more run-time predicted feature vectors; and

    processing the one or more run-time predicted feature vectors with a signal generation module to produce and output a run-time speech waveform corresponding to a spoken rendering of the received run-time text string,

    wherein processing the received run-time text string with the text analysis module to generate the run-time sequence of phonetic-context descriptors comprises:

    generating a run-time transcription sequence of phonetic speech units that corresponds to the phonetic transcription of the run-time text string; and

    determining a respective number of consecutive phonetic-context descriptors to generate for each of the phonetic speech units of the run-time transcription sequence.
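
The final two steps of the claim (generating a transcription sequence, then determining how many consecutive phonetic-context descriptors to generate per phonetic speech unit) can be illustrated with a minimal sketch. The claim does not say how the count is determined; the sketch assumes it is derived from each unit's duration divided by a fixed frame step. The names `PhoneticContextDescriptor`, `descriptors_per_unit`, `expand_units`, the 5 ms frame step, and the `"sil"` boundary label are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

FRAME_MS = 5  # assumed frame step; the claim does not specify one


@dataclass(frozen=True)
class PhoneticContextDescriptor:
    """Hypothetical container for the three items the claim lists."""
    label: str        # label identifying the phonetic speech unit
    left: str         # phonetic context: preceding unit
    right: str        # phonetic context: following unit
    duration_ms: int  # time duration of the identified unit
    index: int        # position of this descriptor within its unit


def descriptors_per_unit(duration_ms: int, frame_ms: int = FRAME_MS) -> int:
    """Assumed rule: one descriptor per frame step, rounded up, minimum one."""
    return max(1, -(-duration_ms // frame_ms))  # ceiling division


def expand_units(
    units: List[Tuple[str, int]], frame_ms: int = FRAME_MS
) -> List[PhoneticContextDescriptor]:
    """Turn (label, duration_ms) pairs into a run of consecutive descriptors."""
    # Pad with an assumed silence label so every unit has both neighbours.
    labels = ["sil"] + [label for label, _ in units] + ["sil"]
    sequence: List[PhoneticContextDescriptor] = []
    for i, (label, duration_ms) in enumerate(units):
        count = descriptors_per_unit(duration_ms, frame_ms)
        for k in range(count):
            sequence.append(
                PhoneticContextDescriptor(
                    label=label,
                    left=labels[i],
                    right=labels[i + 2],
                    duration_ms=duration_ms,
                    index=k,
                )
            )
    return sequence


# Example transcription sequence with made-up durations in milliseconds.
example_units = [("k", 10), ("ae", 23), ("t", 5)]
example_sequence = expand_units(example_units)
```

Under these assumptions, each descriptor in `example_sequence` would be consumed by the trained neural network at one time step; the network and the signal generation module are outside the scope of this sketch.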
