Inserting breath sounds into text-to-speech output
Abstract
A text-to-speech (TTS) system may be configured to incorporate breath sounds in the output speech. By incorporating breath sounds into speech output from text, a TTS system may be able to mimic more natural-sounding human speech, particularly for long-form narration of text longer than short phrases. The breath sounds may be stored as units for unit selection or may be generated during parametric synthesis. The acoustic features of the breath sounds and the duration between breaths may depend upon the punctuation of the text, the linguistic distance between breaths, the breaks between intonational phrases, the linguistic context of the breaths, and other factors.
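To illustrate two of the factors named in the abstract, punctuation and the linguistic distance between breaths, the following is a minimal Python sketch and not the patented implementation. It assumes that linguistic distance is measured in words, which is only one possible reading, and the names `BREATH_PUNCTUATION`, `find_breath_candidates`, and `linguistic_distance` are hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: these punctuation marks are treated as candidate breath points.
BREATH_PUNCTUATION = {".", ",", ";", ":", "?", "!"}


def find_breath_candidates(text: str) -> List[Tuple[int, str]]:
    """Return (word_index, punctuation) pairs for candidate breath locations.

    word_index is the number of words that precede the punctuation mark.
    """
    candidates = []
    word_count = 0
    # Split the text into words and single punctuation characters.
    for token in re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text):
        if token in BREATH_PUNCTUATION:
            candidates.append((word_count, token))
        elif re.match(r"[A-Za-z0-9']", token):
            word_count += 1
    return candidates


def linguistic_distance(first: Tuple[int, str], second: Tuple[int, str]) -> int:
    """Distance between two candidate locations, counted in words (an assumption)."""
    return abs(second[0] - first[0])
```

A real system might measure distance in syllables, phonemes, or intonational phrases instead; counting words simply keeps the sketch short.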
23 Claims
1. A computer-implemented method of generating speech including audible breath sounds, the method comprising:
receiving input text for text-to-speech (TTS) processing;
identifying punctuation in the input text;
determining a first location in the input text for insertion of a breath sound based at least in part on the punctuation;
determining a second location in the input text for insertion of a breath sound based at least in part on the punctuation;
determining a linguistic distance between the first location and second location;
using a cost function to identify a first breath unit for the first location, the cost function based at least in part on the identified punctuation, the linguistic distance between the first location and second location, and a linguistic context of the first location;
using the cost function to identify a second breath unit for the second location, the cost function based at least in part on the identified punctuation, the linguistic distance between the first location and second location, and a linguistic context of the second location; and
synthesizing speech corresponding to the input text, wherein the synthesized speech comprises a first breath sound corresponding to the first breath unit at the first location and a second breath sound corresponding to the second breath unit at the second location.
Dependent claims: 2, 3
4. A computing system, comprising:
at least one processor;
a memory device including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor:
to receive input text;
to identify a location in the input text for a breath;
to identify a duration for the breath in the location;
to determine a breath sound for the location, the breath sound determined from a plurality of breath sounds; and
to synthesize speech corresponding to the input text using the duration and data corresponding to the breath sound, the synthesized speech comprising the breath sound at substantially the location.
Dependent claims: 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A non-transitory computer-readable storage medium storing processor-executable instructions for controlling a computing device, comprising:
program code to receive input text;
program code to identify a location in the input text for a breath;
program code to identify a duration for the breath in the location;
program code to determine a breath sound for the location, the breath sound determined from a plurality of breath sounds; and
program code to synthesize speech corresponding to the input text using the duration and data corresponding to the breath sound, the synthesized speech comprising the breath sound at substantially the location.
Dependent claims: 15, 16, 17, 18, 19, 20, 21, 22, 23
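The claims recite choosing a breath sound from a plurality of breath sounds, in claim 1 by way of a cost function that depends on punctuation, linguistic distance, and linguistic context. The Python sketch below shows one way such a selection could look. It is illustrative only: the claims do not specify the form of the cost function, so the `BreathUnit` fields, the `TARGET_MS` durations, the weights, and the use of "precedes a new sentence" as the linguistic context are all hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BreathUnit:
    """A stored breath recording (hypothetical fields)."""
    name: str
    duration_ms: int
    is_deep: bool


# Hypothetical target breath durations, in milliseconds, by punctuation mark.
TARGET_MS = {",": 200, ";": 300, ":": 300, ".": 450, "?": 450, "!": 450}


def breath_cost(unit: BreathUnit, punctuation: str,
                distance_words: int, sentence_initial: bool) -> float:
    """Lower cost means a better match for this location (illustrative weights)."""
    target = TARGET_MS.get(punctuation, 250)
    # Assumption: a longer stretch between breaths calls for a longer breath.
    target += 10 * distance_words
    cost = float(abs(unit.duration_ms - target))
    # Assumption: the linguistic context is whether a new sentence follows;
    # a shallow breath before a new sentence is penalized.
    if sentence_initial and not unit.is_deep:
        cost += 100.0
    return cost


def select_breath_unit(units: List[BreathUnit], punctuation: str,
                       distance_words: int, sentence_initial: bool) -> BreathUnit:
    """Pick the lowest-cost breath unit from the inventory."""
    return min(units, key=lambda u: breath_cost(
        u, punctuation, distance_words, sentence_initial))
```

The selected unit's `duration_ms` could then stand in for the breath duration recited in claims 4 and 14, with the recording spliced into the synthesized audio at the chosen location.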
Specification