Method and apparatus for speech synthesis without prosody modification
Abstract
A speech synthesizer is provided that concatenates stored samples of speech units without modifying the prosody of the samples. The present invention is able to achieve a high level of naturalness in synthesized speech with a carefully designed training speech corpus by storing samples based on the prosodic and phonetic context in which they occur. In particular, some embodiments of the present invention limit the training text to those sentences that will produce the most frequent sets of prosodic contexts for each speech unit. Further embodiments of the present invention also provide a multi-tier selection mechanism for selecting a set of samples that will produce the most natural sounding speech.
25 Claims
1. A method for synthesizing speech, the method comprising:
generating a training context vector for each of a set of training speech units in a training speech corpus, each training context vector indicating the prosodic context of a training speech unit in the training speech corpus;
indexing a set of speech segments associated with a set of training speech units based on the context vectors for the training speech units;
generating an input context vector for each of a set of input speech units in an input text, each input context vector indicating the prosodic context of an input speech unit in the input text;
using the input context vectors to find a speech segment for each input speech unit; and
concatenating the found speech segments to form a synthesized speech signal.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19
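The steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The context-vector fields, the choice to key the index on a (speech unit, context vector) pair, and the names `build_index`, `find_segment`, and `synthesize` are all assumptions made for illustration. Speech segments are represented as plain lists of sample values.

```python
from collections import defaultdict

def build_index(training_units):
    """Index speech segments by (speech unit, context vector).

    training_units: iterable of (unit, context_vector, segment) triples.
    The context vector is a hashable tuple; its fields are hypothetical.
    """
    index = defaultdict(list)
    for unit, context, segment in training_units:
        index[(unit, context)].append(segment)
    return index

def find_segment(index, unit, context):
    """Return the first stored segment for this unit and context, or None."""
    candidates = index.get((unit, context))
    return candidates[0] if candidates else None

def synthesize(index, input_units):
    """Concatenate the found segments, unmodified, into one sample list."""
    signal = []
    for unit, context in input_units:
        segment = find_segment(index, unit, context)
        if segment is None:
            raise KeyError(f"no segment for {unit!r} in context {context!r}")
        signal.extend(segment)
    return signal

# Toy data: context vector = (position in phrase, stress), both hypothetical.
training = [
    ("k", ("initial", "stressed"), [1, 2]),
    ("ae", ("medial", "stressed"), [3]),
    ("t", ("final", "unstressed"), [4, 5]),
    ("k", ("final", "unstressed"), [9]),
]
index = build_index(training)
signal = synthesize(index, [
    ("k", ("initial", "stressed")),
    ("ae", ("medial", "stressed")),
    ("t", ("final", "unstressed")),
])
```

An exact-match lookup is the simplest possible policy. The sketch raises an error when no stored segment matches, whereas a practical system would need some fallback for unseen contexts.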
14. A method of selecting sentences for reading into a training speech corpus used in speech synthesis, the method comprising:
identifying a set of prosodic context information for each of a set of speech units;
determining a frequency of occurrence for each distinct context vector that appears in a very large text corpus;
using the frequency of occurrence of the context vectors to identify a list of necessary context vectors; and
selecting sentences in the large text corpus for reading into the training speech corpus, each selected sentence containing at least one necessary context vector.
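The selection procedure of claim 14 can be sketched as follows, under stated assumptions. The claim does not say how frequency determines which context vectors are necessary, so the sketch uses a simple count threshold (`min_count`). The claim requires only that each selected sentence contain at least one necessary context vector. The sketch applies a stricter greedy rule, keeping a sentence only if it contributes a necessary vector not yet covered. Both choices, and all function names, are hypothetical.

```python
from collections import Counter

def necessary_context_vectors(corpus, min_count):
    """Count every context vector in the corpus and keep the frequent ones.

    corpus: list of (sentence_text, [context_vector, ...]) pairs.
    min_count: assumed threshold; the claim does not fix one.
    """
    counts = Counter(vec for _, vectors in corpus for vec in vectors)
    return {vec for vec, n in counts.items() if n >= min_count}

def select_sentences(corpus, necessary):
    """Greedily pick sentences that add a not-yet-covered necessary vector."""
    covered = set()
    selected = []
    for text, vectors in corpus:
        new = (set(vectors) & necessary) - covered
        if new:
            selected.append(text)
            covered |= new
    return selected

# Toy corpus with three hypothetical context vectors.
A = ("ae", "phrase-initial")
B = ("t", "phrase-final")
C = ("zh", "phrase-medial")
corpus = [("s1", [A, B]), ("s2", [A]), ("s3", [C]), ("s4", [B, A])]

necessary = necessary_context_vectors(corpus, min_count=2)
selected = select_sentences(corpus, necessary)
```

In this toy corpus the rare vector `C` falls below the threshold, so sentence `s3` is never selected, and `s1` alone covers both frequent vectors.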
20. A method of selecting speech segments for concatenative speech synthesis, the method comprising:
parsing an input text into speech units;
identifying context information for each speech unit based on its location in the input text and at least one neighboring speech unit;
identifying a set of candidate speech segments for each speech unit based on the context information; and
identifying a sequence of speech segments from the candidate speech segments based in part on a smoothness cost between the speech segments.
Dependent claims: 21, 22, 23, 24
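The final step of claim 20 can be illustrated with a dynamic-programming (Viterbi-style) search, a standard technique for this kind of problem. The claim itself does not name a search algorithm, so using one here is an assumption. The claim says the sequence is chosen "based in part" on the smoothness cost, but the sketch uses that cost alone. The segment representation (name, starting pitch, ending pitch) and the pitch-difference cost are hypothetical, and every unit is assumed to have at least one candidate.

```python
def best_sequence(candidates, smoothness_cost):
    """Pick one segment per unit, minimizing the summed smoothness cost.

    candidates: list with one non-empty list of segments per speech unit.
    smoothness_cost: function (previous_segment, current_segment) -> float.
    """
    if not candidates:
        return []
    prev_costs = [0.0] * len(candidates[0])
    backpointers = []
    for i in range(1, len(candidates)):
        costs, pointers = [], []
        for cur in candidates[i]:
            # Cost of reaching `cur` from each candidate of the previous unit.
            options = [
                prev_costs[k] + smoothness_cost(prev, cur)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_best = min(range(len(options)), key=options.__getitem__)
            costs.append(options[k_best])
            pointers.append(k_best)
        backpointers.append(pointers)
        prev_costs = costs
    # Trace back from the cheapest final candidate.
    j = min(range(len(prev_costs)), key=prev_costs.__getitem__)
    path = [j]
    for pointers in reversed(backpointers):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]

def pitch_jump(prev, cur):
    """Hypothetical smoothness cost: pitch mismatch at the join."""
    return abs(prev[2] - cur[1])

# Each segment: (name, starting pitch, ending pitch), all made up.
candidates = [
    [("a1", 100, 120), ("a2", 100, 150)],
    [("b1", 150, 140), ("b2", 118, 130)],
    [("c1", 131, 100)],
]
sequence = best_sequence(candidates, pitch_jump)
chosen_names = [seg[0] for seg in sequence]
```

Choosing greedily at each join would pick `a2` then `b1`, whose join costs nothing, and then pay 9 at the last join. The full search instead finds `a1`, `b2`, `c1` with a total cost of 3.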
25. A computer-readable medium having computer executable instructions for synthesizing speech from speech segments based on speech units found in an input text, the speech being synthesized through a method comprising steps of:
identifying context information for each speech unit based on the prosodic structure of the input text;
identifying a set of candidate speech segments for each speech unit based on the context information;
identifying a sequence of speech segments from the candidate speech segments;
concatenating the sequence of speech segments without modifying the prosody of the speech segments to form the synthesized speech.
Specification