Methods for generating pitch and duration contours in a text to speech system
Abstract
A method for automatically generating pitch contours in a text to speech (TtS) system, the system converting input text into an output acoustic signal simulating natural speech, the method comprising the steps of: storing a plurality of associated stress and pitch level pairs, each of the plurality of pairs including a lexical stress level and a pitch level; calculating lexical stress levels of the input text; comparing the stress levels of the input text to the stored stress levels of the plurality of associated stress and pitch level pairs to find the stored stress levels closest to the stress levels of the input text; and copying the pitch levels associated with the closest stored stress levels of the stress and pitch level pairs to generate the pitch contours of the input text. Features illustrative of various modes of the invention include stress and pitch level pairs that correspond with the end of vowels, use of a phonetic dictionary to expand words to phonemes and concatenate stress levels, blocking sentences and the stress contours into constant or variable lengths by segmenting from the ends toward the beginnings, and averaging at the block boundary. The method may distinguish among declarations, questions, and exclamations. Training text may be collected from more than one speaker and scaled; the speaker(s) may wear a laryngograph to provide vocal cord activity.
41 Claims
1. A method for generating pitch contours in a text to speech (TtS) system, the system converting input text into an output acoustic signal simulating natural speech, the method comprising the steps of:
(a) storing a plurality of associated stress and pitch level pairs, each of the plurality of pairs including a lexical stress level and a pitch level;
(b) determining lexical stress levels of the input text;
(c) comparing the stress levels of the input text to the stored stress levels of the plurality of associated stress and pitch level pairs to find the stored stress levels closest to the stress levels of the input text; and
(d) copying the pitch levels associated with the closest stress levels of the stress and pitch level pairs to generate the pitch contours of the input text.
Dependent claims: 2-34.
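Steps (a) through (d) of claim 1 can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the claim: the numeric stress encoding (0 = unstressed, 1 = secondary, 2 = primary), the sum-of-absolute-differences distance used to decide which stored stress levels are "closest", the grouping of stored pairs into equal-length blocks, and the names `closest_block` and `pitch_contour`.

```python
from typing import List, Tuple

# One stored pair: (lexical stress level, pitch level in Hz).
# Hypothetical encoding: 0 = unstressed, 1 = secondary, 2 = primary.
Pair = Tuple[int, float]


def closest_block(input_stress: List[int], stored: List[List[Pair]]) -> List[Pair]:
    """Step (c): return the stored block whose stress levels are closest to the input."""
    candidates = [block for block in stored if len(block) == len(input_stress)]
    if not candidates:
        raise ValueError("no stored block has the same length as the input")

    def distance(block: List[Pair]) -> int:
        # Assumed metric: sum of absolute stress-level differences.
        return sum(abs(s - stored_s) for s, (stored_s, _) in zip(input_stress, block))

    return min(candidates, key=distance)


def pitch_contour(input_stress: List[int], stored: List[List[Pair]]) -> List[float]:
    """Step (d): copy the pitch levels of the closest stored block."""
    return [pitch for _, pitch in closest_block(input_stress, stored)]


# Step (a): stored stress and pitch level pairs (values invented for the example).
stored_blocks = [
    [(2, 120.0), (0, 100.0), (1, 110.0)],
    [(0, 95.0), (2, 130.0), (0, 90.0)],
]

# Step (b) is assumed to have produced this stress sequence for the input text.
print(pitch_contour([0, 2, 0], stored_blocks))
```

A different distance measure, or a tie-breaking rule among equally close blocks, would fit the claim language equally well; the sketch simply takes the first minimum.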
35. A method for generating duration contours in a text to speech (TtS) system, the system converting input text into an output acoustic signal simulating natural speech, the input text including a plurality of phonemes, the method comprising the steps of:
determining lexical stress levels of the input text; and
adjusting the durations of the phonemes of the input text by multiplying the durations of each of the plurality of phonemes having a stress level corresponding to primary or secondary lexical stress by a first or a second factor, respectively.
Dependent claims: 36-39.
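The duration adjustment of claim 35 amounts to a per-phoneme multiplication, which the following sketch shows. The stress encoding (0 = unstressed, 1 = secondary, 2 = primary), the default factor values, and the name `adjust_durations` are hypothetical; the claim does not state numeric factors or say how unstressed phonemes are treated, so the sketch leaves them unchanged.

```python
from typing import List, Tuple

# Hypothetical stress encoding, not taken from the claim.
UNSTRESSED, SECONDARY, PRIMARY = 0, 1, 2


def adjust_durations(
    phonemes: List[Tuple[int, float]],
    first_factor: float = 1.2,   # placeholder value for primary stress
    second_factor: float = 1.1,  # placeholder value for secondary stress
) -> List[float]:
    """Scale each phoneme duration according to its lexical stress level.

    `phonemes` is a list of (stress level, duration in ms) tuples.
    Phonemes with any other stress level keep their original duration.
    """
    factors = {PRIMARY: first_factor, SECONDARY: second_factor}
    return [duration * factors.get(stress, 1.0) for stress, duration in phonemes]


# Usage: three phonemes with primary, secondary, and no stress.
print(adjust_durations([(PRIMARY, 100.0), (SECONDARY, 80.0), (UNSTRESSED, 60.0)],
                       first_factor=1.5, second_factor=1.25))
```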
40. A method for generating pitch contours in a text to speech (TtS) system, the system converting input text into an output acoustic signal simulating natural speech, the input text including a plurality of input sentences, the method comprising the steps of:
storing a plurality of associated pitch and lexical stress level pairs based on a plurality of training sentences;
determining a stress contour of each of the plurality of input sentences;
segmenting the stress contours of the input and training sentences into a plurality of stress contour input blocks and stress contour training blocks, respectively, by aligning the ends of the input and training stress contours and respectively segmenting the input and training stress contours from the ends towards the beginnings, the ends of the stress contours corresponding to the ends of the sentences;
respectively comparing the stress levels of each of the plurality of input blocks to the stress levels of each of the aligned training blocks to obtain a sequence of training blocks having the closest stress levels to the compared input blocks for each of the plurality of input sentences; and
concatenating the pitch levels of the stress and pitch level pairs associated with the sequence of training blocks for each of the plurality of input sentences to form pitch contours for each of the plurality of input sentences.
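The end-aligned segmentation and block matching of claim 40 can be sketched as follows, for a single input sentence. The fixed block length, the sum-of-absolute-differences distance, the stress encoding (0 = unstressed, 1 = secondary, 2 = primary), and the names `segment_from_end` and `pitch_for_sentence` are assumptions made for illustration; the claim itself does not fix any of them.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, float]  # (lexical stress level, pitch level in Hz)


def segment_from_end(contour: Sequence, block_len: int) -> List[list]:
    """Cut a contour into blocks starting at its end and moving toward its beginning.

    Blocks are returned in sentence order; only the first one may be short.
    """
    if block_len <= 0:
        raise ValueError("block_len must be positive")
    blocks = []
    end = len(contour)
    while end > 0:
        start = max(0, end - block_len)
        blocks.append(list(contour[start:end]))
        end = start
    blocks.reverse()
    return blocks


def pitch_for_sentence(input_stress: List[int],
                       training: List[List[Pair]],
                       block_len: int) -> List[float]:
    """Match each input block to the closest end-aligned training block and
    concatenate the pitch levels of the chosen blocks."""
    input_blocks = segment_from_end(input_stress, block_len)
    training_blocks = [segment_from_end(sentence, block_len) for sentence in training]
    contour: List[float] = []
    n = len(input_blocks)
    for i, block in enumerate(input_blocks):
        k = n - i  # position counted from the sentence end (1 = final block)
        candidates = [tb[-k] for tb in training_blocks
                      if len(tb) >= k and len(tb[-k]) == len(block)]
        if not candidates:
            raise ValueError(f"no aligned training block for input block {i}")
        # Assumed metric: sum of absolute stress-level differences.
        best = min(candidates,
                   key=lambda c: sum(abs(s - cs) for s, (cs, _) in zip(block, c)))
        contour.extend(pitch for _, pitch in best)
    return contour


# Two invented training sentences, each a list of (stress, pitch) pairs.
training_sentences = [
    [(0, 100.0), (2, 140.0), (0, 110.0), (1, 120.0)],
    [(2, 150.0), (0, 105.0), (0, 95.0), (2, 135.0)],
]
print(segment_from_end([0, 2, 0, 1, 0, 0, 2], 3))
print(pitch_for_sentence([2, 0, 0, 1], training_sentences, block_len=2))
```

Segmenting from the end keeps sentence-final blocks aligned with each other regardless of sentence length, so any leftover short block falls at the start of the sentence.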
41. A method for generating pitch contours in a text to speech (TtS) system, the system converting input text into an output acoustic signal simulating natural speech, the input text including a plurality of input sentences, the method comprising the steps of:
(a) storing a pool of associated stress and pitch level pairs corresponding to a plurality of training sentences read by at least one speaker, each pair having a lexical stress level and a pitch level associated therewith;
(b) generating a lexical stress contour for each of the plurality of input sentences, the stress contours having a plurality of lexical stress levels associated therewith; and
(c) constructing the pitch contour for each of the plurality of input sentences by locating stress levels in the pool similar to the stress levels of the stress contour of each of the plurality of input sentences and copying the associated pitch levels.
Specification