System and method for predicting prosodic parameters
Abstract
A method for generating a prosody model that predicts prosodic parameters is disclosed. Upon receiving text annotated with acoustic features, the method comprises generating first classification and regression trees (CARTs) that predict durations and F0 from text by generating initial boundary labels by considering pauses, generating initial accent labels by applying a simple rule on text-derived features only, adding the predicted accent and boundary labels to feature vectors, and using the feature vectors to generate the first CARTs. The first CARTs are used to predict accent and boundary labels. Next, the first CARTs are used to generate second CARTs that predict durations and F0 from text and acoustic features by using lengthened accented syllables and phrase-final syllables, refining accent and boundary models simultaneously, comparing actual and predicted duration of a whole prosodic phrase to normalize speaking rate, and generating the second CARTs that predict the normalized speaking rate.
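The abstract says that the actual and predicted durations of a whole prosodic phrase are compared to normalize speaking rate, but it does not give a formula. A minimal Python sketch of one plausible reading follows: the ratio of actual to predicted phrase duration is treated as a rate factor, and each observed syllable duration is divided by it. The function name `normalize_speaking_rate`, the ratio-based formula, and the sample durations are assumptions for illustration and are not taken from the patent.

```python
from typing import List, Sequence


def normalize_speaking_rate(actual: Sequence[float],
                            predicted: Sequence[float]) -> List[float]:
    """Rescale the observed syllable durations of one prosodic phrase.

    Assumed formula: rate = actual phrase duration / predicted phrase
    duration, and every observed duration is divided by that rate.
    """
    actual_total = sum(actual)
    predicted_total = sum(predicted)
    if actual_total <= 0 or predicted_total <= 0:
        raise ValueError("phrase durations must be positive")
    rate = actual_total / predicted_total  # > 1 means slower than predicted
    return [duration / rate for duration in actual]


# Hypothetical syllable durations (seconds) for one phrase.
phrase_actual = [0.30, 0.20, 0.50]
phrase_predicted = [0.20, 0.10, 0.20]
normalized = normalize_speaking_rate(phrase_actual, phrase_predicted)
```

After this rescaling the normalized durations sum to the predicted phrase duration, so what remains is the relative lengthening of individual syllables rather than the overall tempo of the phrase.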
17 Claims
1. An automatic prosodic labeler for predicting prosodic parameters from annotated speech files, the automatic prosodic labeler comprising:
a first module that makes binary decisions about where to place accents and boundaries;
a second module that predicts a plurality of fundamental frequency targets per syllable and that predicts a z-score for each phone; and
a third module that labels speech with the binary decisions and that applies normalized duration features as acoustic features, wherein an iterative classification and regression tree (CART) growing process alternates between prosody prediction from text and prosody recognition from text plus speech to generate improved CARTs for predicting prosody parameters from preprocessed text.
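Claim 1 refers to a z-score for each phone without defining it. The Python sketch below assumes the usual definition applied to durations: each observed phone duration is expressed in standard deviations from the mean duration of that phone type. The function name `phone_zscores`, the use of the population standard deviation, the zero score for a phone type with no variance, and the sample data are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def phone_zscores(observations: List[Tuple[str, float]]) -> List[float]:
    """Return one duration z-score per (phone, duration) observation."""
    # Group the observed durations by phone type.
    by_phone: Dict[str, List[float]] = defaultdict(list)
    for phone, duration in observations:
        by_phone[phone].append(duration)

    # Mean and population standard deviation for each phone type.
    stats = {p: (mean(ds), pstdev(ds)) for p, ds in by_phone.items()}

    scores = []
    for phone, duration in observations:
        mu, sigma = stats[phone]
        # Assumed convention: a phone type with no variance scores 0.
        scores.append(0.0 if sigma == 0 else (duration - mu) / sigma)
    return scores


# Hypothetical phone durations in seconds.
obs = [("a", 0.08), ("a", 0.12), ("t", 0.05), ("a", 0.10), ("t", 0.05)]
z = phone_zscores(obs)
```

Working in z-scores puts intrinsically short and intrinsically long phones on a common scale, which is one reason a model might predict them instead of raw durations.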
10. A method of generating a prosody model for generating synthetic speech from text-derived annotated speech files, the method comprising:
(1) adding predicted linguistic features to text-derived annotations in the speech files;
(2) adding normalized syllable durations to the annotations;
(3) adding a plurality of extracted acoustic features to the annotations;
(4) generating initial accent and boundary labels by considering pauses and relative syllable durations;
(5) training CARTs to predict durations and F0s from the added predicted linguistic features and prosodic labels;
(6) training refined CARTs to predict normalized durations;
(7) training a first classifier to label accents and boundaries by:
(a) training a classifier to recognize predicted accent and predicted boundary labels;
(b) training the refined CARTs to output accent and boundary probabilities from linguistic features and relative syllable durations;
(c) relabeling the annotations;
(8) training the refined CARTs to predict accents and boundaries from linguistic features only;
(9) relabeling the annotations; and
(10) returning to step (5) until prosodic labels stabilize.
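Steps (4) through (10) of claim 10 describe a loop: start from initial labels, retrain and relabel, and repeat until the labels stop changing. The Python sketch below shows only that control flow. The names `initial_boundaries`, `iterate_until_stable`, and `toy_relabel`, the 0.1 s pause threshold, the 1.3 relative-duration threshold, and the sample data are hypothetical. In particular, `toy_relabel` is a fixed rule standing in for the CART training and relabeling of steps (5) through (9); it is not a CART and not the claimed method.

```python
from typing import Callable, List, Sequence, Tuple

Labels = List[int]  # one entry per syllable: 1 = boundary, 0 = no boundary


def initial_boundaries(pauses_after: Sequence[float],
                       min_pause: float = 0.1) -> Labels:
    """Step (4), boundaries only: label a boundary after any syllable that
    is followed by a pause of at least min_pause seconds (assumed value)."""
    return [1 if pause >= min_pause else 0 for pause in pauses_after]


def iterate_until_stable(labels: Labels,
                         retrain_and_relabel: Callable[[Labels], Labels],
                         max_rounds: int = 20) -> Tuple[Labels, int]:
    """Step (10): repeat retraining and relabeling until labels stabilize."""
    for round_no in range(1, max_rounds + 1):
        new_labels = retrain_and_relabel(labels)
        if new_labels == labels:      # labels stabilized, stop iterating
            return labels, round_no
        labels = new_labels
    return labels, max_rounds         # safety cap reached


# Hypothetical per-syllable data for one utterance.
pauses_after = [0.0, 0.25, 0.0, 0.0]        # seconds of silence after syllable
relative_durations = [0.9, 1.1, 1.5, 1.0]   # actual / predicted duration


def toy_relabel(labels: Labels) -> Labels:
    # Stand-in for steps (5) to (9): keep existing boundaries and add one
    # wherever the syllable is much longer than predicted (assumed rule).
    return [1 if old == 1 or rel > 1.3 else 0
            for old, rel in zip(labels, relative_durations)]


final_labels, rounds = iterate_until_stable(
    initial_boundaries(pauses_after), toy_relabel)
```

The `max_rounds` cap is a design choice of this sketch: the claim only says to return to step (5) until the labels stabilize, and a real relabeling procedure is not guaranteed to reach a fixed point.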
17. A computer readable medium storing instructions for controlling a computer device to perform a method of generating a prosody model from text-derived annotated speech files for use in prosody prediction, the method comprising:
(1) adding predicted linguistic features to text-derived annotations in the speech files;
(2) adding normalized syllable durations to the annotations;
(3) adding a plurality of extracted acoustic features to the annotations;
(4) generating initial accent and boundary labels by considering pauses and relative syllable durations;
(5) training CARTs to predict durations and F0s from the added predicted linguistic features and prosodic labels;
(6) training refined CARTs to predict normalized durations;
(7) training a first classifier to label accents and boundaries by:
(a) training a classifier to recognize predicted accent and predicted boundary labels;
(b) training the refined CARTs to output accent and boundary probabilities from linguistic features and relative syllable durations;
(c) relabeling the annotations;
(8) training the refined CARTs to predict accents and boundaries from linguistic features only;
(9) relabeling the annotations; and
(10) returning to step (5) until prosodic labels stabilize.
Specification