System and method for predicting prosodic parameters

US 7,136,816 B1
Filed: 12/24/2002
Issued: 11/14/2006
Est. Priority Date: 04/05/2002
Status: Active Grant

Abstract

A method for generating a prosody model that predicts prosodic parameters is disclosed. Upon receiving text annotated with acoustic features, the method comprises generating first classification and regression trees (CARTs) that predict durations and F0 from text by generating initial boundary labels by considering pauses, generating initial accent labels by applying a simple rule on text-derived features only, adding the predicted accent and boundary labels to feature vectors, and using the feature vectors to generate the first CARTs. The first CARTs are used to predict accent and boundary labels. Next, the first CARTs are used to generate second CARTs that predict durations and F0 from text and acoustic features by using lengthened accented syllables and phrase-final syllables, refining accent and boundary models simultaneously, comparing actual and predicted duration of a whole prosodic phrase to normalize speaking rate, and generating the second CARTs that predict the normalized speaking rate.
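The speaking-rate normalization mentioned at the end of the abstract can be illustrated with a minimal sketch. The abstract gives no formula, so the code assumes that the rate of a phrase is the ratio of its total actual duration to its total predicted duration, and that dividing each actual duration by this ratio removes the rate effect. The function names and sample values are hypothetical.

```python
def phrase_rate(actual, predicted):
    """Ratio of actual to predicted duration over one whole prosodic phrase.

    Assumed definition: the source only says the two durations are compared.
    """
    total_actual, total_predicted = sum(actual), sum(predicted)
    if total_actual <= 0 or total_predicted <= 0:
        raise ValueError("durations must sum to a positive value")
    return total_actual / total_predicted


def normalize_durations(actual, predicted):
    """Divide each actual syllable duration by the phrase-level rate factor."""
    rate = phrase_rate(actual, predicted)
    return [d / rate for d in actual]


# Hypothetical syllable durations in seconds for one prosodic phrase.
actual = [0.30, 0.15, 0.45]
predicted = [0.20, 0.10, 0.30]

rate = phrase_rate(actual, predicted)              # > 1 means slower than predicted
normalized = normalize_durations(actual, predicted)
```

After normalization the phrase has the same total duration as the prediction, so any remaining per-syllable differences reflect local lengthening rather than overall tempo.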

17 Claims

1. An automatic prosodic labeler for predicting prosodic parameters from annotated speech files, the automatic prosodic labeler comprising:
- a first module that makes binary decisions about where to place accents and boundaries;
- a second module that predicts a plurality of fundamental frequency targets per syllable and that predicts a z-score for each phone; and
- a third module that labels speech with the binary decisions and that applies normalized duration features as acoustic features, wherein an iterative classification and regression tree (CART) growing process alternates between prosody prediction from text and prosody recognition from text plus speech to generate improved CARTs for predicting prosody parameters from preprocessed text.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. The prosodic labeler of claim 1, wherein the first module comprises CARTs that generate initial accent and boundary labels by considering pauses and relative syllable durations.
  - 3. The prosodic labeler of claim 2, wherein the second module comprises CARTs that predict three F0 targets per syllable.
  - 4. The prosodic labeler of claim 2, wherein the first module further makes initial accent labels applying a simple rule on text-derived features only.
  - 5. The prosodic labeler of claim 1, wherein the third module further comprises CARTs.
  - 6. The prosodic labeler of claim 1, wherein pause durations and syllable durations, obtained from phonetic segmentation and normalization, are added to textual features in the annotated speech files.
  - 7. The prosodic labeler of claim 1, wherein the annotations in the annotated speech files relate to words, punctuation, pronunciation, word and syllable boundaries, lexical stress and parts of speech.
  - 8. The prosodic labeler of claim 7, wherein the prosodic labeler extracts F0 contours from the annotated speech files, interpolates for unvoiced regions, takes three samples per syllable, performs a cluster analysis, and adds quantized F0s to the annotations.
  - 9. The prosodic labeler of claim 1, wherein the iterative CART growing process further comprises:
    - (1) adding predicted linguistic features to text-derived annotations in the speech files;
    - (2) adding normalized syllable durations to the annotations;
    - (3) adding a plurality of extracted acoustic features to the annotations;
    - (4) generating initial accent and boundary labels by considering pauses and relative syllable durations;
    - (5) training CARTs to predict durations and F0s from the added predicted linguistic features and prosodic labels;
    - (6) training refined CARTs to predict normalized durations;
    - (7) training a first classifier to label accents and boundaries by:
      - (a) training an n-next-neighborhood classifier to recognize predicted accent and predicted boundary labels;
      - (b) training the refined CARTs to output accent and boundary probabilities from linguistic features and relative syllable durations;
      - (c) relabeling the annotations;
    - (8) training the refined CARTs to predict accents and boundaries from linguistic features only;
    - (9) relabeling the annotations; and
    - (10) returning to step (5) until prosodic labels stabilize.
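Claim 1's second module predicts a z-score for each phone. The claim does not say which quantity is z-scored, so the sketch below assumes it is phone duration, normalized by the mean and standard deviation of that phone type in the corpus. All names and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def phone_stats(segments):
    """Mean and population standard deviation of duration per phone label."""
    by_phone = defaultdict(list)
    for phone, duration in segments:
        by_phone[phone].append(duration)
    return {p: (mean(ds), pstdev(ds)) for p, ds in by_phone.items()}


def duration_z_scores(segments, stats):
    """Z-score each duration against the statistics of its own phone type."""
    scores = []
    for phone, duration in segments:
        mu, sigma = stats[phone]
        # A phone with constant duration has sigma == 0; report 0.0 for it.
        scores.append((duration - mu) / sigma if sigma > 0 else 0.0)
    return scores


def durations_from_z(phones, z_scores, stats):
    """Invert the normalization: map predicted z-scores back to durations."""
    return [stats[p][0] + z * stats[p][1] for p, z in zip(phones, z_scores)]


# Hypothetical (phone, duration in seconds) pairs from a phonetic segmentation.
segments = [("a", 0.10), ("a", 0.14), ("t", 0.05), ("t", 0.07), ("a", 0.12)]

stats = phone_stats(segments)
z = duration_z_scores(segments, stats)
restored = durations_from_z([p for p, _ in segments], z, stats)
```

Working in z-scores lets a single predictor cover phones with very different intrinsic lengths; `durations_from_z` shows how a predicted z-score would be turned back into seconds at synthesis time.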

10. A method of generating a prosody model for generating synthetic speech from text-derived annotated speech files, the method comprising:
- (1) adding predicted linguistic features to text-derived annotations in the speech files;
- (2) adding normalized syllable durations to the annotations;
- (3) adding a plurality of extracted acoustic features to the annotations;
- (4) generating initial accent and boundary labels by considering pauses and relative syllable durations;
- (5) training CARTs to predict durations and F0s from the added predicted linguistic features and prosodic labels;
- (6) training refined CARTs to predict normalized durations;
- (7) training a first classifier to label accents and boundaries by:
  - (a) training a classifier to recognize predicted accent and predicted boundary labels;
  - (b) training the refined CARTs to output accent and boundary probabilities from linguistic features and relative syllable durations;
  - (c) relabeling the annotations;
- (8) training the refined CARTs to predict accents and boundaries from linguistic features only;
- (9) relabeling the annotations; and
- (10) returning to step (5) until prosodic labels stabilize.
- Dependent claims (11, 12, 13, 14, 15, 16):
  - 11. The method of claim 10, further comprising, to generate the plurality of extracted acoustic features:
    - extracting F0 contours from the annotated speech files;
    - interpolating in unvoiced regions;
    - taking three samples per syllable;
    - performing a cluster analysis; and
    - adding quantized F0s to the annotations.
  - 12. The method of claim 11, wherein the cluster analysis is performed to obtain a plurality of prototypes representing different shapes of the F0 contours.
  - 13. The method of claim 10, wherein the added linguistic features relate to a yes-no question.
  - 14. The method of claim 10, wherein the annotations in the annotated speech files comprise words, punctuation, pronunciation, word and syllable boundaries, lexical stress and parts-of-speech.
  - 15. The method of claim 11, wherein the plurality of extracted features comprises eleven extracted features.
  - 16. The method of claim 10, further comprising, after step (6), optionally returning to step (5) to remake the CARTs.
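The acoustic feature extraction of claims 11 and 12 (interpolate unvoiced regions, take three samples per syllable, cluster, quantize) can be sketched as follows. The claims name neither the interpolation method, the sample positions, nor the clustering algorithm, so linear interpolation, first/middle/last-frame sampling, and plain k-means are assumptions made for illustration, as are all function names and data.

```python
def interpolate_unvoiced(f0):
    """Linearly fill frames marked unvoiced (0.0); edges copy the nearest voiced value."""
    voiced = [i for i, v in enumerate(f0) if v > 0]
    if not voiced:
        raise ValueError("contour has no voiced frames")
    out = list(f0)
    for i, value in enumerate(f0):
        if value > 0:
            continue
        left = max((j for j in voiced if j < i), default=None)
        right = min((j for j in voiced if j > i), default=None)
        if left is None:
            out[i] = f0[right]
        elif right is None:
            out[i] = f0[left]
        else:
            w = (i - left) / (right - left)
            out[i] = f0[left] + w * (f0[right] - f0[left])
    return out


def three_samples(f0, start, end):
    """F0 at the first, middle, and last frame of a syllable spanning [start, end)."""
    return (f0[start], f0[(start + end - 1) // 2], f0[end - 1])


def nearest(vec, prototypes):
    """Index of the prototype closest to vec (squared Euclidean distance)."""
    return min(range(len(prototypes)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, prototypes[k])))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; the first k vectors seed the prototypes (a simplification)."""
    if k > len(vectors):
        raise ValueError("need at least k vectors")
    prototypes = [tuple(v) for v in vectors[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, prototypes)].append(v)
        prototypes = [tuple(sum(col) / len(g) for col in zip(*g)) if g else prototypes[i]
                      for i, g in enumerate(groups)]
    return prototypes


# Hypothetical contour in Hz, one value per frame; 0.0 marks unvoiced frames.
contour = [100.0, 0.0, 120.0, 130.0, 0.0, 110.0,
           125.0, 115.0, 105.0, 110.0, 120.0, 130.0]
syllables = [(0, 3), (3, 6), (6, 9), (9, 12)]  # frame spans [start, end)

filled = interpolate_unvoiced(contour)
samples = [three_samples(filled, s, e) for s, e in syllables]
prototypes = kmeans(samples, k=2)
quantized = [nearest(v, prototypes) for v in samples]  # one cluster index per syllable
```

On this toy data the two rising syllables share one prototype and the two falling syllables share the other, so `quantized` acts as a discrete contour-shape label that can be added to the annotations.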

17. A computer readable medium storing instructions for controlling a computer device to perform a method of generating a prosody model from text-derived annotated speech files for use in prosody prediction, the method comprising:
- (1) adding predicted linguistic features to text-derived annotations in the speech files;
- (2) adding normalized syllable durations to the annotations;
- (3) adding a plurality of extracted acoustic features to the annotations;
- (4) generating initial accent and boundary labels by considering pauses and relative syllable durations;
- (5) training CARTs to predict durations and F0s from the added predicted linguistic features and prosodic labels;
- (6) training refined CARTs to predict normalized durations;
- (7) training a first classifier to label accents and boundaries by:
  - (a) training a classifier to recognize predicted accent and predicted boundary labels;
  - (b) training the refined CARTs to output accent and boundary probabilities from linguistic features and relative syllable durations;
  - (c) relabeling the annotations;
- (8) training the refined CARTs to predict accents and boundaries from linguistic features only;
- (9) relabeling the annotations; and
- (10) returning to step (5) until prosodic labels stabilize.
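The control flow of steps (5) to (10), retraining and relabeling until the prosodic labels stabilize, can be sketched with a toy stand-in. This is not the claimed training procedure: a one-split threshold on relative syllable duration replaces the CARTs, a single accent label replaces the full accent and boundary labeling, and the initial rule (`d > 1.05`) is an arbitrary assumption standing in for step (4). All names and values are hypothetical.

```python
from statistics import mean


def train_stump(durations, labels):
    """Fit a one-split tree: a threshold halfway between the two class means."""
    accented = [d for d, y in zip(durations, labels) if y]
    plain = [d for d, y in zip(durations, labels) if not y]
    if not accented or not plain:
        return mean(durations)  # degenerate labeling: fall back to the overall mean
    return (mean(accented) + mean(plain)) / 2


def predict_stump(threshold, duration):
    """Label a syllable as accented when its relative duration exceeds the threshold."""
    return duration > threshold


def iterate_until_stable(durations, labels, max_rounds=20):
    """Alternate training and relabeling until the labels stop changing."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    for round_number in range(1, max_rounds + 1):
        threshold = train_stump(durations, labels)
        relabeled = [predict_stump(threshold, d) for d in durations]
        if relabeled == labels:       # labels have stabilized
            return threshold, labels, round_number
        labels = relabeled            # otherwise go around again
    return threshold, labels, max_rounds


# Hypothetical relative syllable durations (1.0 = as long as expected).
durations = [0.8, 0.9, 1.0, 1.1, 1.6, 1.7]
initial = [d > 1.05 for d in durations]  # crude initial accent labels (assumed rule)

threshold, labels, rounds = iterate_until_stable(durations, initial)
```

On this toy data the borderline syllable at 1.1 loses its accent label in the first round and the labels settle in the second. The `max_rounds` cap is a practical safeguard for the sketch; the claims only state that iteration continues until the labels stabilize.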

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: AT&T Corporation (AT&T, Inc.)
Inventors: Strom, Volker Franz
Primary Examiner(s): Knepper, David D.
Application Number: US10/329,181
Time in Patent Office: 1,421 Days
Field of Search: None
US Class Current: 704/260
CPC Class Codes:
G10L 13/04 Details of speech synthesis...
G10L 13/10 Prosody rules derived from ...
