Methods for generating pitch and duration contours in a text to speech system

US 6,101,470 A
Filed: 05/26/1998
Issued: 08/08/2000
Est. Priority Date: 05/26/1998
Status: Expired due to Term

Abstract

A method for automatically generating pitch contours in a text to speech (TtS) system, the system converting input text into an output acoustic signal simulating natural speech, the method comprising the steps of: storing a plurality of associated stress and pitch level pairs, each of the plurality of pairs including a lexical stress level and a pitch level; calculating lexical stress levels of the input text; comparing the stress levels of the input text to the stored stress levels of the plurality of associated stress and pitch level pairs to find the stored stress levels closest to the stress levels of the input text; and copying the pitch levels associated with the closest stored stress levels of the stress and pitch level pairs to generate the pitch contours of the input text. Features illustrative of various modes of the invention include stress and pitch level pairs that correspond with the end of vowels, use of a phonetic dictionary to expand words to phonemes and concatenate stress levels, blocking sentences and the stress contours into constant or variable lengths by segmenting from the ends toward the beginnings, and averaging at the block boundary. The method may distinguish among declarations, questions, and exclamations. Training text may be collected from more than one speaker and scaled; the speaker(s) may wear a laryngograph to provide vocal cord activity.
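
The dictionary-based stress calculation mentioned in the abstract can be illustrated with a minimal sketch. The toy dictionary, its phoneme symbols, and the function name `stress_contour` are hypothetical and chosen only for illustration; the stress coding (0 = none, 1 = secondary, 2 = primary) follows claim 3.

```python
# Hypothetical toy dictionary: word -> (phonemes, one stress level per vowel).
# Stress levels follow claim 3: 0 = no stress, 1 = secondary, 2 = primary.
PHONETIC_DICT = {
    "the": (["DH", "AH"], [0]),
    "speaker": (["S", "P", "IY", "K", "ER"], [2, 0]),
    "recorded": (["R", "IH", "K", "AO", "R", "D", "IH", "D"], [0, 2, 0]),
    "information": (["IH", "N", "F", "ER", "M", "EY", "SH", "AH", "N"], [1, 0, 2, 0]),
}


def stress_contour(sentence):
    """Expand each word via the dictionary and concatenate its stress levels."""
    contour = []
    for word in sentence.lower().split():
        word = word.strip(".,?!")  # drop sentence punctuation before lookup
        _phonemes, stresses = PHONETIC_DICT[word]  # KeyError for unknown words
        contour.extend(stresses)
    return contour


print(stress_contour("The speaker recorded information."))
```

A word missing from the dictionary raises `KeyError` in this sketch; how a real system handles out-of-vocabulary words is not stated in the text above.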

41 Claims

1. A method for generating pitch contours in a text to speech (TtS) system, the system converting input text into an output acoustic signal simulating natural speech, the method comprising the steps of:
- (a) storing a plurality of associated stress and pitch level pairs, each of the plurality of pairs including a lexical stress level and a pitch level;
  
  (b) determining lexical stress levels of the input text;
  
  (c) comparing the stress levels of the input text to the stored stress levels of the plurality of associated stress and pitch levels pairs to find the stored stress levels closest to the stress levels of the input text; and
  
  (d) copying the pitch levels associated with the closest stress levels of the stress and pitch level pairs to generate the pitch contours of the input text.
- Dependent claims 2 to 34:
- - 2. The method of claim 1, wherein the stress level and the pitch level of each of the plurality of pairs correspond to an end time of a vowel.
  - 3. The method of claim 1, wherein the stress level is one of a zero stress level corresponding to no stress, a first stress level corresponding to secondary stress, and a second stress level corresponding to primary stress.
  - 4. The method of claim 1, wherein said storing step further comprises the step of training a pitch contour model based on a training text read by at least one speaker to generate the plurality of stress and pitch level pairs, the training text comprising a plurality of training sentences, the plurality of pairs further comprising a plurality of sequences of stress and pitch level pairs, each sequence corresponding to one of the plurality of training sentences.
  - 5. The method of claim 4, wherein said step of training the pitch contour model comprises the steps of:
    - (a) recording speech data and laryngograph data corresponding to the reading of the training sentences by the at least one speaker;
      
      (b) calculating the pitch contour of each of the plurality of training sentences;
      
      (c) time-aligning the speech data to the training text to determine an end-time for each vowel;
      
      (d) calculating the stress contour of each of the plurality of training sentences; and
      
      (e) collating the pitch contours, syllable end-times, and stress contours to generate the sequence of stress and pitch level pairs for each of the plurality of training sentences.
  - 6. The method of claim 5, wherein the pitch contour of each of the plurality of training sentences is calculated from the laryngograph data as a function of time by noting a length of time between impulses.
  - 7. The method of claim 5, wherein the speech data is time-aligned to the training text using the Viterbi algorithm.
  - 8. The method of claim 5, wherein said step of calculating the stress contour of each of the plurality of training sentences comprises the steps of:
    - (a) expanding each word of each of the plurality of training sentences into constituent phonemes according to a phonetic dictionary, the dictionary having a plurality of entries, each entry associated with a word to be synthesized and comprising a sequence of phonemes which form the word and a sequence of stress levels corresponding to vowels in the word; and
      
      (b) concatenating the stress levels of the words in the dictionary forming each of the plurality of training sentences.
  - 9. The method of claim 5, wherein the training sentences are read by a first and a second speaker, average values of the pitch of the first and second speakers are calculated, and the pitch levels corresponding to the second speaker are multiplied by the average value of the pitch of the first speaker and divided by the average value of the pitch of the second speaker.
  - 10. The method of claim 1, wherein the input text comprises a plurality of input sentences, and the step of calculating the stress levels of the input text comprises the steps of:
    - (a) expanding each word of each of the plurality of input sentences into constituent phonemes according to a phonetic dictionary, the dictionary having a plurality of entries, each entry associated with a word to be synthesized and comprising a sequence of phonemes which form the word and a sequence of stress levels corresponding to vowels in the word; and
      
      (b) copying the stress levels of the words in the dictionary forming each of the plurality of input sentences.
  - 11. The method of claim 1, wherein the input text comprises a plurality of input sentences and the plurality of pairs corresponds to a plurality of training sentences read by at least one speaker, said comparing step comprising:
    - (a) segmenting stress contours of the input and training sentences by aligning the ends of the stress contours and respectively segmenting the stress contours from the ends toward the beginnings, to generate a plurality of stress contour input blocks respectively aligned with a plurality of stress contour training blocks, the stress contours including a plurality of stress levels, the ends of the stress contours corresponding to the ends of the sentences; and
      
      (b) respectively comparing the stress levels of each of the plurality of input blocks to the stress levels of each of the plurality of aligned training blocks to obtain a sequence of aligned training blocks having the closest stress levels to the compared input blocks for each of the plurality of input sentences.
  - 12. The method of claim 11, wherein said step of respectively comparing the stress levels of each of the plurality of input blocks to the stress levels of each of the plurality of aligned training blocks further comprises the steps of:
    - calculating a distance between vectors representative of each of the plurality of input blocks to vectors representative of each of the aligned training blocks to obtain the aligned training block having the closest distance to the compared input block for each of the plurality of input blocks, the distance calculation starting from the input block and aligned training blocks corresponding to the end of the input sentence and respectively continuing to the input block and aligned training blocks corresponding to the beginning of the input sentence, for each of the plurality of input sentences; and
      
      concatenating the aligned training blocks having the shortest distances to the respectively compared input blocks for each of the plurality of input sentences.
  - 13. The method of claim 12, wherein the calculated distance between vectors is a Euclidean distance.
  - 14. The method of claim 12, wherein the stress contour input and training blocks are the same blocksize.
  - 15. The method of claim 12, wherein the blocksize corresponds to a predefined number of syllables.
  - 16. The method of claim 12, wherein the stress contour input and training blocks are of variable length.
  - 17. The method of claim 16, wherein the variable block length corresponds to a nominal number of predefined syllables plus an additional number of syllables, the nominal number and the additional number of syllables corresponding to a maximum number of syllables that allow an exact match between the stress levels of the input block and the stress levels of the aligned training block.
  - 18. The method of claim 11, wherein the step of comparing the stress levels of each of the plurality of input blocks to the stress levels of each of the aligned training blocks compares stress levels corresponding to an identical utterance type.
  - 19. The method of claim 18, wherein the utterance type is one of a declaration, a question, and an exclamation.
  - 20. The method of claim 11, wherein the pitch level at an edge of the block in the sequence of training blocks is averaged with the pitch level at the edge of a following block.
  - 21. The method of claim 1, wherein the copying step further comprises concatenating the copied pitch levels to generate the pitch contours of the input text.
  - 22. The method of claim 1, further comprising the step of adjusting the pitch levels associated with the closest stress levels when the closest stress levels do not exactly match the corresponding stress levels of the input text.
  - 23. The method of claim 22, wherein said adjusting step comprises the steps of:
    - multiplying the pitch levels associated with the closest stress levels by a first factor, when the closest stress levels are less than the corresponding stress levels of the input text; and
      
      multiplying the pitch levels associated with the closest stress levels by a second factor, when the closest stress levels are greater than the corresponding stress levels of the input text.
  - 24. The method of claim 23, wherein the first factor equals 1.15 and the second factor equals 0.85.
  - 25. The method of claim 22, further comprising the step of linearly interpolating between the adjusted pitch levels forming an adjusted pitch contour to calculate a remainder of each adjusted pitch contour.
  - 26. The method of claim 1, wherein the input text includes a plurality of phonemes, the method further comprising the step of adjusting the durations of the phonemes of the input text based on the stress levels associated with the phonemes.
  - 27. The method of claim 26, wherein the stress level associated with a phoneme is one of a zero stress level corresponding to no stress, a first stress level corresponding to secondary stress, and a second stress level corresponding to primary stress.
  - 28. The method of claim 27, wherein said adjusting step further comprises the steps of:
    - (a) multiplying the durations of each of the plurality of phonemes having the first stress level by a third factor; and
      
      (b) multiplying the durations of each of the plurality of phonemes having the second stress level by a fourth factor.
  - 29. The method of claim 28, wherein the third factor equals 1.08 and the fourth factor equals 1.20.
  - 30. The method of claim 28, wherein the third factor is calculated by dividing an average duration of the plurality of phonemes, independent of the stress level, by an average duration of the phonemes having secondary stress.
  - 31. The method of claim 28, wherein the fourth factor is calculated by dividing an average duration of the plurality of phonemes, independent of the stress level, by an average duration of the phonemes having primary stress.
  - 32. The method of claim 1, further comprising the step of storing the stress levels of the input text in a database.
  - 33. The method of claim 1, wherein the input text includes a plurality of input sentences, the stored stress and pitch level pairs correspond to a plurality of training sentences, and the training and input sentences correspond to a plurality of utterance types.
  - 34. The method of claim 33, wherein each of the plurality of utterance types is identified by a special symbol at an end of one of the training and input sentences.
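
Two of the arithmetic steps in the dependent claims above can be sketched briefly: the two-speaker pitch scaling of claim 9 and the mismatch adjustment of claims 22 to 24. This is an illustrative reading, not the patented implementation; the function names and the list-of-pairs data layout are assumptions, while the factors 1.15 and 0.85 come from claim 24.

```python
from statistics import mean

RAISE_FACTOR = 1.15  # claim 24: matched stress is lower than the input stress
LOWER_FACTOR = 0.85  # claim 24: matched stress is higher than the input stress


def scale_second_speaker(pitch_first, pitch_second):
    """Claim 9: multiply speaker 2's pitch by mean(speaker 1) / mean(speaker 2)."""
    ratio = mean(pitch_first) / mean(pitch_second)
    return [p * ratio for p in pitch_second]


def adjust_pitch(input_stress, matched_pairs):
    """Claims 22-24: rescale copied pitch where the stress match is inexact.

    input_stress  -- stress levels of the input text
    matched_pairs -- (stress, pitch) pairs copied from the closest training match
    """
    adjusted = []
    for target, (stress, pitch) in zip(input_stress, matched_pairs):
        if stress < target:
            pitch *= RAISE_FACTOR
        elif stress > target:
            pitch *= LOWER_FACTOR
        adjusted.append(pitch)
    return adjusted


print(scale_second_speaker([100, 200], [200, 400]))
print(adjust_pitch([2, 0, 1], [(1, 100.0), (1, 100.0), (1, 100.0)]))
```

Claim 25 then fills in the rest of each adjusted contour by linear interpolation between these adjusted pitch levels, which is not shown here.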

35. A method for generating duration contours in a text to speech (TtS) system, the system converting input text into an output acoustic signal simulating natural speech, the input text including a plurality of phonemes, the method comprising the steps of:
- determining lexical stress levels of the input text; and
  
  adjusting the durations of the phonemes of the input text by multiplying the durations of each of the plurality of phonemes having a stress level corresponding to primary or secondary lexical stress by a first or a second factor, respectively.
- Dependent claims 36 to 39:
- - 36. The method of claim 35 wherein a phoneme has one of no stress, the secondary lexical stress, and the primary lexical stress.
  - 37. The method of claim 35, wherein the first factor equals 1.08 and the second factor equals 1.20.
  - 38. The method of claim 35, wherein the first factor is calculated by dividing an average duration of the plurality of phonemes, independent of the stress level, by an average duration of the phonemes having associated secondary stress.
  - 39. The method of claim 35, wherein the fourth factor is calculated by dividing an average duration of the plurality of phonemes, independent of the stress level, by an average duration of the phonemes having associated primary stress.
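
The duration adjustment of claim 35 amounts to a per-phoneme multiplication, sketched below. The mapping of 1.08 to secondary stress and 1.20 to primary stress follows claims 28 and 29; the tuple layout, the millisecond unit, and the function name are assumptions made for illustration.

```python
SECONDARY_FACTOR = 1.08  # applied to phonemes with secondary stress (level 1)
PRIMARY_FACTOR = 1.20    # applied to phonemes with primary stress (level 2)


def adjust_durations(phonemes):
    """Scale each phoneme duration by the factor for its stress level.

    phonemes -- list of (symbol, duration, stress_level) tuples, where
                stress_level is 0 (none), 1 (secondary) or 2 (primary).
    Unstressed phonemes are left unchanged.
    """
    factors = {0: 1.0, 1: SECONDARY_FACTOR, 2: PRIMARY_FACTOR}
    return [(symbol, duration * factors[stress])
            for symbol, duration, stress in phonemes]


# Durations here are in milliseconds (an assumed unit).
print(adjust_durations([("AH", 50.0, 0), ("IY", 100.0, 2), ("ER", 100.0, 1)]))
```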

40. A method for generating pitch contours in a text to speech (TtS) system, the system converting input text into an output acoustic signal simulating natural speech, the input text including a plurality of input sentences, the method comprising the steps of:
- storing a plurality of associated pitch and lexical stress level pairs based on a plurality of training sentences;
  
  determining a stress contour of each of the plurality of input sentences;
  
  segmenting the stress contours of the input and training sentences into a plurality of stress contour input blocks and stress contour training blocks, respectively, by aligning the ends of the input and training stress contours and respectively segmenting the input and training stress contours from the ends towards the beginnings, the ends of the stress contours corresponding to the ends of the sentences;
  
  respectively comparing the stress levels of each of the plurality of input blocks to the stress levels of each of the aligned training blocks to obtain a sequence of training blocks having the closest stress levels to the compared input blocks for each of the plurality of input sentences; and
  
  concatenating the pitch levels of the stress and pitch level pairs associated with the sequence of training blocks for each of the plurality of input sentences to form pitch contours for each of the plurality of input sentences.
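
The steps of claim 40 can be sketched as follows, under simplifying assumptions: a fixed block size, Euclidean distance as in claim 13, and candidate training blocks restricted to those at the same position from the sentence end and of the same length. The function names and data layout are hypothetical, and boundary averaging (claim 20) is omitted.

```python
import math


def blocks_from_end(seq, size):
    """Split seq into blocks of `size`, working from the end toward the start.

    The returned list is end-first: element 0 is the sentence-final block.
    The block nearest the beginning may be shorter than `size`.
    """
    blocks = []
    end = len(seq)
    while end > 0:
        start = max(0, end - size)
        blocks.append(seq[start:end])
        end = start
    return blocks


def pitch_contour(input_stress, training, size=3):
    """Build a pitch contour by copying pitch from the closest training blocks.

    input_stress -- stress levels of one input sentence
    training     -- list of training sentences, each a list of (stress, pitch)
    """
    train_blocks = [blocks_from_end(sentence, size) for sentence in training]
    chosen = []
    for k, in_block in enumerate(blocks_from_end(input_stress, size)):
        # Candidates: k-th block from the end of each training sentence,
        # kept only if it has the same length as the input block.
        candidates = [tb[k] for tb in train_blocks
                      if k < len(tb) and len(tb[k]) == len(in_block)]
        if not candidates:
            raise ValueError(f"no aligned training block for block {k}")
        best = min(candidates,
                   key=lambda blk: math.dist(in_block, [s for s, _ in blk]))
        chosen.append([pitch for _, pitch in best])
    # Blocks were matched end-first; reverse to restore sentence order.
    return [pitch for block in reversed(chosen) for pitch in block]


training = [
    [(0, 100.0), (2, 140.0), (0, 110.0), (1, 120.0), (0, 90.0)],
    [(2, 150.0), (0, 105.0), (0, 100.0), (2, 130.0), (0, 85.0)],
]
print(pitch_contour([0, 2, 0, 2, 0], training))
```

In this toy run the sentence-final block is copied from the second training sentence and the initial block from the first, so the output contour mixes pitch levels from both.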

41. A method for generating pitch contours in a text to speech (TtS) system, the system converting input text into an output acoustic signal simulating natural speech, the input text including a plurality of input sentences, the method comprising the steps of:
- (a) storing a pool of associated stress and pitch level pairs corresponding to a plurality of training sentences read by at least one speaker, each pair having a lexical stress level and a pitch level associated therewith;
  
  (b) generating a lexical stress contour for each of the plurality of input sentences, the stress contours having a plurality of lexical stress levels associated therewith; and
  
  (c) constructing the pitch contour for each of the plurality of input sentences by locating stress levels in the pool similar to the stress levels of the stress contour of each of the plurality of input sentences and copying the associated pitch levels.

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: International Business Machines Corporation
Inventors: Donovan, Robert E.; Eide, Ellen M.
Primary Examiner(s): Hudspeth, David R.
Assistant Examiner(s): Storm, Donald L.
Application Number: US09/084,679
Time in Patent Office: 805 Days
Field of Search: 704/260, 704/266, 704/267, 704/268
US Class Current: 704/260
CPC Class Codes: G10L 13/10 Prosody rules derived from ...
