Speech synthesis method utilizing auxiliary information, medium recorded thereon the method and apparatus utilizing the method

US 5,940,797 A
Filed: 09/18/1997
Issued: 08/17/1999
Est. Priority Date: 09/24/1996
Status: Expired due to Term

Abstract

In a method and apparatus which use actual speech as auxiliary information and synthesize speech by speech synthesis by rule, prosodic information for a phoneme sequence of each word of a word sequence obtained by an analysis of an input text is set by referring to a word dictionary, and a speech waveform sequence is obtained from the phoneme sequence of each word by referring to a speech waveform dictionary. Additional prosodic information is extracted from input actual speech, and at least one of the set prosodic information or at least one of the extracted prosodic information is selected and used to control the speech waveform sequence to create synthesized speech.

20 Claims

1. A text speech synthesis method by rule which synthesizes arbitrary speech through the use of an input text, said method comprising the steps of:
- (a) analyzing said input text by reference to a word dictionary and identifying a sequence of words in said input text to obtain a sequence of phonemes of each word;
  
  (b) setting a fundamental frequency, a power and a phoneme duration specified for each phoneme of said each word as first prosodic parameters on the basis of said word dictionary;
  
  (c) selecting from a speech waveform dictionary phoneme waveforms corresponding to said phonemes in said each word to thereby generate a sequence of phoneme waveforms;
  
  (d) extracting a fundamental frequency, a speech power and a phoneme duration as second prosodic parameters from input actual speech;
  
  (e) selecting at least one of said first prosodic parameters or at least one of said second prosodic parameters as a selected prosodic parameter; and
  
  (f) generating synthesized speech by controlling said sequence of phoneme waveforms with said selected prosodic parameter.
- Dependent claims (2, 3, 4, 5, 6, 7, 8)
  - 2. The method of claim 1, wherein said step (e) includes a step of selecting at least one of said second prosodic parameters and said first prosodic parameters corresponding to the remaining second prosodic parameters other than said at least one of said second prosodic parameters.
  - 3. The method of claim 1, further comprising a step of extracting a desired band of said input actual speech and mixing it with another band of said synthesized speech to create synthesized speech for output.
  - 4. The method of claim 1 or claim 2, wherein said phoneme duration in said selected prosodic parameters, which represents start and end points of said each phoneme, is output as a speech synchronizing signal to be used externally.
  - 5. The method of claim 1 or claim 2, wherein a sentence of said actual speech and a sentence of said text are the same.
  - 6. The method of claim 1 or 2, wherein a sentence of said actual speech and a sentence of said text differ from each other.
  - 7. The method of claim 1, wherein said step (d) includes a step of storing said second prosodic parameters in a memory and said step (e) includes a step of reading out at least one part of said second prosodic parameters from said memory.
  - 8. The method of claim 1, further comprising a step of displaying at least one of said extracted fundamental frequency, speech power and phoneme duration on a display screen and correcting an extraction error.
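
Step (e) of claim 1, together with claims 2 and 4, can be illustrated with a short Python sketch. This is not the patented implementation. The `Prosody` container, the `select_prosody` and `sync_points` helpers, the field names, the units and the sample values are all assumptions made for illustration, and the sketch assumes the rule-based and extracted sequences describe the same phonemes.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Prosody:
    """Per-phoneme prosodic parameters (hypothetical container, not from the patent)."""
    f0_hz: List[float]        # fundamental frequency of each phoneme
    power_db: List[float]     # speech power of each phoneme
    duration_ms: List[float]  # duration of each phoneme


FIELDS = ("f0_hz", "power_db", "duration_ms")


def select_prosody(first: Prosody, second: Prosody, use_second: Set[str]) -> Prosody:
    """Take every field named in use_second from the extracted (second) parameters
    and each remaining field from the rule-based (first) parameters."""
    unknown = use_second - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown prosodic parameter(s): {sorted(unknown)}")
    return Prosody(**{
        name: list(getattr(second if name in use_second else first, name))
        for name in FIELDS
    })


def sync_points(duration_ms: List[float]) -> List[Tuple[float, float]]:
    """Convert phoneme durations into (start, end) times, in the spirit of the
    synchronizing signal of claim 4."""
    points, t = [], 0.0
    for d in duration_ms:
        points.append((t, t + d))
        t += d
    return points


# Illustrative values only: parameters set by rule vs. extracted from actual speech.
rule_based = Prosody([120.0, 125.0, 110.0], [60.0, 62.0, 58.0], [80.0, 100.0, 90.0])
extracted = Prosody([180.0, 210.0, 150.0], [65.0, 70.0, 61.0], [95.0, 140.0, 120.0])

# Use extracted pitch and timing, but keep the rule-based power.
selected = select_prosody(rule_based, extracted, {"f0_hz", "duration_ms"})
print(selected)
print(sync_points(selected.duration_ms))
```

Passing an empty set reproduces purely rule-based prosody, and passing all three field names drives the synthesizer entirely from the actual speech.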

9. A speech synthesizer for synthesizing speech corresponding to input text by speech synthesis by rule, said synthesizer comprising:
- text analysis means for sequentially identifying a sequence of words forming said input text by reference to a word dictionary to thereby obtain a sequence of phonemes of each word;
  
  prosodic parameter setting means for setting first prosodic parameters for each phoneme in said each word that is set in said word dictionary in association with said each word, said prosodic parameter setting means including fundamental frequency setting means, speech power setting means and duration setting means for setting, respectively, a fundamental frequency, speech power and duration of each phoneme as said first prosodic parameters for said each word provided in said word dictionary in association with said each word;
  
  speech segment select means for selectively reading out of a speech waveform dictionary a speech waveform corresponding to said each phoneme in each of said identified words;
  
  prosodic parameter extracting means for extracting second prosodic parameters from input actual speech, said prosodic parameter extracting means including fundamental frequency extracting means, speech power extracting means and duration extracting means for extracting, respectively, a fundamental frequency, a speech power and a phoneme duration as said second prosodic parameters from said input actual speech through a fixed analysis window at a regular time interval;
  
  prosodic parameter select means for selecting at least one of said first prosodic parameters or at least one of said second prosodic parameters as a selected prosodic parameter; and
  
  speech synthesizing means for controlling said selected speech waveform by said selected prosodic parameters and for outputting said synthesized speech.
- Dependent claims (10, 11, 12, 13, 14, 15, 16)
  - 10. The synthesizer of claim 9, wherein either one of said phoneme duration in said first and second prosodic parameters is output as a synchronizing signal to be used externally.
  - 11. The synthesizer of claim 9, which further comprises memory means for storing said second prosodic parameters and wherein said select means reads out at least one part of said second prosodic parameters from said memory means.
  - 12. The synthesizer of claim 9, further comprising first filter means for passing therethrough a predetermined first band of said input actual speech, second filter means for passing therethrough a second band of synthesized speech from said speech synthesizing means that differs from said first band, and combining means for combining the outputs from said first and second filter means into synthesized speech for output.
  - 13. The synthesizer of claim 12, wherein said first filter means is a high-pass filter for passing a band higher than said fundamental frequency and said second filter means is a low-pass filter for passing a band containing said fundamental frequency and frequencies lower than the band of said first filter means.
  - 14. The synthesizer of claim 9, further comprising display means for displaying said second prosodic parameters and a prosodic information graphical user interface for modifying said second prosodic parameters by correcting an error of said second prosodic parameters displayed on the display screen.
  - 15. The synthesizer of claim 14, wherein said prosodic information graphical user interface includes fundamental frequency editor means for modifying said extracted fundamental frequency in response to a correction of said displayed fundamental frequency, speech power editor means for modifying said extracted speech power in response to a correction of said displayed speech power, and phoneme duration editor means for modifying said extracted phoneme duration in response to a correction of said displayed phoneme duration.
  - 16. The synthesizer of claim 15, wherein said display means includes speech editor means for displaying a speech symbol sequence provided from said text analysis means and for correcting an error in a speech symbol sequence displayed by said display means to thereby correct the corresponding error in said speech symbol sequence.
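
Claim 9 recites extraction "through a fixed analysis window at a regular time interval". A minimal Python sketch of that framing idea follows. The estimators are assumptions, since the claims do not say which are used: power is taken as the mean squared amplitude of each frame, and the fundamental frequency as the autocorrelation peak within an assumed 60 to 400 Hz search range. All function names, the 40 ms window and the 10 ms interval are hypothetical. Phoneme duration extraction is left out because it needs phoneme boundary detection.

```python
import math
from typing import List, Tuple


def frames(signal: List[float], win: int, hop: int) -> List[List[float]]:
    """Cut the signal into fixed-length windows taken every `hop` samples."""
    return [signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)]


def frame_power(frame: List[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def frame_f0(frame: List[float], sample_rate: int,
             f_min: float = 60.0, f_max: float = 400.0) -> float:
    """Estimate F0 from the strongest positive autocorrelation value in the
    lag range that corresponds to [f_min, f_max]. Returns 0.0 if none is found."""
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_r = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return sample_rate / best_lag if best_lag else 0.0


def extract_prosody(signal: List[float], sample_rate: int,
                    win_ms: float = 40.0, hop_ms: float = 10.0) -> List[Tuple[float, float]]:
    """Return one (power, f0_hz) pair per analysis frame."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return [(frame_power(f), frame_f0(f, sample_rate)) for f in frames(signal, win, hop)]


# Demo on a synthetic 200 Hz tone standing in for recorded speech.
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
for power, f0 in extract_prosody(tone, sample_rate):
    print(f"power={power:.3f}  f0={f0:.1f} Hz")
```

Real speech would also need a voiced/unvoiced decision and protection against octave errors, which is the kind of extraction error that the display-and-correct steps of claims 14 to 16 address.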

17. A recording medium which has recorded thereon a procedure for synthesizing arbitrary speech by rule from an input text, said procedure comprising the steps of:
- (a) analyzing said input text by reference to a word dictionary and identifying a sequence of words in said input text to obtain a sequence of phonemes of each word;
  
  (b) setting first prosodic parameters for each of said phonemes in said each word;
  
  (c) selecting from a speech waveform dictionary phoneme waveforms corresponding to said phonemes in said each word to thereby generate a sequence of phoneme waveforms;
  
  (d) extracting a fundamental frequency, a speech power and a phoneme duration from input actual speech as second prosodic parameters;
  
  (e) selecting at least one of said first prosodic parameters or at least one of said second prosodic parameters as a selected prosodic parameter; and
  
  (f) generating synthesized speech by controlling said sequence of phoneme waveforms with said selected prosodic parameters.
- Dependent claims (18, 19, 20)
  - 18. The recording medium of claim 17, wherein said procedure further comprises a step of extracting a desired band of said input actual speech and mixing it with another band of said synthesized speech to create synthesized speech for output.
  - 19. The recording medium of claim 17, wherein said step (d) includes a step of storing said second prosodic parameters in a memory and said step (e) includes a step of reading out at least one of said second prosodic parameters from said memory.
  - 20. The recording medium of claim 17, wherein said procedure includes a step of displaying at least one of said extracted fundamental frequency, speech power and phoneme duration on a display screen and correcting an extraction error.
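
The band mixing of claims 3, 12, 13 and 18 can be sketched as follows: a high band of the actual speech is combined with a low band of the synthesized speech. The one-pole filter, the complementary high-pass construction, the function names and the cutoff value are all assumptions chosen to keep the example short. The claims only require a high-pass filter on the actual speech and a low-pass filter, containing the fundamental frequency, on the synthesized speech.

```python
import math
from typing import List


def low_pass(signal: List[float], cutoff_hz: float, sample_rate: int) -> List[float]:
    """One-pole low-pass filter (a deliberately simple stand-in for a real filter design)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out


def high_pass(signal: List[float], cutoff_hz: float, sample_rate: int) -> List[float]:
    """Complementary high-pass: the input minus its low-passed version."""
    low = low_pass(signal, cutoff_hz, sample_rate)
    return [x - l for x, l in zip(signal, low)]


def mix_bands(actual: List[float], synthesized: List[float],
              cutoff_hz: float, sample_rate: int) -> List[float]:
    """Combine the high band of the actual speech with the low band of the synthesized speech."""
    if len(actual) != len(synthesized):
        raise ValueError("signals must have the same length")
    high = high_pass(actual, cutoff_hz, sample_rate)
    low = low_pass(synthesized, cutoff_hz, sample_rate)
    return [h + l for h, l in zip(high, low)]


# Demo with synthetic tones standing in for speech: the 100 Hz component comes
# from the "synthesized" signal, the 2 kHz component from the "actual" signal.
sample_rate = 8000
synthesized = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(400)]
actual = [0.3 * math.sin(2 * math.pi * 2000 * n / sample_rate) for n in range(400)]
output = mix_bands(actual, synthesized, cutoff_hz=500.0, sample_rate=sample_rate)
print(len(output), "samples mixed")
```

Because the two filters are complementary, mixing a signal with itself returns the original signal, which makes the split easy to check. A production system would more likely use sharper, properly designed filters.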

Current Assignee: Nippon Telegraph and Telephone Corporation
Original Assignee: Nippon Telegraph and Telephone Corporation
Inventors: Abe, Masanobu
Primary Examiner(s): Hudspeth, David R.
Assistant Examiner(s): Opsasnick, Michael N.
Application Number: US08/933,140
Time in Patent Office: 698 Days
Field of Search: 704/258, 704/259, 704/267, 704/268, 704/260
US Class Current: 704/260
CPC Class Codes: G10L 13/10 Prosody rules derived from ...
