Speech synthesis system and method

US 20060224391A1
Filed: 09/23/2005
Published: 10/05/2006
Est. Priority Date: 03/29/2005
Status: Active Grant

First Claim

1. A speech synthesis system for generating synthesized speech by segmenting a phonetic sequence derived from an input text by predetermined synthesis units, and by concatenation of representative speech units each of which is extracted from respective one of the synthesis units, the system comprising:

a storage unit configured to store a plurality of speech units corresponding to the synthesis units;

a selector configured to select, with respect to each of the synthesis units of the phonetic sequence derived from the input text, a plurality of speech units from those stored in the storage unit based on a level of distortion of the synthesized speech;

a representative speech generator configured to generate the representative speech unit corresponding to the synthesis unit by calculating a statistics of power information from the speech units, and by correcting the power information based on the statistics of the power information to increase the synthesized speech in sound quality; and

a speech waveform generator configured to generate a speech waveform by concatenating the generated representative speech units.

Abstract

A speech synthesis system in a preferred embodiment includes a speech unit storage section, a phonetic environment storage section, a phonetic sequence/prosodic information input section, a plural-speech-unit selection section, a fused-speech-unit sequence generation section, and a fused-speech-unit modification/concatenation section. The fused-speech-unit sequence generation section generates a fused speech unit by fusing a plurality of selected speech units. In that section, the average power information of the M selected speech units is calculated, N speech units are fused together, and the power information of the fused speech unit is corrected to equal the average power information of the M speech units.
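The fuse-then-correct step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: "fusion" is assumed here to be a plain sample-wise average of equal-length waveforms, power is taken as the mean square value, and the function names `mean_square_power`, `fuse_units`, and `fuse_and_correct` are hypothetical.

```python
import numpy as np


def mean_square_power(unit):
    """Mean square value of a waveform, used here as the power measure."""
    return float(np.mean(np.square(unit)))


def fuse_units(units):
    """Toy fusion (an assumption): sample-wise average of equal-length waveforms."""
    return np.mean(np.stack(units), axis=0)


def fuse_and_correct(n_units, m_units):
    """Fuse the N units, then scale the result to the average power of the M units."""
    target = float(np.mean([mean_square_power(u) for u in m_units]))
    fused = fuse_units(n_units)
    fused_power = mean_square_power(fused)
    if fused_power == 0.0:
        return fused  # silent waveform: no gain can change its power
    # Power grows with the square of amplitude, so the gain is a square root.
    gain = np.sqrt(target / fused_power)
    return fused * gain


if __name__ == "__main__":
    t = np.linspace(0.0, 1.0, 200, endpoint=False)
    params = [(1.0, 0.0), (0.8, 0.6), (1.2, 1.2), (0.9, 1.8)]  # (amplitude, phase)
    m_units = [a * np.sin(2 * np.pi * 5 * t + p) for a, p in params]
    rep = fuse_and_correct(m_units[:2], m_units)
    print(mean_square_power(rep))
```

Passing the N units and the M units as separate arguments avoids assuming any particular relationship between the two sets beyond what the caller chooses.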

15 Claims

1. A speech synthesis system for generating synthesized speech by segmenting a phonetic sequence derived from an input text by predetermined synthesis units, and by concatenation of representative speech units each of which is extracted from respective one of the synthesis units, the system comprising:
- a storage unit configured to store a plurality of speech units corresponding to the synthesis units;
  
  a selector configured to select, with respect to each of the synthesis units of the phonetic sequence derived from the input text, a plurality of speech units from those stored in the storage unit based on a level of distortion of the synthesized speech;
  
  a representative speech generator configured to generate the representative speech unit corresponding to the synthesis unit by calculating a statistics of power information from the speech units, and by correcting the power information based on the statistics of the power information to increase the synthesized speech in sound quality; and
  
  a speech waveform generator configured to generate a speech waveform by concatenating the generated representative speech units.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
  - 2. The speech synthesis system according to claim 1, wherein the selector selects N speech units and M (N ≤ M) speech units, respectively, and the representative speech generator calculates an average value of power information from the selected M speech units, generates a fused speech unit by fusing the selected N speech units, and generates the representative speech unit by correcting power information of the fused speech unit to be equalized with the average value of the power information calculated from the M speech units.
  - 3. The speech synthesis system according to claim 1, wherein the selector selects N speech units and M (N ≤ M) speech units, respectively, and the representative speech generator calculates an average value of power information from the selected M speech units, corrects power information for each of the selected N speech units to be equalized with the average value of the power information, and generates the representative speech unit by fusing the corrected N speech units.
  - 4. The speech synthesis system according to claim 1, wherein the selector selects N speech units and M (N ≤ M) speech units, respectively, and the representative speech generator calculates an average value of power information of the selected M speech units, calculates an average value of power information of the selected N speech units, calculates a correction value for the use of correcting the average value of the power information of the N speech units to be equalized with the average value of the power information of the M speech units, corrects each of the N speech units by applying the correction value, and generates the representative speech unit by fusing the corrected N speech units.
  - 5. The speech synthesis system according to claim 1, wherein the selector selects N speech units and M (N ≤ M) speech units, respectively, and the representative speech generator calculates a statistics of power information from the selected M speech units, calculates power information for each of the selected N speech units, determines a weight for each of the N speech units based on the calculated statistics of the power information and the power information of the N speech units, and generates the representative speech unit by fusing the N speech units based on the weight.
  - 6. The speech synthesis system according to claim 1, wherein the selector selects N speech units and M (N ≤ M) speech units, respectively, and the representative speech generator from a statistics of power information of the selected M speech units, calculates a section in which a distribution of the power information is of a predetermined probability or more, or a section in which the power information is appropriate, calculates power information for each of the selected N speech units, when the power information of any of the N speech units is not fitting in the section, removes the speech unit not to be selected, and generates the representative speech unit by fusing the removed speech unit.
  - 7. The speech synthesis system according to claim 1, wherein the selector selects M speech units, and an optimum speech unit showing less distortion in the synthesized speech, and the representative speech unit generator calculates an average value of power information from the selected M speech units, and generates the representative speech unit by correcting power information of the optimum speech unit to be equalized with the average value of the power information.
  - 8. The speech synthesis system according to claim 1, wherein the selector selects M speech units, and the representative speech generator from a statistics of power information of the selected M speech units, calculates a section in which a distribution of the power information is of a predetermined probability or more, or a section in which the power information is appropriate, and generates the representative speech unit by selecting an optimum speech unit showing less distortion of the synthesized speech from the speech units having the power information fitting in the section of the power information.
  - 9. The speech synthesis system according to claim 2, wherein only when the power information of the fused speech unit derived by fusing the N speech units is larger than the average value of the power information calculated from the M speech units, the fused speech unit is corrected to equalize the power information of the fused speech unit with the average value of the power information.
  - 10. The speech synthesis system according to claim 3, wherein only when the power information of each of the N speech units is larger than the average value of the power information calculated from the M speech units, the speech units are corrected to equalize the power information of each of the N speech units with the average value of the power information.
  - 11. The speech synthesis system according to claim 4, wherein only when the average value of the N speech units is larger than the average value of the power information of the M speech units, a correction value is calculated to make a correction to equalize the average value of the power information of the N speech units with the power information of the M speech units, and the correction value is applied to the N speech units.
  - 12. The speech synthesis system according to claim 7, wherein only when the power information of the selected optimum speech unit is larger than the average value of the power information calculated from the M speech units, the power information of the optimum speech unit is corrected.
  - 13. The speech synthesis system according to at least any one of claims 1 to 12, wherein the power information is a mean square value or a mean absolute amplitude value of the speech waveform.
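A few of the dependent-claim variants can be illustrated with a short sketch: the two power measures named in claim 13, the "correct only when louder than the average" condition of claims 9 to 12, and the power "section" filter of claims 6 and 8. The claims do not define how the section is computed, so the interval mean ± k·std used below is purely an assumption, as are all function names and the parameter `k`.

```python
import numpy as np


def mean_square(unit):
    """Mean square value of the waveform (first measure in claim 13)."""
    return float(np.mean(np.square(unit)))


def mean_abs_amplitude(unit):
    """Mean absolute amplitude of the waveform (second measure in claim 13)."""
    return float(np.mean(np.abs(unit)))


def correct_if_louder(unit, target_power):
    """Scale the unit down to target_power only if its mean square exceeds it."""
    power = mean_square(unit)
    if power <= target_power:
        return unit  # quieter or equal units are left unchanged
    return unit * np.sqrt(target_power / power)


def units_within_section(units, reference_units, k=1.0):
    """Keep units whose power lies in [mean - k*std, mean + k*std].

    The statistics are taken over reference_units (the M units in the claims);
    the choice of interval is an illustrative assumption.
    """
    ref = np.array([mean_square(u) for u in reference_units])
    low = ref.mean() - k * ref.std()
    high = ref.mean() + k * ref.std()
    return [u for u in units if low <= mean_square(u) <= high]
```

Under this reading, the conditional correction only ever attenuates a unit and never amplifies it.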

14. A speech synthesis method for generating a synthesized speech by segmenting a phonetic sequence derived from an input text by predetermined synthesis units, and by concatenating representative speech units each of which is extracted from respective one of the synthesis units, the method comprising the steps of:
- storing a plurality of speech units corresponding to the synthesis unit;
  
  selecting, with respect to each of the synthesis units of the phonetic sequence derived from the input text, a plurality of speech units from the speech units stored in the storage step based on a level of distortion of the synthesized speech;
  
  generating the representative speech unit corresponding to the synthesis unit by calculating a statistics of power information from the speech units, and by correcting the power information based on the statistics of the power information to increase the synthesized speech in sound quality; and
  
  generating a speech waveform by concatenating the generated representative speech units.
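The four steps of the method claim (store, select, generate a representative unit, concatenate) might be wired together as in the toy sketch below. It is only an illustration under several assumptions: the inventory is a dictionary of `(cost, waveform)` pairs in which `cost` is a hypothetical precomputed stand-in for the "level of distortion", the representative unit is the lowest-cost candidate scaled to the average power of the M selected units (in the spirit of claim 7), and units are joined by plain concatenation with no prosody modification or smoothing.

```python
import numpy as np


def mean_square(wave):
    """Mean square value of a waveform, used as the power measure."""
    return float(np.mean(np.square(wave)))


def select_units(inventory, label, m):
    """Return the waveforms of the m candidates with the lowest cost."""
    candidates = sorted(inventory[label], key=lambda item: item[0])
    return [wave for _cost, wave in candidates[:m]]


def representative_unit(selected):
    """Scale the best (first) unit to the average power of all selected units."""
    target = float(np.mean([mean_square(w) for w in selected]))
    best = selected[0]
    power = mean_square(best)
    if power == 0.0:
        return best  # silent unit: leave as is
    return best * np.sqrt(target / power)


def synthesize(inventory, labels, m=3):
    """Build one representative unit per label and concatenate them in order."""
    reps = [representative_unit(select_units(inventory, label, m)) for label in labels]
    return np.concatenate(reps)


if __name__ == "__main__":
    inventory = {
        "a": [(0.2, np.full(4, 2.0)), (0.1, np.full(4, 1.0)), (0.9, np.full(4, 5.0))],
        "b": [(0.3, np.full(3, 1.0))],
    }
    print(synthesize(inventory, ["a", "b"], m=2))
```

Sorting with an explicit key on the cost avoids comparing NumPy arrays when two candidates have equal costs.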

15. A program for use with a computer to implement a speech synthesis method for generating a synthesized speech by segmenting a phonetic sequence derived from an input text by a predetermined synthesis unit, and by concatenation of representative speech units each of which is extracted from respective one of the synthesis units, the program implementing the functions of:
- storing a plurality of speech units corresponding to the synthesis unit;
  
  selecting, with respect to each of the synthesis units of the phonetic sequence derived from the input text, a plurality of speech units from the speech units stored in the storage function based on a level of distortion of the synthesized speech;
  
  generating the representative speech unit corresponding to the synthesis unit by calculating a statistics of power information from the speech units, and by correcting the power information based on the statistics of the power information to increase the synthesized speech in sound quality; and
  
  generating a speech waveform by concatenating the generated representative speech units.

Current Assignee
Kabushiki Kaisha Toshiba (Toshiba Corporation)
Original Assignee
Kabushiki Kaisha Toshiba (Toshiba Corporation)
Inventors
Hirabayashi, Gou; Kagoshima, Takehiko; Tamura, Masatsune

Granted Patent

US 7,630,896 B2
US Class Current

704/268
CPC Class Codes

G10L 13/07 Concatenation rules
