SPEECH SYNTHESIS APPARATUS AND METHOD

US 20080027727A1
Filed: 07/23/2007
Published: 01/31/2008
Est. Priority Date: 07/31/2006
Status: Abandoned Application

Abstract

A speech unit corpus stores a group of speech units. A selection unit divides a phoneme sequence of target speech into a plurality of segments, and selects a combination of speech units for each segment from the speech unit corpus. An estimation unit estimates a distortion between the target speech and synthesized speech generated by fusing each speech unit of the combination for each segment. The selection unit recursively selects the combination of speech units for each segment based on the distortion. A fusion unit generates a new speech unit for each segment by fusing each speech unit of the combination selected for each segment. A concatenation unit generates synthesized speech by concatenating the new speech unit for each segment.
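
The flow in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: speech units are short lists of numbers standing in for feature vectors, fusing is plain element-wise averaging, distortion is squared Euclidean distance, and the names `fuse`, `distortion`, and `synthesize` are hypothetical. The application does not prescribe these choices.

```python
from itertools import combinations

def fuse(units):
    """Fuse units by element-wise averaging (assumed stand-in for waveform fusion)."""
    n = len(units)
    return [sum(values) / n for values in zip(*units)]

def distortion(target, fused):
    """Squared Euclidean distance between target and fused vectors (assumed metric)."""
    return sum((t - f) ** 2 for t, f in zip(target, fused))

def synthesize(targets, corpus, combo_size=2):
    """Per segment: pick the combination whose fused unit is closest to the target,
    then concatenate the fused units in segment order."""
    output = []
    for phoneme, target in targets:
        candidates = corpus[phoneme]
        best = min(
            combinations(candidates, combo_size),
            key=lambda combo: distortion(target, fuse(combo)),
        )
        output.extend(fuse(best))  # concatenation of the new unit for this segment
    return output

# Toy corpus: phoneme label -> candidate speech units (made-up numbers).
corpus = {
    "a": [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],
    "k": [[0.0, 2.0], [0.0, 4.0], [8.0, 8.0]],
}
# Target speech: one (phoneme, target feature vector) pair per segment.
targets = [("k", [0.0, 3.0]), ("a", [2.0, 2.0])]
speech = synthesize(targets, corpus)
print(speech)
```

This sketch searches every fixed-size combination exhaustively, which is only practical for tiny candidate sets; the claims describe a recursive selection driven by the estimated distortion instead.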

20 Claims

1. An apparatus for synthesizing speech, comprising:
- a speech unit corpus configured to store a group of speech units;
  
  a selection unit configured to divide a phoneme sequence of target speech into a plurality of segments, and to select a combination of speech units for each segment from the speech unit corpus;
  
  an estimation unit configured to estimate a distortion between the target speech and synthesized speech generated by fusing each speech unit of the combination for each segment;
  
  wherein the selection unit recursively selects the combination of speech units for each segment based on the distortion;
  
  a fusion unit configured to generate a new speech unit for each segment by fusing each speech unit of the combination selected for each segment; and
  
  a concatenation unit configured to generate synthesized speech by concatenating the new speech unit for each segment.
- - 2. The apparatus according to claim 1, further comprising a speech unit environment corpus configured to store environment information corresponding to each speech unit of the group stored in the speech unit corpus.
  - 3. The apparatus according to claim 2, wherein the environment information includes a unit number, a phoneme, adjacent phonemes in front and rear of the phoneme, a fundamental frequency, a phoneme segmental duration, and a cepstrum coefficient of a start point and an end point of a speech waveform.
  - 4. The apparatus according to claim 3, wherein the speech unit corpus stores the speech waveform corresponding to the unit number.
  - 5. The apparatus according to claim 1, further comprising a phoneme sequence/prosodic information input unit configured to input the phoneme sequence and a prosodic information of the target speech.
  - 6. The apparatus according to claim 1, wherein the selection unit recursively changes the number of speech units of the combination for each segment based on the distortion.
  - 7. The apparatus according to claim 2, wherein the estimation unit extracts the environment information of each speech unit of the combination from the speech unit environment corpus, estimates a phoneme/prosodic environment of the new speech unit based on the environment information extracted, and estimates the distortion based on the phoneme/prosodic environment.
  - 8. The apparatus according to claim 1, wherein the selection unit selects a plurality of combinations of speech units for each segment, and wherein the estimation unit respectively estimates the distortion for each of the plurality of combinations.
  - 9. The apparatus according to claim 8, wherein the selection unit selects one combination of speech units for each segment from the plurality of combinations, the one combination having the minimum distortion among all distortions of the plurality of combinations.
  - 10. The apparatus according to claim 9, wherein the selection unit differently adds at least one speech unit not included in the one combination to the one combination, and selects a plurality of new combinations of speech units for each segment, each of the plurality of new combinations being differently an addition result of the at least one speech unit and the one combination.
  - 11. The apparatus according to claim 10, wherein the estimation unit respectively estimates the distortion for each of the plurality of new combinations, and wherein the selection unit selects one new combination of speech units for each segment from the plurality of new combinations, the one new combination having the minimum distortion among all distortions of the plurality of new combinations.
  - 12. The method according to claim 11, wherein the selection unit recursively selects a plurality of new combinations of speech units for each segment plural times.
  - 13. The method according to claim 4, wherein the fusion unit extracts the speech waveform of each speech unit of the combination of the same segment from the speech unit corpus, equalizes the number of speech waveforms of each speech unit, and fuses the speech waveform equalized of each speech unit.
  - 14. The method according to claim 1, wherein the estimation unit optimally determines a weight between two speech units to minimize the distortion by fusing each speech unit of the combination, and wherein the fusion unit fuses each speech unit of the combination based on the weight.
  - 15. The method according to claim 14, wherein the estimation unit repeatedly determines the weight until the distortion converges as the minimum.
  - 16. The method according to claim 1, wherein the estimation unit estimates the distortion based on a first cost and a second cost, wherein the first cost represents a distortion between the target speech and a synthesized speech generated using the new speech unit of each segment, and wherein the second cost represents a distortion caused by concatenation between the new speech unit of the segment and another new speech unit of another segment adjacent to the segment.
  - 17. The method according to claim 16, wherein the first cost is calculated using at least one of a fundamental frequency, a phoneme segmental duration, a power, a phoneme environment, and a spectral.
  - 18. The method according to claim 16, wherein the second cost is calculated using at least one of a spectral, a fundamental frequency, and a power.
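
One possible reading of claims 8 to 12 and 16 is a greedy loop for a single segment: start from the best single unit, try adding each unused unit, keep the addition that gives the lowest distortion, and repeat. The sketch below is a minimal illustration under assumptions that are not taken from the application: units are plain number lists, fusing is averaging, the first cost is squared distance to the target, the second cost is the squared gap between the fused unit's first value and the last value of the adjacent unit on the left, and the loop stops when no addition lowers the distortion. The names `total_cost` and `grow_combination` are hypothetical.

```python
def fuse(units):
    """Element-wise average of the units (assumed stand-in for fusion)."""
    n = len(units)
    return [sum(values) / n for values in zip(*units)]

def total_cost(fused, target, left_neighbor, concat_weight=1.0):
    """First cost (target mismatch) plus weighted second cost (boundary mismatch)."""
    first = sum((f - t) ** 2 for f, t in zip(fused, target))
    if left_neighbor is None:
        second = 0.0
    else:
        second = (fused[0] - left_neighbor[-1]) ** 2
    return first + concat_weight * second

def grow_combination(candidates, target, left_neighbor=None, max_units=4):
    """Greedily add the unit that lowers the distortion most; stop when none helps."""
    remaining = list(range(len(candidates)))
    chosen = []
    best_cost = float("inf")
    while remaining and len(chosen) < max_units:
        # Estimate the distortion of every new combination (current one plus one unit).
        trial_costs = {
            i: total_cost(
                fuse([candidates[j] for j in chosen + [i]]), target, left_neighbor
            )
            for i in remaining
        }
        i_best = min(trial_costs, key=trial_costs.get)
        if trial_costs[i_best] >= best_cost:
            break  # assumed stopping rule: no addition improves the distortion
        chosen.append(i_best)
        remaining.remove(i_best)
        best_cost = trial_costs[i_best]
    return chosen, best_cost

candidates = [[1.0, 1.0], [3.0, 3.0], [2.0, 5.0], [9.0, 9.0]]
chosen, cost = grow_combination(
    candidates, target=[2.0, 2.0], left_neighbor=[0.0, 2.0]
)
print(chosen, cost)
```

Because the loop can stop at any size up to `max_units`, the number of units in the combination varies with the distortion, which is roughly what claim 6 describes.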

19. A method for synthesizing speech, comprising:
- storing a group of speech units;
  
  dividing a phoneme sequence of target speech into a plurality of segments;
  
  selecting a combination of speech units for each segment from the group of speech units;
  
  estimating a distortion between the target speech and synthesized speech generated by fusing each speech unit of the combination for each segment;
  
  recursively selecting the combination of speech units for each segment based on the distortion;
  
  generating a new speech unit for each segment by fusing each speech unit of the combination selected for each segment; and
  
  generating synthesized speech by concatenating the new speech unit for each segment.
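
For the fusing step, claims 14 and 15 mention a weight between two speech units that is re-determined until the distortion converges. A minimal sketch of that idea follows, assuming two candidate units per segment, a fused unit of the form `w * a + (1 - w) * b`, a distortion made of a squared target mismatch plus a squared boundary gap between adjacent segments, and coordinate descent as the update rule. All of these choices and the names `blend`, `best_weight`, and `optimize_weights` are illustrative assumptions, not the procedure given in the application.

```python
def blend(a, b, w):
    """Weighted fusion of two units: w * a + (1 - w) * b."""
    return [w * x + (1.0 - w) * y for x, y in zip(a, b)]

def total_distortion(pairs, targets, weights, lam=1.0):
    """Target mismatch over all segments plus lam times the boundary mismatch."""
    fused = [blend(a, b, w) for (a, b), w in zip(pairs, weights)]
    target_cost = sum(
        (f - t) ** 2 for fu, tg in zip(fused, targets) for f, t in zip(fu, tg)
    )
    concat_cost = sum(
        (fused[s][0] - fused[s - 1][-1]) ** 2 for s in range(1, len(fused))
    )
    return target_cost + lam * concat_cost

def best_weight(s, pairs, targets, weights, lam):
    """Minimize the distortion over weights[s] alone (a 1-D quadratic), clipped to [0, 1]."""
    a, b = pairs[s]
    d = [x - y for x, y in zip(a, b)]
    num = sum(di * (ti - bi) for di, ti, bi in zip(d, targets[s], b))
    den = sum(di * di for di in d)
    if s > 0:  # boundary with the segment on the left
        left_end = blend(*pairs[s - 1], weights[s - 1])[-1]
        num += lam * d[0] * (left_end - b[0])
        den += lam * d[0] ** 2
    if s < len(pairs) - 1:  # boundary with the segment on the right
        right_start = blend(*pairs[s + 1], weights[s + 1])[0]
        num += lam * d[-1] * (right_start - b[-1])
        den += lam * d[-1] ** 2
    if den == 0.0:
        return weights[s]  # identical units: the weight has no effect
    return min(1.0, max(0.0, num / den))

def optimize_weights(pairs, targets, lam=1.0, tol=1e-9, max_iter=100):
    """Update each weight in turn until the total distortion stops decreasing."""
    weights = [0.5] * len(pairs)
    current = total_distortion(pairs, targets, weights, lam)
    for _ in range(max_iter):
        previous = current
        for s in range(len(pairs)):
            weights[s] = best_weight(s, pairs, targets, weights, lam)
        current = total_distortion(pairs, targets, weights, lam)
        if previous - current < tol:
            break
    return weights, current

pairs = [([1.0, 2.0], [3.0, 4.0]), ([2.0, 1.0], [6.0, 5.0])]
targets = [[2.0, 3.0], [3.0, 2.0]]
weights, final = optimize_weights(pairs, targets)
print(weights, final)
```

The iteration is needed here only because the boundary term couples the weights of adjacent segments; with a target cost alone, each weight would have a closed-form minimum.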

20. A computer program product, comprising:
- a computer readable program code embodied in said product for causing a computer to synthesize speech, said computer readable program code comprising:
  
  a first program code to store a group of speech units;
  
  a second program code to divide a phoneme sequence of target speech into a plurality of segments;
  
  a third program code to select a combination of speech units for each segment from the group of speech units;
  
  a fourth program code to estimate a distortion between the target speech and synthesized speech generated by fusing each speech unit of the combination for each segment;
  
  a fifth program code to recursively select the combination of speech units for each segment based on the distortion;
  
  a sixth program code to generate a new speech unit for each segment by fusing each speech unit of the combination selected for each segment; and
  
  a seventh program code to generate synthesized speech by concatenating the new speech unit for each segment.

Current Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Original Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Inventors: Masahiro Morita, Takehiko Kagoshima
Application Number: US11/781,424
Publication Number: US 20080027727A1
US Class Current: 704/267
CPC Class Codes: G10L 13/06 Elementary speech units use...
