Speech synthesis method, speech synthesis system, and speech synthesis program
First Claim
1. A speech synthesis method, comprising:
storing a group of speech units and prosodic information corresponding to each of the speech units of the group in a memory;
segmenting a phoneme string of a target speech to obtain a plurality of segments;
selecting, from the group in the memory, a speech unit for each of the segments based on prosodic information of the target speech to obtain an optimal speech unit sequence including speech units selected for the respective segments;
selecting M (M represents a positive integer greater than one) speech units for each of the segments from the group in the memory, based on the optimal speech unit sequence; and
generating a new speech unit corresponding to each of the segments, by fusing the M speech units selected for each of the segments, to obtain a plurality of new speech units corresponding to the segments respectively;
wherein the selecting the M speech units for each of the segments includes:
setting each of the segments as a target segment;
calculating a first cost for each speech unit of the group in the memory, the first cost representing a difference between the target segment in the target speech and the speech unit of the group;
calculating a second cost for each speech unit of the group in the memory, the second cost representing a degree of distortion produced when the speech unit of the group is concatenated with speech units before and after the target segment in the optimal speech unit sequence; and
selecting the M speech units for the target segment based on the first cost and the second cost of each speech unit of the group.
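The selection of M speech units recited above can be illustrated with a minimal sketch. It is not the patented implementation: the `Unit` and `Target` types, the scalar `start_spec`/`end_spec` boundary features, the absolute-difference cost formulas, the restriction of candidates to units with a matching phoneme label, and the ranking by the plain sum of the two costs are all assumptions made for illustration. The claim itself only requires that the M units be selected based on the first cost and the second cost.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Unit:
    """A stored speech unit with hypothetical prosodic and boundary features."""
    phoneme: str
    pitch: float        # assumed: mean fundamental frequency
    duration: float     # assumed: length in seconds
    start_spec: float   # assumed: scalar spectral feature at the unit start
    end_spec: float     # assumed: scalar spectral feature at the unit end


@dataclass(frozen=True)
class Target:
    """Prosodic information of the target speech for one segment."""
    phoneme: str
    pitch: float
    duration: float


def first_cost(unit: Unit, target: Target) -> float:
    # Difference between the target segment and the stored unit (assumed form).
    return abs(unit.pitch - target.pitch) + abs(unit.duration - target.duration)


def second_cost(unit: Unit, prev_unit: Optional[Unit], next_unit: Optional[Unit]) -> float:
    # Distortion when the unit is joined to its neighbours in the
    # optimal speech unit sequence (assumed form: boundary mismatch).
    cost = 0.0
    if prev_unit is not None:
        cost += abs(prev_unit.end_spec - unit.start_spec)
    if next_unit is not None:
        cost += abs(unit.end_spec - next_unit.start_spec)
    return cost


def select_m_units(inventory: List[Unit], targets: List[Target],
                   optimal: List[Unit], m: int) -> List[List[Unit]]:
    """For each segment, pick the m units with the lowest combined cost."""
    selected = []
    for i, target in enumerate(targets):
        prev_unit = optimal[i - 1] if i > 0 else None
        next_unit = optimal[i + 1] if i + 1 < len(optimal) else None
        # Assumption: only units with the same phoneme label are candidates.
        candidates = [u for u in inventory if u.phoneme == target.phoneme]
        ranked = sorted(
            candidates,
            key=lambda u: first_cost(u, target) + second_cost(u, prev_unit, next_unit),
        )
        selected.append(ranked[:m])
    return selected
```

Note that the neighbours used for the second cost are taken from the optimal speech unit sequence, which stays fixed while each segment in turn is treated as the target segment.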
Abstract
A speech synthesis system stores a group of speech units in a memory and segments a phoneme string of target speech into a plurality of segments. For each of the segments, the system selects a plurality of speech units from the group based on prosodic information of the target speech, so as to minimize distortion, relative to the target speech, of synthetic speech generated from the selected speech units. The system generates a new speech unit corresponding to each of the segments by fusing the speech units selected for that segment, thereby obtaining a plurality of new speech units corresponding to the segments respectively, and generates synthetic speech by concatenating the new speech units.
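The fusing and concatenating steps of the abstract can be sketched as follows. This is a simplified illustration only: representing each speech unit as a list of samples, requiring the units of a segment to have equal length, and fusing by element-wise averaging are assumptions, and the function names `fuse` and `synthesize` are hypothetical. The abstract does not state how the fusion is performed.

```python
from typing import List, Sequence


def fuse(units: Sequence[Sequence[float]]) -> List[float]:
    """Fuse the units selected for one segment into one new unit.

    Assumption: all units have the same length and fusion is a plain
    element-wise average of their samples.
    """
    if not units:
        raise ValueError("at least one speech unit is required")
    length = len(units[0])
    if any(len(u) != length for u in units):
        raise ValueError("units must have equal length in this sketch")
    return [sum(samples) / len(units) for samples in zip(*units)]


def synthesize(selected_per_segment: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Fuse the units of each segment, then concatenate the new units in order."""
    speech: List[float] = []
    for units in selected_per_segment:
        speech.extend(fuse(units))
    return speech
```

A practical system would typically align the units in time before combining them and smooth the joins between segments; both are omitted here for brevity.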
5 Claims
1. A speech synthesis method, as recited in full under First Claim above. Dependent claims: 2, 3, 4.
5. A speech synthesis system comprising:
a memory to store a group of speech units and prosodic information corresponding to each of the speech units of the group;
a first selecting unit configured to select, from the group in the memory, a speech unit for each of segments which are obtained by segmenting a phoneme string of a target speech, based on prosodic information of the target speech, to obtain an optimal speech unit sequence including speech units selected for the respective segments;
a second selecting unit configured to select, based on the optimal speech unit sequence, M (M represents a positive integer greater than one) speech units for each segment of the segments from the group in the memory; and
a generating unit configured to generate a new speech unit corresponding to each of the segments, by fusing the M speech units selected for the segment, to obtain a plurality of new speech units corresponding to the segments respectively;
wherein the second selecting unit is configured to:
set each segment of the segments as a target segment;
calculate a first cost for each speech unit of the group in the memory, the first cost representing a difference between the target segment in the target speech and the speech unit of the group;
calculate a second cost for each speech unit of the group in the memory, the second cost representing a degree of distortion produced when the speech unit of the group is concatenated with speech units before and after the target segment in the optimal speech unit sequence; and
select the M speech units for the target segment based on the first cost and the second cost of each speech unit of the group.
Specification