SPEECH SYNTHESIS SYSTEM AND SPEECH SYNTHESIS METHOD
First Claim
1. A speech synthesis system comprising:
a dividing unit configured to divide a phoneme string corresponding to target speech into a plurality of segments to generate a first segment sequence;
a selecting unit configured to generate a plurality of first speech unit strings corresponding to the first segment sequence by combining a plurality of speech units based on the first segment sequence and select one speech unit string from said plurality of first speech unit strings; and
a concatenation unit configured to concatenate a plurality of speech units included in the selected speech unit string to generate synthetic speech,
the selecting unit including
a searching unit configured to perform repeatedly a first processing and a second processing, the first processing generating, based on maximum W (W is a predetermined value) second speech unit strings corresponding to a second segment sequence as a partial sequence of the first segment sequence, a plurality of third speech unit strings corresponding to a third segment sequence as a partial sequence obtained by adding a segment to the second segment sequence, and the second processing selecting maximum W third speech unit strings from said plurality of third speech unit strings,
a first calculation unit configured to calculate a total cost of each of said plurality of third speech unit strings,
a second calculation unit configured to calculate a penalty coefficient corresponding to the total cost for each of said plurality of third speech unit strings based on a restriction concerning quickness of speech unit data acquisition, wherein the penalty coefficient depending on extent in which the restriction is approached, and
a third calculation unit configured to calculate a evaluation value of each of said plurality of third speech unit strings by correcting the total cost with the penalty coefficient,
wherein the searching unit selects the maximum W third speech unit strings from said plurality of third speech unit strings based on the evaluation value of each of said plurality of third speech unit strings.
Abstract
In speech synthesis, a selecting unit selects one string from first speech unit strings corresponding to a first segment sequence obtained by dividing a phoneme string corresponding to target speech into segments. The selecting unit repeatedly (i) generates, based on at most W second speech unit strings corresponding to a second segment sequence that is a partial sequence of the first sequence, third speech unit strings corresponding to a third segment sequence obtained by adding a segment to the second sequence, and (ii) selects at most W strings from the third strings based on an evaluation value of each of the third strings. The evaluation value is obtained by correcting a total cost of each third string with a penalty coefficient for that string. The coefficient is based on a restriction concerning quickness of speech unit data acquisition, and depends on the extent to which the restriction is approached.
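The search described in the abstract can be read as a beam search of width W in which each partial string is ranked by its penalized cost. The following Python sketch is illustrative only and is not taken from the patent. It assumes that the restriction is an upper limit (`max_slow`) on the number of units read from slow storage, that the penalty coefficient grows quadratically with the fraction of that limit already used, and that the correction is a multiplication. The names `Unit`, `slow`, `alpha`, `max_slow`, and the toy join cost are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    target_cost: float  # mismatch against the target segment
    pitch: float        # used only by the toy join cost below
    slow: bool          # hypothetical: True if the unit is read from slow storage


def total_cost(string):
    """Sum of target costs plus a toy join cost between neighbouring units."""
    target = sum(u.target_cost for u in string)
    join = sum(abs(a.pitch - b.pitch) * 0.01 for a, b in zip(string, string[1:]))
    return target + join


def penalty_coefficient(string, max_slow, alpha=0.5):
    """Assumed form: 1.0 when no slow units are used, rising as the limit is approached."""
    slow_used = sum(u.slow for u in string)
    if slow_used > max_slow:
        return math.inf  # restriction violated
    if max_slow == 0:
        return 1.0
    return 1.0 + alpha * (slow_used / max_slow) ** 2


def evaluation(string, max_slow):
    """Total cost corrected by the penalty coefficient (multiplication is an assumption)."""
    p = penalty_coefficient(string, max_slow)
    return math.inf if math.isinf(p) else total_cost(string) * p


def beam_search(candidates, W, max_slow):
    """candidates[k] lists the speech units available for segment k."""
    beam = [()]  # partial strings kept so far, at most W of them
    for segment_units in candidates:
        # First processing: extend every kept string by one more segment.
        expanded = [s + (u,) for s in beam for u in segment_units]
        # Second processing: keep at most W strings with the lowest evaluation value.
        expanded.sort(key=lambda s: evaluation(s, max_slow))
        beam = expanded[:W]
    return beam[0]


if __name__ == "__main__":
    cands = [
        [Unit("a_fast", 5.0, 100.0, False), Unit("a_slow", 1.0, 100.0, True)],
        [Unit("b_fast", 5.0, 100.0, False), Unit("b_slow", 1.0, 100.0, True)],
    ]
    print([u.name for u in beam_search(cands, W=2, max_slow=1)])
```

Because pruning happens at every step, a string that spends its slow-storage budget early is ranked lower than its raw cost alone would suggest, which leaves room for later segments. As with any beam search, the result is not guaranteed to be the global optimum.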
24 Claims
1. A speech synthesis system comprising:
a dividing unit configured to divide a phoneme string corresponding to target speech into a plurality of segments to generate a first segment sequence;
a selecting unit configured to generate a plurality of first speech unit strings corresponding to the first segment sequence by combining a plurality of speech units based on the first segment sequence and select one speech unit string from said plurality of first speech unit strings; and
a concatenation unit configured to concatenate a plurality of speech units included in the selected speech unit string to generate synthetic speech,
the selecting unit including
a searching unit configured to perform repeatedly a first processing and a second processing, the first processing generating, based on maximum W (W is a predetermined value) second speech unit strings corresponding to a second segment sequence as a partial sequence of the first segment sequence, a plurality of third speech unit strings corresponding to a third segment sequence as a partial sequence obtained by adding a segment to the second segment sequence, and the second processing selecting maximum W third speech unit strings from said plurality of third speech unit strings,
a first calculation unit configured to calculate a total cost of each of said plurality of third speech unit strings,
a second calculation unit configured to calculate a penalty coefficient corresponding to the total cost for each of said plurality of third speech unit strings based on a restriction concerning quickness of speech unit data acquisition, wherein the penalty coefficient depending on extent in which the restriction is approached, and
a third calculation unit configured to calculate a evaluation value of each of said plurality of third speech unit strings by correcting the total cost with the penalty coefficient,
wherein the searching unit selects the maximum W third speech unit strings from said plurality of third speech unit strings based on the evaluation value of each of said plurality of third speech unit strings.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
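The repeated first and second processing in claim 1 can be written compactly as a width-W beam recursion. The notation below is introduced here for illustration only, since the claim defines no symbol other than W. The multiplicative form of the evaluation value is an assumption: the claim requires only that the total cost be corrected with the penalty coefficient.

```latex
\begin{align*}
  \mathcal{C}_{k+1} &= \{\, s \oplus u \;:\; s \in \mathcal{B}_k,\ u \in U_{k+1} \,\}
      && \text{(first processing)} \\
  E(s) &= p(s)\, C(s)
      && \text{(evaluation value, assumed multiplicative)} \\
  \mathcal{B}_{k+1} &= \operatorname*{arg\,min}^{(W)}_{s \in \mathcal{C}_{k+1}} E(s)
      && \text{(second processing)}
\end{align*}
```

Here $\mathcal{B}_k$ is the set of at most $W$ retained speech unit strings covering the first $k$ segments, $U_{k+1}$ is the set of candidate speech units for the added segment, $s \oplus u$ appends unit $u$ to string $s$, $C(s)$ is the total cost, $p(s)$ is the penalty coefficient that depends on how closely $s$ approaches the acquisition restriction, and $\operatorname{arg\,min}^{(W)}$ denotes the $W$ strings with the lowest evaluation value (all of them if fewer than $W$ exist).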
9. A speech synthesis method comprising:
dividing a phoneme string corresponding to target speech into a plurality of segments to generate a first segment sequence;
generating a plurality of first speech unit strings corresponding to the first segment sequence by combining a plurality of speech units based on the first segment sequence and selecting one speech unit string from said plurality of first speech unit strings; and
concatenating a plurality of speech units included in the selected speech unit string to generate synthetic speech,
the generating/selecting including
performing repeatedly a first processing and a second processing, the first processing generating, based on maximum W (W is a predetermined value) second speech unit strings corresponding to a second segment sequence as a partial sequence of the first segment sequence, a plurality of third speech unit strings corresponding to a third segment sequence as a partial sequence obtained by adding a segment to the second segment sequence, and the second processing selecting maximum W third speech unit strings from said plurality of third speech unit strings,
calculating a total cost of each of said plurality of third speech unit strings,
calculating a penalty coefficient corresponding to the total cost for each of said plurality of third speech unit strings based on a restriction concerning quickness of speech unit data acquisition, wherein the penalty coefficient depending on extent in which the restriction is approached, and
calculating a evaluation value of each of said plurality of third speech unit strings by correcting the total cost with the penalty coefficient,
wherein the second processing including selecting the maximum W third speech unit strings from said plurality of third speech unit strings based on the evaluation value of each of said plurality of third speech unit strings.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A computer readable storage medium storing instructions of a computer program which when executed by a computer results in performance of steps comprising:
dividing a phoneme string corresponding to target speech into a plurality of segments to generate a first segment sequence;
generating a plurality of first speech unit strings corresponding to the first segment sequence by combining a plurality of speech units based on the first segment sequence and selecting one speech unit string from said plurality of first speech unit strings; and
concatenating a plurality of speech units included in the selected speech unit string to generate synthetic speech,
the generating/selecting including
performing repeatedly a first processing and a second processing, the first processing generating, based on maximum W (W is a predetermined value) second speech unit strings corresponding to a second segment sequence as a partial sequence of the first segment sequence, a plurality of third speech unit strings corresponding to a third segment sequence as a partial sequence obtained by adding a segment to the second segment sequence, and the second processing selecting maximum W third speech unit strings from said plurality of third speech unit strings,
calculating a total cost of each of said plurality of third speech unit strings,
calculating a penalty coefficient corresponding to the total cost for each of said plurality of third speech unit strings based on a restriction concerning quickness of speech unit data acquisition, wherein the penalty coefficient depending on extent in which the restriction is approached, and
calculating a evaluation value of each of said plurality of third speech unit strings by correcting the total cost with the penalty coefficient,
wherein the second processing including selecting the maximum W third speech unit strings from said plurality of third speech unit strings based on the evaluation value of each of said plurality of third speech unit strings.
Dependent claims: 18, 19, 20, 21, 22, 23, 24
Specification