Concatenation of speech segments by use of a speech synthesizer

US 6,366,883 B1
Filed: 02/16/1999
Issued: 04/02/2002
Est. Priority Date: 05/15/1996
Status: Expired due to Term



Abstract

In a speech synthesizer apparatus, a weighting coefficient training controller calculates acoustic distances in second acoustic feature parameters between one target phoneme from the same phoneme and the phoneme candidates other than the target phoneme based on first acoustic feature parameters and prosodic feature parameters, and determines weighting coefficient vectors for respective target phonemes defining degrees of contribution to the second acoustic feature parameters for respective phoneme candidates by executing a predetermined statistical analysis therefor. Then, a speech unit selector searches for a combination of phoneme candidates which correspond to a phoneme sequence of an input sentence and which minimizes a cost including a target cost representing approximate costs between a target phoneme and the phoneme candidates and a concatenation cost representing approximate costs between two phoneme candidates to be adjacently concatenated, and outputs index information on the searched out combination of phoneme candidates. Further, a speech synthesizer synthesizes a speech signal corresponding to the input phoneme sequence by sequentially reading out speech segments of speech waveform signals corresponding to the index information and concatenating the read speech segments of the speech waveform signals.
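The search described in the abstract, which picks one candidate per target phoneme so that the summed target and concatenation costs are minimal, can be illustrated with a dynamic-programming (Viterbi-style) search. The following is only a minimal sketch under stated assumptions: the function name `select_units`, the dictionary-based phoneme records, and the two cost callbacks are hypothetical and are not taken from the patent, which does not disclose its search procedure in this excerpt.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Unit = Dict[str, float]  # hypothetical record of per-phoneme features


def select_units(
    targets: Sequence[Unit],
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[Unit, Unit], float],
    concat_cost: Callable[[Unit, Unit], float],
) -> Tuple[List[int], float]:
    """Return (candidate index per position, total cost) of the cheapest path."""
    if not targets or len(targets) != len(candidates):
        raise ValueError("need one non-empty candidate list per target phoneme")
    if any(len(c) == 0 for c in candidates):
        raise ValueError("every target phoneme needs at least one candidate")

    # best[j]: minimal cumulative cost of a path ending in candidate j
    # at the current position.
    best = [target_cost(targets[0], c) for c in candidates[0]]
    # back[i][j]: best predecessor (at position i) of candidate j at i + 1.
    back: List[List[int]] = []

    for i in range(1, len(targets)):
        new_best: List[float] = []
        pointers: List[int] = []
        for cand in candidates[i]:
            # Cost of reaching `cand` from each candidate at position i - 1.
            costs = [
                best[k] + concat_cost(prev, cand)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[k_min] + target_cost(targets[i], cand))
            pointers.append(k_min)
        best = new_best
        back.append(pointers)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path, total
```

Any cost functions with the signatures above can be plugged in. For example, a target cost based on the F₀ difference between target and candidate, and a concatenation cost based on the F₀ jump at the join, would be one illustrative choice; the fuzzy-membership suitability functions named in claim 1 are not reproduced here.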

372 Citations

23 Claims

1. A speech synthesizer apparatus comprising:
- first storage means for storing speech segments of speech waveform signals of natural utterance;
  
  speech analyzing means, based on the speech segments of the speech waveform signals stored in said first storage means and a phoneme sequence corresponding to the speech waveform signals, for extracting and outputting index information on each phoneme of the speech waveform signals, first acoustic feature parameters of each phoneme indicated by the index information, and prosodic feature parameters for each phoneme indicated by the index information;
  
  second storage means for storing the index information, the first acoustic feature parameters, and the prosodic feature parameters outputted from said speech analyzing means;
  
  weighting coefficient training means for calculating acoustic distances in second acoustic feature parameters between one target phoneme from the same phonemic kind and the phoneme candidates other than the target phoneme based on the first acoustic feature parameters and the prosodic feature parameters which are stored in said second storage means, and for determining weighting coefficient vectors for respective target phonemes defining degrees of contribution to the second acoustic feature parameters for respective phoneme candidates by executing a predetermined statistical analysis for each of the second acoustic feature parameters for respective phoneme candidates based on the calculated acoustic distances;
  
  third storage means for storing weighting coefficient vectors for the respective target phonemes determined by the weighting coefficient training means;
  
  speech unit selecting means, based on the weighting coefficient vectors for the respective target phonemes stored in said third storage means, and the prosodic feature parameters stored in said second storage means, for searching for a combination of phoneme candidates which correspond to a phoneme sequence of an input sentence and which minimizes a cost including a target cost representing approximate costs between a target phoneme and the phoneme candidates and a concatenation cost representing approximate costs between two phoneme candidates to be adjacently concatenated, and for outputting index information on the searched out combination of phoneme candidates, said target cost being represented by either one of a predetermined non-linear multiplication and a predetermined non-linear combination, with use of predetermined suitability functions each of fuzzy membership function, said concatenation cost being represented by either one of another predetermined non-linear multiplication and another predetermined non-linear combination with use of another predetermined suitability functions each of fuzzy membership function; and
  
  speech synthesizing means for synthesizing and outputting a speech signal corresponding to the input phoneme sequence by sequentially reading out speech segments of speech waveform signals corresponding to the index information from said first storage means based on the index information outputted from said unit selecting means, and by concatenating the read-out speech segments of the speech waveform signals.
- 2. The speech synthesizer apparatus as claimed in claim 1,
- 3. The speech synthesizer apparatus as claimed in claim 1, wherein said weighting coefficient training means determines the weighting coefficient vectors for the respective target phonemes representing the degrees of contribution to the second acoustic feature parameters for the respective phoneme candidates, by extracting a plurality of best top N1 phoneme candidates based on the calculated acoustic distances, and by executing a linear regression analysis for each of the second acoustic feature parameters.
- 4. The speech synthesizer apparatus as claimed in claim 2, wherein said weighting coefficient training means determines the weighting coefficient vectors for the respective target phonemes representing the degrees of contribution to the second acoustic feature parameters for the respective phoneme candidates, by extracting a plurality of best top N1 phoneme candidates based on the calculated acoustic distances, and by executing a linear regression analysis for each of the second acoustic feature parameters.
- 5. The speech synthesizer apparatus as claimed in claim 1, wherein said weighting coefficient training means determines the weighting coefficient vectors for the respective target phonemes representing the degrees of contribution to the second acoustic feature parameters for the respective phoneme candidates, by extracting a plurality of best top N1 phoneme candidates based on the calculated acoustic distances, and by executing a statistical analysis using a predetermined neural network for each of the second acoustic feature parameters.
- 6. The speech synthesizer apparatus as claimed in claim 2, wherein said weighting coefficient training means determines the weighting coefficient vectors for the respective target phonemes representing the degrees of contribution to the second acoustic feature parameters for the respective phoneme candidates, by extracting a plurality of best top N1 phoneme candidates based on the calculated acoustic distances, and by executing a statistical analysis for each of the second acoustic feature parameters.
- 7. The speech synthesizer apparatus as claimed in claim 1, wherein said speech unit selecting means extracts a plurality of top N2 phoneme candidates that are best in terms of the cost including the target cost and the concatenation cost, and thereafter, searches for a combination of phoneme candidates that minimizes the cost.
- 8. The speech synthesizer apparatus as claimed in claim 2, wherein said speech unit selecting means extracts a plurality of top N2 phoneme candidates that are best in terms of the cost including the target cost and the concatenation cost, and thereafter, searches for a combination of phoneme candidates that minimizes the cost.
- 9. The speech synthesizer apparatus as claimed in claim 3, wherein said speech unit selecting means extracts a plurality of top N2 phoneme candidates that are best in terms of the cost including the target cost and the concatenation cost, and thereafter, searches for a combination of phoneme candidates that minimizes the cost.
- 10. The speech synthesizer apparatus as claimed in claim 1, wherein the first acoustic feature parameters include cepstrum coefficients, delta cepstrum coefficients and phoneme labels.
- 11. The speech synthesizer apparatus as claimed in claim 3, wherein the first acoustic feature parameters include cepstrum coefficients, delta cepstrum coefficients and phoneme labels.
- 12. The speech synthesizer apparatus as claimed in claim 7, wherein the first acoustic feature parameters include cepstrum coefficients, delta cepstrum coefficients and phoneme labels.
- 13. The speech synthesizer apparatus as claimed in claim 1, wherein the first acoustic feature parameters include formant parameters and voice source parameters.
- 14. The speech synthesizer apparatus as claimed in claim 3, wherein the first acoustic feature parameters include formant parameters and voice source parameters.
- 15. The speech synthesizer apparatus as claimed in claim 7, wherein the first acoustic feature parameters include formant parameters and voice source parameters.
- 16. The speech synthesizer apparatus as claimed in claim 1, wherein the prosodic feature parameters include phoneme durations, speech fundamental frequencies F₀, and powers.
- 17. The speech synthesizer apparatus as claimed in claim 3, wherein the prosodic feature parameters include phoneme durations, speech fundamental frequencies F₀, and powers.
- 18. The speech synthesizer apparatus as claimed in claim 7, wherein the prosodic feature parameters include phoneme durations, speech fundamental frequencies F₀, and powers.
- 19. The speech synthesizer apparatus as claimed in claim 1, wherein the second acoustic feature parameters include cepstral distances.
- 20. The speech synthesizer apparatus as claimed in claim 3, wherein the second acoustic feature parameters include cepstral distances.
- 21. The speech synthesizer apparatus as claimed in claim 7, wherein the second acoustic feature parameters include cepstral distances.
- 22. The speech synthesizer apparatus as claimed in claim 1, wherein said concatenation cost C is represented by the following equation:
- 23. The speech synthesizer apparatus as claimed in claim 1, wherein said concatenation cost C is represented by the following equation:
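
Claims 3 and 4 recite extracting the best top N1 phoneme candidates by acoustic distance and then running a linear regression to obtain weighting coefficients. A minimal sketch of that idea follows, assuming (this is not stated in the claims) that each candidate is described by a vector of per-feature differences from the target and that the weights are fitted by ordinary least squares against the acoustic distance. The names `train_weights` and `solve_linear_system` are hypothetical.

```python
from typing import List, Sequence


def solve_linear_system(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; features may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def train_weights(
    feature_diffs: Sequence[Sequence[float]],
    acoustic_dists: Sequence[float],
    n1: int,
) -> List[float]:
    """Fit least-squares weights using only the n1 acoustically closest candidates.

    feature_diffs[i] holds candidate i's per-feature differences from the
    target phoneme; acoustic_dists[i] is its acoustic distance to the target.
    Both layouts are assumptions made for this illustration.
    """
    if len(feature_diffs) != len(acoustic_dists) or not feature_diffs:
        raise ValueError("need one distance per candidate")
    # Keep the n1 candidates with the smallest acoustic distance.
    order = sorted(range(len(acoustic_dists)), key=acoustic_dists.__getitem__)[:n1]
    x = [feature_diffs[i] for i in order]
    y = [acoustic_dists[i] for i in order]
    d = len(x[0])
    # Normal equations: (X^T X) w = X^T y.
    xtx = [[sum(row[p] * row[q] for row in x) for q in range(d)] for p in range(d)]
    xty = [sum(row[p] * t for row, t in zip(x, y)) for p in range(d)]
    return solve_linear_system(xtx, xty)
```

The sketch fits a model without an intercept and would be run once per target phoneme to produce that phoneme's weighting coefficient vector. A production system would more likely call a numerical library's least-squares routine than solve the normal equations by hand.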


Current Assignee: Advanced Telecommunications Research Institute International
Original Assignee: ATR Interpreting Telephony Research Laboratories
Inventors: Campbell, Nick; Hunt, Andrew
Primary Examiner(s): SMITS, TALIVALDIS IVARS
Application Number: US09/250,405
Time in Patent Office: 1,141 Days
Field of Search: 704/232, 704/258, 704/260, 704/267
US Class Current: 704/260
CPC Class Codes:
G10L 13/07   Concatenation rules
G10L 15/142   Hidden Markov Models [HMMs]
G10L 25/12   the extracted parameters be...
G10L 25/24   the extracted parameters be...
G10L 25/51   for comparison or discrimin...
