Speech synthesizing system and method for modifying prosody based on match to database
Abstract
A speech synthesis system stores in advance, in a prosodic data modifying rule means, a degree of modification of prosodic data; the degree of modification corresponds to an approximate cost and is stored as a modifying rule. A prosodic data retrieving section retrieves prosodic data stored in correspondence with key data used for retrieval, according to a degree of matching between the input data and the key data, the degree of matching being represented by the approximate cost. A modifying section modifies the retrieved prosodic data based on the degree of matching and the stored modifying rule, and an output section outputs synthesized speech based on the input data and the modified prosodic data.
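As a rough illustration of the flow described in the abstract, the following Python sketch retrieves the stored entry with the smallest approximate cost and then modifies its pitch contour by a degree looked up from a stored rule. The names `Entry`, `approximate_cost`, `MODIFYING_RULE`, the mismatch-count cost, and the contour-flattening operation are all illustrative assumptions; the text here does not specify any of them.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    key: str   # retrieval key, assumed here to be a phoneme-like string
    f0: list   # stored fundamental frequency pattern in Hz


def approximate_cost(input_key, key):
    """Assumed cost: mismatched positions plus the length difference."""
    mismatches = sum(a != b for a, b in zip(input_key, key))
    return mismatches + abs(len(input_key) - len(key))


# Assumed modifying rule: (cost upper bound, degree of modification in [0, 1]).
MODIFYING_RULE = [(0, 0.0), (2, 0.3), (5, 0.6)]
MAX_DEGREE = 1.0


def degree_of_modification(cost):
    """Look up the degree of modification that corresponds to a cost."""
    for bound, degree in MODIFYING_RULE:
        if cost <= bound:
            return degree
    return MAX_DEGREE


def retrieve(input_key, database):
    """Return the entry with the smallest cost (highest degree of matching)."""
    best = min(database, key=lambda e: approximate_cost(input_key, e.key))
    return best, approximate_cost(input_key, best.key)


def modify(f0, degree):
    """Compress the contour toward its mean; degree 0 leaves it unchanged."""
    mean = sum(f0) / len(f0)
    return [mean + (1.0 - degree) * (v - mean) for v in f0]


database = [
    Entry("konnichiwa", [120.0, 160.0, 140.0]),
    Entry("konbanwa", [110.0, 150.0, 100.0]),
]
entry, cost = retrieve("konnichiha", database)
modified = modify(entry.f0, degree_of_modification(cost))
```

In this toy setup a poorer match (higher cost) yields a flatter, more neutral contour, which is only one plausible reading of "degree of modification corresponding to the approximate cost".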
43 Claims
-
1. A speech synthesis system for generating synthesized speech based on input data representing speech to be synthesized, the system comprising:
-
a database storing prosodic data for use in synthesizing speech, the prosodic data corresponding to key data being used as a retrieval key;
means for retrieving the prosodic data according to a degree of matching between such input data and such key data;
the degree of matching represented by an approximate cost determined by a cost method, whereby a smallest approximate cost corresponds to a highest degree of said matching;
prosodic data modifying rule means for storing a degree of modification of the prosodic data corresponding to the approximate cost, the degree of modification stored as a modifying rule;
means for modifying the prosodic data retrieved by the means for retrieving based on such input data, the degree of matching between such input data and such key data, and the modifying rule stored in the prosodic data modifying rule means; and
means for synthesizing a synthesized speech based on such input data and the prosodic data modified by the means for modifying.
a fundamental frequency pattern representing a variation of a fundamental frequency with respect to time;
a voice intensity pattern representing a variation of a voice intensity with respect to time;
a phonological segment duration pattern representing a duration of a phonological segment; and
a pause data representing one of the presence or absence of a pause and the length of a pause.
-
-
18. The speech synthesis system according to claim 1, wherein the prosodic data are stored in the database such that each prosodic data forms a prosody controlling unit.
-
19. The speech synthesis system according to claim 1, further comprising a prosody controlling unit comprising one of:
-
an accent phrase;
a phrase comprising one or more accent phrases;
a bunsetsu;
a phrase comprising one or more bunsetsus;
a word;
a phrase comprising one or more words;
a stress phrase; and
a phrase comprising one or more stress phrases.
-
-
20. The speech synthesis system according to claim 1, wherein:
-
each of such input data and such key data comprises a plurality of types of speech indices each being a factor in determining a speech to be synthesized; and
the degree of matching between such input data and such key data is such that in each type of the speech indices, a degree of matching between such input data and such key data is weighted, and the weighted data are combined together.
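One way to read the weighted combination in claim 20 is as a weighted sum of per-index costs. The sketch below assumes three index types, hypothetical weights in `WEIGHTS`, and simple mismatch or absolute-difference costs per index; none of these values or function names come from the claims.

```python
# Hypothetical weights per speech-index type (assumed values).
WEIGHTS = {"phonemes": 1.0, "accent_position": 2.0, "part_of_speech": 0.5}


def index_costs(input_data, key_data):
    """Per-index mismatch costs; each cost function is an assumption."""
    a, b = input_data["phonemes"], key_data["phonemes"]
    return {
        "phonemes": sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b)),
        "accent_position": abs(
            input_data["accent_position"] - key_data["accent_position"]
        ),
        "part_of_speech": (
            0 if input_data["part_of_speech"] == key_data["part_of_speech"] else 1
        ),
    }


def combined_cost(input_data, key_data, weights=WEIGHTS):
    """Weight each index's cost and combine the weighted values by summing."""
    costs = index_costs(input_data, key_data)
    return sum(weights[name] * cost for name, cost in costs.items())


query = {"phonemes": "sakura", "accent_position": 0, "part_of_speech": "noun"}
key = {"phonemes": "sakana", "accent_position": 1, "part_of_speech": "noun"}
total = combined_cost(query, key)
```

A smaller `total` corresponds to a higher degree of matching; raising one weight makes mismatches in that index count for more during selection.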
-
-
21. The speech synthesis system according to claim 20, wherein the speech indices comprise data substantially indicating at least one of a phonological segment string of the speech to be synthesized, an accent position in the speech to be synthesized, linguistic data representing a linguistic attribute of the speech to be synthesized, and one of the length of a pause and the presence or absence of a pause in the speech to be synthesized.
-
22. The speech synthesis system according to claim 21, wherein:
-
the speech indices comprise data substantially indicating a phonological segment string of the speech to be synthesized; and
the degree of matching between the speech indices in the input data and the speech indices in the key data includes a degree of similarity of acoustic feature data between phonological segments.
-
-
23. The speech synthesis system according to claim 20, wherein the speech indices comprise a phonological segment category string representing a phonological segment category to which a phonological segment in the speech to be synthesized belongs.
-
24. The speech synthesis system according to claim 23, wherein the degree of matching between the speech indices in the input data and the speech indices in such key data comprises a degree of similarity of the phonological segment category between the phonological segments.
-
25. The speech synthesis system according to claim 20, wherein the prosodic data comprises a plurality of types of prosodic feature data characterizing the speech to be synthesized.
-
26. The speech synthesis system according to claim 25, wherein the database is for storing the plurality of types of prosodic feature data so that the plurality of types of prosodic feature data comprises a set of prosodic feature data.
-
27. The speech synthesis system according to claim 26, wherein the plurality of types of prosodic feature data are extracted from an identical actual human voice.
-
28. The speech synthesis system according to claim 25, wherein the prosodic feature data comprises at least one of:
-
a fundamental frequency pattern representing a variation of a fundamental frequency with respect to time;
a voice intensity pattern representing a variation of a voice intensity with respect to time;
a phonological segment duration pattern representing a duration of a phonological segment; and
a pause data representing one of the presence or absence of a pause and the length of a pause.
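The four kinds of prosodic feature data listed in claim 28 could be held together in one record per prosody controlling unit. The following is a minimal sketch under that assumption; the class name `ProsodicFeatureData`, the field names, and the units (Hz, dB, milliseconds) are hypothetical choices, and every field is optional because the claim requires only "at least one of" the items.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProsodicFeatureData:
    # Fundamental frequency over time, one value per frame (assumed Hz).
    f0_pattern: List[float] = field(default_factory=list)
    # Voice intensity over time, one value per frame (assumed dB).
    intensity_pattern: List[float] = field(default_factory=list)
    # Duration of each phonological segment (assumed milliseconds).
    segment_durations_ms: List[float] = field(default_factory=list)
    # Pause length; None means no pause data is stored, 0.0 means no pause.
    pause_ms: Optional[float] = None

    def total_duration_ms(self) -> float:
        """Sum of segment durations plus the pause length, if any."""
        return sum(self.segment_durations_ms) + (self.pause_ms or 0.0)


unit = ProsodicFeatureData(
    f0_pattern=[120.0, 130.0],
    segment_durations_ms=[80.0, 120.0],
    pause_ms=50.0,
)
```

Keeping the features in one record also fits the idea in claims 26 and 27 that the types form a set extracted from the same human utterance.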
-
-
29. The speech synthesis system according to claim 28, wherein the phonological segment duration pattern comprises at least one of a phoneme duration pattern, a mora duration pattern, and a syllable duration pattern.
-
30. The speech synthesis system according to claim 25, further comprising means for retrieving and modifying each of the plurality of types of prosodic feature data according to the weighted degrees of matching between the input data and the key data, the weighted degrees of matching being different from each other.
-
31. The speech synthesis system according to claim 20, wherein the means for retrieving the prosodic data and the means for modifying the prosodic data are each for using a different weighted degree of matching between the input data and the key data.
-
32. The speech synthesis system according to claim 20, wherein the means for retrieving the prosodic data and the means for modifying the prosodic data are for using an identical weighted degree of matching between the input data and the key data.
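Claims 31 and 32 distinguish between using different weights and using identical weights for the retrieval step and the modification step. A small sketch of that distinction follows; the per-index costs and both weight tables are made-up values, and `weighted_cost` is a hypothetical helper.

```python
def weighted_cost(costs, weights):
    """Combine per-index costs using the given weights."""
    return sum(weights[name] * costs[name] for name in costs)


# Per-index costs for one candidate entry (assumed already computed).
costs = {"phonemes": 2, "accent_position": 1}

# Assumed weights: accent position dominates selection,
# while phoneme mismatches dominate how strongly the result is modified.
RETRIEVAL_WEIGHTS = {"phonemes": 1.0, "accent_position": 3.0}
MODIFY_WEIGHTS = {"phonemes": 2.0, "accent_position": 0.5}

# Claim 31 style: each step uses its own weighted degree of matching.
retrieval_cost = weighted_cost(costs, RETRIEVAL_WEIGHTS)
modify_cost = weighted_cost(costs, MODIFY_WEIGHTS)

# Claim 32 style: both steps share one weighted degree of matching.
shared_cost = weighted_cost(costs, RETRIEVAL_WEIGHTS)
```

Separate weights let the system pick an entry for one reason (for example, accent agreement) and still decide how much to correct it based on another (for example, segment mismatches).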
-
33. The speech synthesis system according to claim 1, wherein the means for modifying is for modifying the prosodic data retrieved by the means for retrieving based on a degree of matching between one of:
-
each phoneme;
each mora;
each syllable;
each unit of generating a speech waveform in the means for synthesizing; and
each phonological segment.
-
-
34. The speech synthesis system according to claim 33, wherein the degree of matching is determined based on at least one of:
-
a distance based on an acoustic characteristic;
a distance obtained from one of a manner of articulation, a place of articulation, and a duration; and
a distance based on a confusion matrix obtained by an auditory experiment.
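Two of the distance types named in claim 34 can be sketched briefly. The confusion counts and articulatory feature table below are invented for illustration, as are the function names; real values would come from an auditory experiment and a phonetic inventory, and the feature table here omits voicing and duration for brevity.

```python
# Hypothetical confusion counts: CONFUSION[a][b] = times stimulus a was heard as b.
CONFUSION = {
    "p": {"p": 80, "b": 15, "s": 5},
    "b": {"p": 20, "b": 75, "s": 5},
    "s": {"p": 2, "b": 3, "s": 95},
}


def confusion_distance(a, b):
    """Distance in [0, 1]: one minus the symmetrized confusion probability."""
    if a == b:
        return 0.0
    p_ab = CONFUSION[a][b] / sum(CONFUSION[a].values())
    p_ba = CONFUSION[b][a] / sum(CONFUSION[b].values())
    return 1.0 - (p_ab + p_ba) / 2.0


# Assumed articulatory features per segment: (manner, place).
FEATURES = {
    "p": ("stop", "bilabial"),
    "b": ("stop", "bilabial"),
    "s": ("fricative", "alveolar"),
}


def articulation_distance(a, b):
    """Fraction of articulatory features that differ between two segments."""
    fa, fb = FEATURES[a], FEATURES[b]
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)
```

With these numbers, segments that listeners confuse often ("p" and "b") end up closer than segments they rarely confuse ("p" and "s").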
-
-
35. The speech synthesis system according to claim 34, wherein the acoustic characteristic is at least one characteristic of the phonological segments selected from a fundamental frequency thereof, an intensity thereof, a duration thereof, and a spectrum thereof.
-
36. The speech synthesis system according to claim 1, wherein the database is for storing key data and prosodic data of a plurality of types of languages.
-
37. A method of synthesizing speech based on input data representing speech to be synthesized, the method comprising:
-
storing in advance a degree of modification of prosodic data in a prosodic data modifying rule means, the degree of modification corresponding to an approximate cost and being stored as a modifying rule;
retrieving prosodic data from a database in which prosodic data for use in synthesizing speech is stored corresponding to key data for use in retrieval, the prosodic data retrieved according to a degree of matching between such input data and such key data, the degree of matching represented by the approximate cost determined by a cost method, whereby a smallest approximate cost corresponds to a highest degree of said matching;
modifying the retrieved prosodic data based on the degree of matching between such input data and such key data and the modifying rule stored in the prosodic data modifying rule means; and
outputting synthesized speech based on the input data and the modified prosodic data.
38. The method of synthesizing a speech according to claim 37, wherein:
each of such input data and such key data comprises a plurality of types of speech indices each being a factor in determining a speech to be synthesized; and
the degree of matching between such input data and such key data is such that in each type of the speech indices, a degree of matching between such input data and such key data is weighted, and the weighted data are combined together.
-
-
39. The method of synthesizing a speech according to claim 38, wherein the prosodic data comprises a plurality of types of prosodic feature data characterizing such input data.
-
40. The method of synthesizing a speech according to claim 39, wherein each of the plurality of types of prosodic feature data is retrieved and modified according to the weighted degrees of matching between such input data and such key data, the weighted degrees of matching being different from each other.
-
41. The method of synthesizing a speech according to claim 38, wherein the retrieving the prosodic data and the modifying the prosodic data are performed each using a different weighted degree of matching between such input data and such key data.
-
42. The method of synthesizing a speech according to claim 38, wherein the retrieving the prosodic data and the modifying the prosodic data are performed using an identical weighted degree of matching between such input data and such key data.
-
43. A speech synthesis system wherein an input text is converted into synthesized speech to be outputted, the system comprising:
-
language processing means wherein input text is parsed for outputting a phonetic character string and linguistic data;
a prosodic information database storing prosodic feature data, linguistic data, and a phonetic character string so that the prosodic feature data correspond to the linguistic data and the phonetic character string, the prosodic feature data being extracted from actual human speech, and the phonetic character string and the linguistic data corresponding to speech to be synthesized;
a retrieving means for retrieving a prosodic feature data from the prosodic feature data stored in the prosodic information database, the retrieved prosodic feature data corresponding to at least a portion of retrieval items comprising the phonetic character string and the linguistic data outputted from the language processing means;
prosodic data modifying rule means for storing a degree of modification of prosodic data corresponding to an approximate cost, the degree of modification stored as a modifying rule;
a prosody modifying means for modifying the prosodic feature data according to the modifying rule in response to a degree of matching between the retrieval item and the data stored in the prosodic information database, the prosodic feature data being retrieved and selected from the prosodic information database, the degree of matching represented by the approximate cost, the modifying rule being the degree of modification corresponding to the approximate cost; and
a waveform generating means for generating a speech waveform based on the prosodic feature data received from the prosody modifying means and the phonetic character string received from the language processing means.
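The stages recited in claim 43 can be lined up as a small pipeline. The sketch below is a toy under heavy assumptions: the "parser" just strips non-letters, the database holds two invented entries, the cost and the modifying rule are made up, and the waveform generator emits plain sine segments from the pitch values without using the phonetic string as a real synthesizer would. All function names are hypothetical.

```python
import math


def language_processing(text):
    """Toy parser: letters become the phonetic string (assumption)."""
    phonetic = "".join(ch for ch in text.lower() if ch.isalpha())
    linguistic = {"word_count": len(text.split())}
    return phonetic, linguistic


# Toy prosodic information database:
# (phonetic string, linguistic data, F0 pattern in Hz).
DATABASE = [
    ("hello", {"word_count": 1}, [120.0, 140.0]),
    ("goodmorning", {"word_count": 2}, [110.0, 150.0, 130.0]),
]


def cost(phonetic, linguistic, entry):
    """Assumed approximate cost over both retrieval items."""
    key_phonetic, key_linguistic, _ = entry
    mismatches = sum(a != b for a, b in zip(phonetic, key_phonetic))
    mismatches += abs(len(phonetic) - len(key_phonetic))
    return mismatches + abs(linguistic["word_count"] - key_linguistic["word_count"])


def retrieve(phonetic, linguistic):
    """Select the entry with the smallest cost and report that cost."""
    best = min(DATABASE, key=lambda e: cost(phonetic, linguistic, e))
    return best[2], cost(phonetic, linguistic, best)


def modify(f0, approximate_cost):
    """Assumed rule: flatten the contour more as cost grows, fully at cost >= 10."""
    degree = min(approximate_cost / 10.0, 1.0)
    mean = sum(f0) / len(f0)
    return [mean + (1.0 - degree) * (v - mean) for v in f0]


def generate_waveform(f0, sample_rate=8000, samples_per_value=800):
    """Toy generator: one sine segment per F0 value, with continuous phase."""
    samples, phase = [], 0.0
    for hz in f0:
        for _ in range(samples_per_value):
            phase += 2.0 * math.pi * hz / sample_rate
            samples.append(math.sin(phase))
    return samples


phonetic, linguistic = language_processing("Hallo")
f0, c = retrieve(phonetic, linguistic)
modified_f0 = modify(f0, c)
waveform = generate_waveform(modified_f0)
```

The point of the sketch is only the data flow: text to retrieval items, retrieval items to stored prosody plus a cost, cost to a degree of modification, and modified prosody to a waveform.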
-
Specification