Speech synthesizing system and method for modifying prosody based on match to database

US 6,823,309 B1
Filed: 11/27/2000
Issued: 11/23/2004
Est. Priority Date: 03/25/1999
Status: Expired due to Term

Abstract

A speech synthesis system stores in advance, in a prosodic data modifying rule apparatus, a degree of modification of prosodic data; the degree of modification corresponds to an approximate cost and is stored as a modifying rule. A prosodic data retrieving section retrieves prosodic data that is stored in correspondence with key data used for retrieval. The prosodic data is retrieved according to a degree of matching between the input data and the key data, and the degree of matching is represented by the approximate cost. A modifying section modifies the retrieved prosodic data based on the degree of matching and the stored modifying rule, and an output section outputs synthesized speech based on the input data and the modified prosodic data.
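The retrieve-then-modify flow described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the `Entry` layout, the position-mismatch cost in `approximate_cost`, the threshold table `MODIFYING_RULE`, and the choice to "modify" by blending the retrieved F0 pattern toward a flat neutral contour.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entry:
    key: str            # phonetic string used as the retrieval key (toy format)
    f0_hz: List[float]  # stored fundamental-frequency pattern, one value per segment


def approximate_cost(query: str, key: str) -> int:
    """Toy cost (assumption): mismatched positions plus the length difference."""
    mismatches = sum(a != b for a, b in zip(query, key))
    return mismatches + abs(len(query) - len(key))


def retrieve(query: str, database: List[Entry]) -> Tuple[Entry, int]:
    """Return the entry with the smallest cost, i.e. the highest degree of matching."""
    best = min(database, key=lambda e: approximate_cost(query, e.key))
    return best, approximate_cost(query, best.key)


# Hypothetical modifying rule: (maximum cost, degree of modification in [0, 1]).
MODIFYING_RULE = [(0, 0.0), (1, 0.3), (2, 0.6)]


def degree_of_modification(cost: int) -> float:
    """Look up the degree of modification stored for a given approximate cost."""
    for max_cost, degree in MODIFYING_RULE:
        if cost <= max_cost:
            return degree
    return 1.0  # poor match: modify fully


def modify(f0_hz: List[float], degree: float, neutral_hz: float = 120.0) -> List[float]:
    """Blend the retrieved pattern toward a flat neutral contour by `degree`."""
    return [(1 - degree) * f + degree * neutral_hz for f in f0_hz]


database = [
    Entry("kaNsai", [110.0, 140.0, 130.0, 120.0, 115.0, 100.0]),
    Entry("toukyou", [100.0, 130.0, 135.0, 130.0, 125.0, 120.0, 110.0]),
]

entry, cost = retrieve("kaNtai", database)
degree = degree_of_modification(cost)
modified = modify(entry.f0_hz, degree)
print(entry.key, cost, degree, modified)
```

The design point the sketch tries to capture is that a close match leaves the stored prosody nearly untouched, while a poor match is modified more heavily; the actual cost method and modification operations are whatever the specification defines.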

43 Claims

1. A speech synthesis system for generating synthesized speech based on input data representing speech to be synthesized, the system comprising:
- a database storing prosodic data for use in synthesizing speech, the prosodic data corresponding to key data being used as a retrieval key;
  
  means for retrieving the prosodic data according to a degree of matching between such input data and such key data;
  
  the degree of matching represented by an approximate cost determined by a cost method, whereby a smallest approximate cost corresponds to a highest degree of said matching;
  
  prosodic data modifying rule means for storing a degree of modification of the prosodic data corresponding to the approximate cost, the degree of modification stored as a modifying rule;
  
  means for modifying the prosodic data retrieved by the means for retrieving based on such input data, the degree of matching between such input data and such key data, and the modifying rule stored in the prosodic data modifying rule means; and
  
  means for synthesizing a synthesized speech based on such input data and the prosodic data modified by the means for modifying.
  - 2. The speech synthesis system according to claim 1, wherein each of such input data and such key data comprises a phonetic character string representing a phonetic attribute of the speech to be synthesized.
  - 3. The speech synthesis system according to claim 2, wherein each of such input data and such key data further comprises linguistic data representing a linguistic attribute of the speech to be synthesized.
  - 4. The speech synthesis system according to claim 3, wherein such linguistic data comprises at least one of syntactic data and semantic data of the speech to be synthesized.
  - 5. The speech synthesis system according to claim 3, further comprising a language processing means for parsing text data inputted in the speech synthesis system and producing a processed phonetic character string and processed linguistic data.
  - 6. The speech synthesis system according to claim 2, wherein the phonetic character string comprises data substantially indicating at least one of a phonological segment string of the speech to be synthesized, an accent position in the speech to be synthesized, and either one of the presence or absence and the length of a pause in the speech to be synthesized.
  - 7. The speech synthesis system according to claim 1, wherein each of such input data and such key data comprises a phonological segment category string representing a phonological segment category to which a phonological segment in the speech to be synthesized belongs.
  - 8. The speech synthesis system according to claim 7, further comprising means for converting data into the phonological segment category string, the data being at least one data of data corresponding to such input data inputted to the speech synthesis system and data corresponding to retrieval key data stored in the database.
  - 9. The speech synthesis system according to claim 7, wherein the phonological segment category is such that phonological segments are categorized by using at least one of a manner of articulation thereof, a place of articulation thereof, and a duration thereof.
  - 10. The speech synthesis system according to claim 7, wherein the phonological segment category is such that prosodic patterns are grouped by using a statistical method, and that the phonological segments are grouped so as to best reflect the grouped prosodic patterns.
  - 11. The speech synthesis system according to claim 10, wherein the statistical method is a multivariate analysis method.
  - 12. The speech synthesis system according to claim 7, wherein the phonological segment category is such that the phonological segments are grouped according to a psychological distance between each of the phonemes of each phonological segment, each distance being determined based on a confusion matrix by using a statistical method.
  - 13. The speech synthesis system according to claim 12, wherein the statistical method is a multivariate analysis method.
  - 14. The speech synthesis system according to claim 7, wherein the phonological segment category is such that the phonological segments are grouped according to a similarity of a physical characteristic between the phonological segments.
  - 15. The speech synthesis system according to claim 14, wherein the physical characteristic is at least one characteristic of the phonological segments selected from a fundamental frequency thereof, an intensity thereof, a duration thereof, and a spectrum thereof.
  - 16. The speech synthesis system according to claim 1, wherein the prosodic data stored in the database comprises prosodic feature data extracted from an identical actual human voice.
  - 17. The speech synthesis system according to claim 16, wherein the prosodic feature data comprises at least one of:
18. The speech synthesis system according to claim 1, wherein in the database, the prosodic data are stored in the database such that each prosodic data forms a prosody controlling unit.
19. The speech synthesis system according to claim 1, further comprising a prosody controlling unit comprising one of:
- an accent phrase;
  
  a phrase comprising one or more accent phrases;
  
  a bunsetsu;
  
  a phrase comprising one or more bunsetsus;
  
  a word;
  
  a phrase comprising one or more words;
  
  a stress phrase; and
  
  a phrase comprising one or more stress phrases.
20. The speech synthesis system according to claim 1, wherein:
- each of such input data and such key data comprises a plurality of types of speech indices each being a factor in determining a speech to be synthesized; and
  
  the degree of matching between such input data and such key data is such that in each type of the speech indices, a degree of matching between such input data and such key data is weighted, and the weighted data are combined together.
21. The speech synthesis system according to claim 20, wherein the speech indices comprise data substantially indicating at least one of a phonological segment string of the speech to be synthesized, an accent position in the speech to be synthesized, linguistic data representing a linguistic attribute of the speech to be synthesized, and one of the length of a pause and the presence or absence of a pause in the speech to be synthesized.
22. The speech synthesis system according to claim 21, wherein:
- the speech indices comprise data substantially indicating a phonological segment string of the speech to be synthesized; and
  
  the degree of matching between the speech indices in the input data and the speech indices in the key data includes a degree of similarity of acoustic feature data between phonological segments.
23. The speech synthesis system according to claim 20, wherein the speech indices comprise a phonological segment category string representing a phonological segment category to which a phonological segment in the speech to be synthesized belongs.
24. The speech synthesis system according to claim 23, wherein the degree of matching between the speech indices in the input data and the speech indices in such key data comprises a degree of similarity of the phonological segment category between the phonological segments.
25. The speech synthesis system according to claim 20, wherein the prosodic data comprises a plurality of types of prosodic feature data characterizing the speech to be synthesized.
26. The speech synthesis system according to claim 25, wherein the database is for storing the plurality of types of prosodic feature data so that the plurality of types of prosodic feature data comprises a set of prosodic feature data.
27. The speech synthesis system according to claim 26, wherein the plurality of types of prosodic feature data are extracted from an identical actual human voice.
28. The speech synthesis system according to claim 25, wherein the prosodic feature data comprises at least one of:
- a fundamental frequency pattern representing a variation of a fundamental frequency with respect to time;
  
  a voice intensity pattern representing a variation of a voice intensity with respect to time;
  
  a phonological segment duration pattern representing a duration of a phonological segment; and
  
  a pause data representing one of the presence or absence of a pause and the length of a pause.
29. The speech synthesis system according to claim 28, wherein the phonological segment duration pattern comprises at least one of a phoneme duration pattern, a mora duration pattern, and a syllable duration pattern.
30. The speech synthesis system according to claim 25, further comprising means for retrieving and modifying each of the plurality of types of prosodic feature data according to the weighted degrees of matching between the input data and the key data, the weighted degrees of matching being different from each other.
31. The speech synthesis system according to claim 20, wherein the means for retrieving the prosodic data and the means for modifying the prosodic data are each for using a different weighted degree of matching between the input data and the key data.
32. The speech synthesis system according to claim 20, wherein the means for retrieving the prosodic data and the means for modifying the prosodic data are for using an identical weighted degree of matching between the input data and the key data.
33. The speech synthesis system according to claim 1, wherein the means for modifying is for modifying the prosodic data retrieved by the means for retrieving based on a degree of matching between one of:
- each phoneme;
  
  each mora;
  
  each syllable;
  
  each unit of generating a speech waveform in the means for synthesizing; and
  
  each phonological segment.
34. The speech synthesis system according to claim 33, wherein the degree of matching is determined based on at least one of:
- a distance based on an acoustic characteristic;
  
  a distance obtained from one of a manner of articulation, a place of articulation, and a duration; and
  
  a distance based on a confusion matrix obtained by an auditory experiment.
35. The speech synthesis system according to claim 34, wherein the acoustic characteristic is at least one characteristic of the phonological segments selected from a fundamental frequency thereof, an intensity thereof, a duration thereof, and a spectrum thereof.
36. The speech synthesis system according to claim 1, wherein the database is for storing key data and prosodic data of a plurality of types of languages.
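Claims 7 to 9 and 20 to 24 describe matching over several speech indices, including a phonological segment category string, with each index weighted and the weighted values combined. The following Python sketch shows one way that could look. The category table `CATEGORY` (grouped by manner of articulation), the per-index costs, and the weight values are all hypothetical choices for illustration, not values from the patent.

```python
# Hypothetical category table keyed by manner of articulation.
CATEGORY = {
    "p": "P", "t": "P", "k": "P", "b": "P", "d": "P", "g": "P",  # plosives
    "s": "F", "z": "F", "h": "F",                                # fricatives
    "m": "N", "n": "N",                                          # nasals
    "a": "V", "i": "V", "u": "V", "e": "V", "o": "V",            # vowels
}


def to_category_string(segments: str) -> str:
    """Convert a segment string into a phonological segment category string."""
    return "".join(CATEGORY.get(s, "?") for s in segments)


def string_cost(a: str, b: str) -> int:
    """Toy string cost (assumption): mismatched positions plus length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def combined_cost(query: dict, key: dict, weights: dict) -> float:
    """Weight the cost of each speech index and combine the weighted costs."""
    per_index = {
        "segments": string_cost(query["segments"], key["segments"]),
        "category": string_cost(
            to_category_string(query["segments"]),
            to_category_string(key["segments"]),
        ),
        "accent": abs(query["accent"] - key["accent"]),
        "pause": int(query["pause"] != key["pause"]),
    }
    return sum(weights[name] * cost for name, cost in per_index.items())


weights = {"segments": 1.0, "category": 2.0, "accent": 1.5, "pause": 1.0}
query = {"segments": "kami", "accent": 1, "pause": False}
key_a = {"segments": "tani", "accent": 1, "pause": False}
key_b = {"segments": "kasi", "accent": 2, "pause": False}

print(combined_cost(query, key_a, weights))
print(combined_cost(query, key_b, weights))
```

With these illustrative weights, `key_a` scores lower than `key_b` even though it differs from the query in more individual segments, because its category string and accent position agree with the query. That is the kind of trade-off a weighted combination makes adjustable.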

37. A method of synthesizing speech based on input data representing speech to be synthesized, the method comprising:
- storing in advance a degree of modification of prosodic data in a prosodic data modifying rule means, the degree of modification corresponding to an approximate cost and being stored as a modifying rule;
  
  retrieving prosodic data from a database in which prosodic data for use in synthesizing speech is stored corresponding to key data for use in retrieval, the prosodic data being retrieved according to a degree of matching between such input data and such key data, the degree of matching represented by the approximate cost determined by a cost method, whereby a smallest approximate cost corresponds to a highest degree of said matching;
  
  modifying the retrieved prosodic data based on the degree of matching between such input data and such key data and the modifying rule stored in the prosodic data modifying rule means; and
  
  outputting synthesized speech based on the input data and the modified prosodic data.
38. The method of synthesizing a speech according to claim 37, wherein:
39. The method of synthesizing a speech according to claim 38, wherein the prosodic data comprises a plurality of types of prosodic feature data characterizing such input data.
40. The method of synthesizing a speech according to claim 39, wherein each of the plurality of types of prosodic feature data is retrieved and modified according to the weighted degrees of matching between such input data and such key data, the weighted degrees of matching being different from each other.
41. The method of synthesizing a speech according to claim 38, wherein the retrieving the prosodic data and the modifying the prosodic data are performed each using a different weighted degree of matching between such input data and such key data.
42. The method of synthesizing a speech according to claim 38, wherein the retrieving the prosodic data and the modifying the prosodic data are performed using an identical weighted degree of matching between such input data and such key data.
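Claims 40 to 42 distinguish between using different weighted degrees of matching per prosodic feature type, and using different or identical weightings for the retrieving and modifying steps. The sketch below, in Python, illustrates the "different weights" case. The body of claim 38 is not reproduced in this listing, so the weighting scheme here simply assumes a weighted sum over per-index costs; the weight tables, index names, and entry layout are hypothetical.

```python
def index_costs(query: dict, entry: dict) -> dict:
    """Per-index costs between the input data and an entry's key data (toy)."""
    seg = sum(a != b for a, b in zip(query["segments"], entry["segments"]))
    seg += abs(len(query["segments"]) - len(entry["segments"]))
    return {"segments": seg, "accent": abs(query["accent"] - entry["accent"])}


def weighted(costs: dict, weights: dict) -> float:
    """Combine per-index costs with the given weights."""
    return sum(weights[name] * costs[name] for name in costs)


# Hypothetical weights: one set per prosodic feature type, and separate
# sets for the retrieving step and the modifying step.
RETRIEVAL_WEIGHTS = {
    "f0": {"segments": 0.5, "accent": 2.0},        # assume F0 depends mostly on accent
    "duration": {"segments": 2.0, "accent": 0.5},  # assume duration depends mostly on segments
}
MODIFICATION_WEIGHTS = {
    "f0": {"segments": 1.0, "accent": 1.0},
    "duration": {"segments": 1.0, "accent": 1.0},
}


def retrieve_per_feature(query: dict, database: list) -> dict:
    """For each feature type, pick the best entry and the cost that drives modification."""
    result = {}
    for feature, weights in RETRIEVAL_WEIGHTS.items():
        best = min(database, key=lambda e: weighted(index_costs(query, e), weights))
        mod_cost = weighted(index_costs(query, best), MODIFICATION_WEIGHTS[feature])
        result[feature] = (best["segments"], best[feature], mod_cost)
    return result


database = [
    {"segments": "kami", "accent": 2, "f0": [110.0, 150.0], "duration": [0.12, 0.10]},
    {"segments": "tani", "accent": 1, "f0": [150.0, 110.0], "duration": [0.11, 0.13]},
]
query = {"segments": "kami", "accent": 1}

selected = retrieve_per_feature(query, database)
print(selected)
```

In this toy database the F0 pattern is taken from the entry whose accent matches and the duration pattern from the entry whose segments match, so one utterance can draw different feature types from different stored entries. Setting `MODIFICATION_WEIGHTS` equal to `RETRIEVAL_WEIGHTS` would correspond to the "identical" case of claim 42.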

43. A speech synthesis system wherein an input text is converted into synthesized speech to be outputted, the system comprising:
- language processing means wherein input text is parsed for outputting a phonetic character string and linguistic data;
  
  a prosodic information database storing prosodic feature data, linguistic data, and a phonetic character string so that the prosodic feature data correspond to the linguistic data and the phonetic character string, the prosodic feature data being extracted from actual human speech, and the phonetic character string and the linguistic data corresponding to speech to be synthesized;
  
  a retrieving means for retrieving a prosodic feature data from the prosodic feature data stored in the prosodic information database, the retrieved prosodic feature data corresponding to at least a portion of retrieval items comprising the phonetic character string and the linguistic data outputted from the language processing means;
  
  prosodic data modifying rule means for storing a degree of modification of prosodic data corresponding to the approximate cost, the degree of modification stored as a modifying rule;
  
  a prosody modifying means for modifying the prosodic feature data according to the modifying rule in response to a degree of matching between the retrieval item and the data stored in the prosodic information database, the prosodic feature data being retrieved and selected from the prosodic information database, the degree of matching represented by the approximate cost, the modifying rule being the degree of modification corresponding to the approximate cost; and
  
  a waveform generating means for generating a speech waveform based on the prosodic feature data received from the prosody modifying means and the phonetic character string received from the language processing means.
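The text-to-speech chain recited in claim 43 (language processing, retrieval from a prosodic information database, prosody modification, waveform generation) can be sketched end to end in Python. This is a toy under stated assumptions only: `language_processing` just keeps letters and guesses a sentence type, the cost and the `0.25 * cost` modifying rule are invented for illustration, and `generate_waveform` emits a plain sine tone per segment rather than real speech.

```python
import math

# Hypothetical prosodic information database (values are made up).
DATABASE = [
    {"phonetic": "hai", "linguistic": "statement",
     "f0_hz": [130.0, 120.0, 110.0], "dur_s": [0.08, 0.10, 0.12]},
    {"phonetic": "hai", "linguistic": "question",
     "f0_hz": [120.0, 140.0, 180.0], "dur_s": [0.08, 0.10, 0.15]},
]


def language_processing(text: str):
    """Toy stand-in for parsing: returns (phonetic string, linguistic data)."""
    phonetic = "".join(ch for ch in text.lower() if ch.isalpha())
    linguistic = "question" if text.strip().endswith("?") else "statement"
    return phonetic, linguistic


def cost(phonetic: str, linguistic: str, entry: dict) -> int:
    """Toy approximate cost over both retrieval items (assumption)."""
    mism = sum(a != b for a, b in zip(phonetic, entry["phonetic"]))
    mism += abs(len(phonetic) - len(entry["phonetic"]))
    return mism + (0 if linguistic == entry["linguistic"] else 2)


def generate_waveform(f0_hz, durations_s, sample_rate=8000):
    """Generate one sine segment per prosodic value, keeping the phase continuous."""
    samples, phase = [], 0.0
    for f0, dur in zip(f0_hz, durations_s):
        for _ in range(round(dur * sample_rate)):
            phase += 2 * math.pi * f0 / sample_rate
            samples.append(math.sin(phase))
    return samples


def synthesize(text: str):
    phonetic, linguistic = language_processing(text)
    entry = min(DATABASE, key=lambda e: cost(phonetic, linguistic, e))
    c = cost(phonetic, linguistic, entry)
    degree = min(1.0, 0.25 * c)  # hypothetical modifying rule
    mean = sum(entry["f0_hz"]) / len(entry["f0_hz"])
    # Flatten the contour toward its mean in proportion to the mismatch.
    f0 = [(1 - degree) * f + degree * mean for f in entry["f0_hz"]]
    return generate_waveform(f0, entry["dur_s"])


wave = synthesize("Hai?")
print(len(wave))
```

The sketch assumes the retrieved entry has one prosodic value per segment of the input; a real system would also have to align prosodic values to segments when the retrieved key and the input differ in length.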

Current Assignee
Sovereign Peak Ventures, LLC (Dominion Harbor Enterprises, LLC)
Original Assignee
Matsushita Electric Industrial Company Limited (Panasonic Holdings Corporation)
Inventors
Yamagami, Katsuyoshi; Kamai, Takahiro; Matsui, Kenji; Kato, Yumiko
Primary Examiner(s)
Dorvil, Richemond
Assistant Examiner(s)
Storm, Donald L.

Application Number
US09/701,183
Time in Patent Office
1,457 Days
Field of Search
704/268, 704/267, 704/266, 704/260, 704/258
US Class Current
704/267
CPC Class Codes
G10L 13/04   Details of speech synthesis...
G10L 13/08   Text analysis or generation...
G10L 13/10   Prosody rules derived from ...
