Method and apparatus for editing/creating synthetic speech message and recording medium with the method recorded thereon
Abstract
A three-layered prosody control description language is used to insert prosodic feature control commands in a text at the positions of characters or a character string to be added with non-verbal information. The three-layered prosody control description language is composed of: a semantic layer (S layer) having, as its prosodic feature control commands, control commands each represented by a word indicative of the meaning of non-verbal information; an interpretation layer (I layer) having, as its prosodic feature control commands, control commands which interpret the prosodic feature control commands of the S layer and specify control of prosodic parameters of speech; and a parameter layer (P layer) having prosodic parameters which are objects of control by the prosodic feature control commands of the I layer. The text is converted into a prosodic parameter string through synthesis-by-rule. The prosodic parameters corresponding to characters or character string to be corrected are corrected by the prosodic feature control commands of the I layer, and speech is synthesized from a parameter string containing the corrected prosodic parameters.
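As a rough illustration of the three-layer structure described in the abstract, the S, I and P layers can be modeled as lookup tables: each S-layer command expands into a set of I-layer commands, and each I-layer command modifies P-layer parameters such as pitch and power. The sketch below is purely illustrative; all names, commands and numeric values are hypothetical and not taken from the patent.

```python
# P layer: controllable prosodic parameters for one text unit (hypothetical units).
def make_params(pitch_hz, power_db, duration_ms):
    return {"pitch": pitch_hz, "power": power_db, "duration": duration_ms}

# I layer: commands that specify the details of control of P-layer parameters.
I_LAYER = {
    "raise_pitch_10pct": lambda p: {**p, "pitch": p["pitch"] * 1.10},
    "boost_power_3db":   lambda p: {**p, "power": p["power"] + 3.0},
    "lengthen_20pct":    lambda p: {**p, "duration": p["duration"] * 1.20},
}

# S layer: each command names an intended non-verbal meaning and expands to a
# set of I-layer commands (playing the role of the prosody control rule database).
S_LAYER = {
    "emphasis":   ["raise_pitch_10pct", "boost_power_3db"],
    "hesitation": ["lengthen_20pct"],
}

def apply_s_command(s_command, params):
    """Expand an S-layer command into its I-layer command set and apply it."""
    for i_name in S_LAYER[s_command]:
        params = I_LAYER[i_name](params)
    return params

p = apply_s_command("emphasis", make_params(120.0, 60.0, 80.0))
```

The point of the layering is that an editor works with meaningful S-layer words ("emphasis") while the database resolves them into concrete parameter manipulations.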
13 Claims
1. A method for editing non-verbal information of a speech message synthesized by rules in correspondence to a text, said method comprising the steps of:
(a) inserting in said text, at the position of a character or character string to be added with non-verbal information, a prosodic feature control command of a semantic layer (hereinafter referred to as an S layer) and/or an interpretation layer (hereinafter referred to as an I layer) of a multi-layered description language so as to effect prosody control corresponding to said non-verbal information, said multi-layered description language being composed of said S and I layers and a parameter layer (hereinafter referred to as a P layer), said P layer being a group of controllable prosodic parameters including at least pitch and power, said I layer being a group of prosodic feature control commands for specifying details of control of said prosodic parameters of said P layer, said S layer being a group of prosodic feature control commands each represented by a phrase or word indicative of an intended meaning of non-verbal information, for executing a command set composed of at least one prosodic feature control command of said I layer, and the relationship between each prosodic feature control command of said S layer and said set of prosodic feature control commands of said I layer and prosody control rules indicating details of control of said prosodic parameters of said P layer by said prosodic feature control commands of said I layer being prestored in a prosody control rule database;
(b) extracting from said text a prosodic parameter string of speech synthesized by rules;
(c) controlling that one of said prosodic parameters of said prosodic parameter string corresponding to said character or character string to be added with said non-verbal information, by referring to said prosody control rules stored in said prosody control rule database; and
(d) synthesizing speech from said prosodic parameter string containing said controlled prosodic parameter and for outputting a synthetic speech message. - View Dependent Claims (2, 3, 4, 5, 6, 7)
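Steps (a) through (c) of claim 1 can be sketched as a toy pipeline: a command inserted in the text marks a character string, the text is converted to a per-character prosodic parameter string, and only the parameters in the marked range are controlled by rules from a database. The tag syntax, rule contents and function names below are hypothetical stand-ins, not the patent's disclosed format.

```python
import re

# Step (a): a hypothetical tag syntax, e.g. "<emphasis>no</emphasis>" in the text.
def separate(text):
    """Return the plain text plus (start, end, command) spans for each tag."""
    spans, plain, pos = [], [], 0
    for m in re.finditer(r"<(\w+)>(.*?)</\1>", text):
        plain.append(text[pos:m.start()])
        start = sum(len(s) for s in plain)
        plain.append(m.group(2))
        spans.append((start, start + len(m.group(2)), m.group(1)))
        pos = m.end()
    plain.append(text[pos:])
    return "".join(plain), spans

# Step (b): stand-in for synthesis-by-rule, one default parameter per character.
def to_params(plain):
    return [{"pitch": 100.0, "power": 60.0} for _ in plain]

# Stand-in prosody control rule database.
RULES = {"emphasis": lambda p: {**p, "pitch": p["pitch"] * 1.2}}

# Step (c): apply each command only to its own character range.
def control(params, spans):
    for start, end, cmd in spans:
        for i in range(start, end):
            params[i] = RULES[cmd](params[i])
    return params

plain, spans = separate("say <emphasis>no</emphasis> now")
params = control(to_params(plain), spans)
```

Step (d) would then hand the corrected parameter string to a synthesizer; that stage is omitted here.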
8. A method for editing non-verbal information by adding information of mental states to a speech message synthesized by rules in correspondence to a text, said method comprising the steps of:
(a) extracting from said text a prosodic parameter string of speech synthesized by rules;
(b) correcting that one of prosodic parameters of said prosodic parameter string corresponding to the character or character string to be added with said non-verbal information, through the use of at least one of basic prosody control rules defined by modifications of at least one of pitch patterns, power patterns and durations characteristic of a plurality of predetermined pieces of non-verbal information, respectively, said basic prosody control rules being stored in a memory in correspondence to predetermined mental states, respectively; and
(c) synthesizing speech from said prosodic parameter string containing said corrected prosodic parameter and outputting a synthetic speech message;
wherein a multi-layered description language is defined which comprises a semantic layer (hereinafter referred to as an S layer) composed of prosodic feature control commands each represented by a word or phrase indicative of an intended meaning of predetermined non-verbal information, an interpretation layer (hereinafter referred to as an I layer) composed of prosodic feature control commands each defining a physical meaning of control of prosodic parameters by one prosodic feature control command of said S layer and a parameter layer composed of a cluster of prosodic parameters of a control object, said method further comprising a step of describing in said multi-layered description language, the prosodic feature control command corresponding to said non-verbal information in said text at the position of said character or character string to be added with said non-verbal information.
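The correction of step (b) in claim 8, using basic prosody control rules stored per mental state, might look as follows. The mental states, scale factors and names here are invented for the example; the patent leaves the actual rule values to the stored database.

```python
# Hypothetical basic prosody control rules, one per predetermined mental state,
# each defined as a modification of pitch, power and duration (claim 8, step (b)).
MENTAL_STATE_RULES = {
    "anger":   {"pitch_scale": 1.3, "power_offset": 6.0,  "duration_scale": 0.9},
    "sadness": {"pitch_scale": 0.8, "power_offset": -4.0, "duration_scale": 1.3},
}

def correct_segment(params, rule):
    """Apply one basic rule to the prosodic parameters of one character's segment."""
    return {
        "pitch":    params["pitch"] * rule["pitch_scale"],
        "power":    params["power"] + rule["power_offset"],
        "duration": params["duration"] * rule["duration_scale"],
    }

seg = {"pitch": 100.0, "power": 60.0, "duration": 100.0}
sad = correct_segment(seg, MENTAL_STATE_RULES["sadness"])
```

Lower, softer and slower speech for "sadness" is a plausible rule shape, but any concrete mapping would come from the stored prosody control rule database.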
9. A method for editing non-verbal information by adding information of mental states to a speech message synthesized by rules in correspondence to a text, said method comprising the steps of:
(a) analyzing said text to extract therefrom a prosodic parameter string based on synthesis-by-rule speech;
(b) correcting that one of the prosodic parameters of said prosodic parameter string corresponding to the character or character string to be added with said non-verbal information, through the use of basic prosody control rules defined by modifications of at least one of pitch patterns, power patterns and durations based on prosodic parameters characteristic of said non-verbal information, said basic prosody control rules being stored in a memory in correspondence to predetermined mental states, respectively;
wherein said correcting step is performed following a prosodic feature control command described in said text in correspondence to said character or character string to be added with said non-verbal information;
(c) synthesizing speech by said corrected prosodic parameter;
(d) converting said modification information of said prosodic parameter to character conversion information, such as the position, size, typeface and display color of each character in said text, based on experimentally obtained relationships: one between each combination of values of different prosodies that effects a desired mental state and the at least one of character size and position best matched to the visual impression of that mental state, and another between each variation pattern of at least one prosody that effects another desired mental state and the at least one of character size and position best matched to the visual impression of that other mental state; and
(e) converting the characters of said text based on said character conversion information and displaying them accordingly;
wherein a multi-layered description language is defined which comprises a semantic layer (hereinafter referred to as an S layer) composed of prosodic feature control commands each represented by a word or phrase indicative of an intended meaning of predetermined non-verbal information, an interpretation layer (hereinafter referred to as an I layer) composed of prosodic feature control commands each defining a physical meaning of control of prosodic parameters by one prosodic feature control command of said S layer and a parameter layer composed of a cluster of prosodic parameters of a control object, said method further comprising a step of describing in said multi-layered description language, the prosodic feature control command corresponding to said non-verbal information in said text at the position of said character or character string to be added with said non-verbal information.
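Steps (d) and (e) of claim 9 map prosodic modifications onto character display attributes. A toy version of such a mapping follows; the thresholds, attribute choices and rendering format are hypothetical, not the experimentally obtained relationships the claim refers to.

```python
# Step (d): hypothetical conversion from prosodic modification information to
# character conversion information (size and display color here).
def to_display(pitch_scale, power_offset, base_size=12):
    size = round(base_size * max(pitch_scale, 0.5))  # character size tracks pitch change
    color = "red" if power_offset > 0 else "blue" if power_offset < 0 else "black"
    return {"size": size, "color": color}

# Step (e): toy textual rendering of the converted characters.
def render(text, display):
    return [f"{ch}[{display['size']}pt,{display['color']}]" for ch in text]

d = to_display(pitch_scale=1.3, power_offset=6.0)
chars = render("no", d)
```

The idea is that an emphatic, louder rendering of a word is echoed visually (larger, warmer characters), so the editor can judge the non-verbal effect from the displayed text alone.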
10. A synthetic speech message editing/creating apparatus comprising:
a text/prosodic feature control command input part into which a prosodic feature control command to be inserted in an input text is input, said prosodic feature control command being described in a multi-layered description language composed of semantic, interpretation and parameter layers (hereinafter referred to simply as an S, an I and a P layer, respectively), said P layer being a group of controllable prosodic parameters including at least pitch and power, said I layer being a group of prosodic feature control commands for specifying details of control of said prosodic parameters of said P layer, and said S layer being a group of prosodic feature control commands each represented by a phrase or word indicative of an intended meaning of non-verbal information, for executing command sets each composed of at least one prosodic feature control command of said I layer;
a text/prosodic feature control command separating part for separating said prosodic feature control command from said text;
a speech synthesis information converting part for generating a prosodic parameter string from said separated text based on a "synthesis-by-rule" method;
a prosodic feature control command analysis part for extracting, from said separated prosodic feature control command, information about its position in said text;
a prosodic feature control part for controlling and correcting said prosodic parameter string based on said extracted position information and said separated prosodic feature control command; and
a speech synthesis part for generating synthetic speech based on said corrected prosodic parameter string from said prosodic feature control part. - View Dependent Claims (11, 12)
11. The apparatus of claim 10, which further comprises: an input speech analysis part for analyzing input speech containing non-verbal information to obtain prosodic parameters;
a prosodic feature/prosodic feature control command conversion part for converting said prosodic parameters in said input speech to a set of prosodic feature control commands; and
a prosody control rule database for storing said set of prosodic feature control commands in correspondence to said non-verbal information.
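The data flow through the apparatus parts of claims 10 and 11 can be sketched as a chain of functions, one per claimed part. The marker syntax (a '*' before a character marks it for "emphasis") and the rule database are invented stand-ins for the claimed text/command separation and prosody control rule database.

```python
# Toy separating part: '*' before a character marks it with the "emphasis" command.
def separate_commands(tagged_text):
    text, commands, mark = "", [], False
    for ch in tagged_text:
        if ch == "*":
            mark = True
        else:
            if mark:
                commands.append((len(text), "emphasis"))  # position info (analysis part)
                mark = False
            text += ch
    return text, commands

# One function per claimed part, chained in order.
def editing_apparatus(tagged_text, rule_db):
    text, commands = separate_commands(tagged_text)   # separating + analysis parts
    params = [{"pitch": 100.0} for _ in text]         # synthesis info converting part
    for pos, cmd in commands:                         # prosodic feature control part
        params[pos] = rule_db[cmd](params[pos])
    return params                                     # handed to the speech synthesis part

db = {"emphasis": lambda p: {**p, "pitch": p["pitch"] * 1.5}}
out = editing_apparatus("a*bc", db)
```

Claim 11's analysis part would additionally derive such rule entries from recorded speech containing the desired non-verbal information, rather than hand-writing them as above.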
12. The apparatus of claim 11, which further comprises a display type synthetic speech editing part provided with a display screen and GUI means, and wherein said display type synthetic speech editing part reads out a set of prosodic feature control commands corresponding to desired non-verbal information from said prosody control rule database into said prosodic feature/prosodic feature control command conversion part, displays said read-out set of prosodic feature control commands on said display screen, and corrects said set of prosodic feature control commands by said GUI means, thereby updating the corresponding prosodic feature control command set in said prosody control rule database.
13. A recording medium having recorded thereon a procedure for editing/creating non-verbal information of a synthetic speech message by rules, said procedure comprising the steps of:
(a) describing a prosodic feature control command corresponding to said non-verbal information in a multi-layered description language in an input text at the position of a character or character string to be added with said non-verbal information, said multi-layered description language being composed of a semantic layer (hereinafter referred to as an S layer), an interpretation layer (hereinafter referred to as an I layer) and a parameter layer (hereinafter referred to as a P layer);
(b) extracting from said text a prosodic parameter string of speech synthesized by rules;
(c) controlling that one of the prosodic parameters of said prosodic parameter string corresponding to said character or character string to be added with said non-verbal information, by said prosodic feature control command; and
(d) synthesizing speech from said prosodic parameter string containing said controlled prosodic parameter and outputting a synthetic speech message.
Specification