Method and apparatus for editing/creating synthetic speech message and recording medium with the method recorded thereon
Abstract
A three-layered prosody control description language is used to insert prosodic feature control commands in a text at the positions of characters or a character string to be added with non-verbal information. The three-layered prosody control description language is composed of: a semantic layer (S layer) having, as its prosodic feature control commands, control commands each represented by a word indicative of the meaning of non-verbal information; an interpretation layer (I layer) having, as its prosodic feature control commands, control commands which interpret the prosodic feature control commands of the S layer and specify control of prosodic parameters of speech; and a parameter layer (P layer) having prosodic parameters which are objects of control by the prosodic feature control commands of the I layer. The text is converted into a prosodic parameter string through synthesis-by-rule. The prosodic parameters corresponding to characters or character string to be corrected are corrected by the prosodic feature control commands of the I layer, and speech is synthesized from a parameter string containing the corrected prosodic parameters.
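As a rough illustration of the three-layer structure described in the abstract, the S, I and P layers can be modeled as lookup tables: each S-layer command expands into a set of I-layer commands, and each I-layer command modifies P-layer parameters such as pitch and power. The sketch below is purely illustrative; all names, commands and numeric values are hypothetical and not taken from the patent.

```python
# P layer: controllable prosodic parameters for one text unit (hypothetical units).
def make_params(pitch_hz, power_db, duration_ms):
    return {"pitch": pitch_hz, "power": power_db, "duration": duration_ms}

# I layer: commands that specify the details of control of P-layer parameters.
I_LAYER = {
    "raise_pitch_10pct": lambda p: {**p, "pitch": p["pitch"] * 1.10},
    "boost_power_3db":   lambda p: {**p, "power": p["power"] + 3.0},
    "lengthen_20pct":    lambda p: {**p, "duration": p["duration"] * 1.20},
}

# S layer: each command names an intended non-verbal meaning and expands to a
# set of I-layer commands (playing the role of the prosody control rule database).
S_LAYER = {
    "emphasis":   ["raise_pitch_10pct", "boost_power_3db"],
    "hesitation": ["lengthen_20pct"],
}

def apply_s_command(s_command, params):
    """Expand an S-layer command into its I-layer command set and apply it."""
    for i_name in S_LAYER[s_command]:
        params = I_LAYER[i_name](params)
    return params

p = apply_s_command("emphasis", make_params(120.0, 60.0, 80.0))
```

The point of the layering is that an editor works with meaningful S-layer words ("emphasis") while the database resolves them into concrete parameter manipulations.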
13 Claims
1. A method for editing non-verbal information of a speech message synthesized by rules in correspondence to a text, said method comprising the steps of:
(a) inserting in said text, at the position of a character or character string to be added with non-verbal information, a prosodic feature control command of a semantic layer (hereinafter referred to as an S layer) and/or an interpretation layer (hereinafter referred to as an I layer) of a multi-layered description language so as to effect prosody control corresponding to said non-verbal information, said multi-layered description language being composed of said S and I layers and a parameter layer (hereinafter referred to as a P layer), said P layer being a group of controllable prosodic parameters including at least pitch and power, said I layer being a group of prosodic feature control commands for specifying details of control of said prosodic parameters of said P layer, said S layer being a group of prosodic feature control commands each represented by a phrase or word indicative of an intended meaning of non-verbal information, for executing a command set composed of at least one prosodic feature control command of said I layer, and the relationship between each prosodic feature control command of said S layer and said set of prosodic feature control commands of said I layer and prosody control rules indicating details of control of said prosodic parameters of said P layer by said prosodic feature control commands of said I layer being prestored in a prosody control rule database;
(b) extracting from said text a prosodic parameter string of speech synthesized by rules;
(c) controlling that one of said prosodic parameters of said prosodic parameter string corresponding to said character or character string to be added with said non-verbal information, by referring to said prosody control rules stored in said prosody control rule database; and
(d) synthesizing speech from said prosodic parameter string containing said controlled prosodic parameter and for outputting a synthetic speech message. - View Dependent Claims (2, 3, 4, 5, 6, 7)
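Steps (a) through (c) of claim 1 can be sketched as a toy pipeline: a command inserted in the text marks a character string, the text is converted to a per-character prosodic parameter string, and only the parameters in the marked range are controlled by rules from a database. The tag syntax, rule contents and function names below are hypothetical stand-ins, not the patent's disclosed format.

```python
import re

# Step (a): a hypothetical tag syntax, e.g. "<emphasis>no</emphasis>" in the text.
def separate(text):
    """Return the plain text plus (start, end, command) spans for each tag."""
    spans, plain, pos = [], [], 0
    for m in re.finditer(r"<(\w+)>(.*?)</\1>", text):
        plain.append(text[pos:m.start()])
        start = sum(len(s) for s in plain)
        plain.append(m.group(2))
        spans.append((start, start + len(m.group(2)), m.group(1)))
        pos = m.end()
    plain.append(text[pos:])
    return "".join(plain), spans

# Step (b): stand-in for synthesis-by-rule, one default parameter per character.
def to_params(plain):
    return [{"pitch": 100.0, "power": 60.0} for _ in plain]

# Stand-in prosody control rule database.
RULES = {"emphasis": lambda p: {**p, "pitch": p["pitch"] * 1.2}}

# Step (c): apply each command only to its own character range.
def control(params, spans):
    for start, end, cmd in spans:
        for i in range(start, end):
            params[i] = RULES[cmd](params[i])
    return params

plain, spans = separate("say <emphasis>no</emphasis> now")
params = control(to_params(plain), spans)
```

Step (d) would then hand the corrected parameter string to a synthesizer; that stage is omitted here.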
8. A method for editing non-verbal information by adding information of mental states to a speech message synthesized by rules in correspondence to a text, said method comprising the steps of:
(a) extracting from said text a prosodic parameter string of speech synthesized by rules;
(b) correcting that one of prosodic parameters of said prosodic parameter string corresponding to the character or character string to be added with said non-verbal information, through the use of at least one of basic prosody control rules defined by modifications of at least one of pitch patterns, power patterns and durations characteristic of a plurality of predetermined pieces of non-verbal information, respectively, said basic prosody control rules being stored in a memory in correspondence to predetermined mental states, respectively; and
(c) synthesizing speech from said prosodic parameter string containing said corrected prosodic parameter and outputting a synthetic speech message;
wherein a multi-layered description language is defined which comprises a semantic layer (hereinafter referred to as an S layer) composed of prosodic feature control commands each represented by a word or phrase indicative of an intended meaning of predetermined non-verbal information, an interpretation layer (hereinafter referred to as an I layer) composed of prosodic feature control commands each defining a physical meaning of control of prosodic parameters by one prosodic feature control command of said S layer and a parameter layer composed of a cluster of prosodic parameters of a control object, said method further comprising a step of describing in said multi-layered description language, the prosodic feature control command corresponding to said non-verbal information in said text at the position of said character or character string to be added with said non-verbal information.
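The correction of step (b) in claim 8, using basic prosody control rules stored per mental state, might look as follows. The mental states, scale factors and names here are invented for the example; the patent leaves the actual rule values to the stored database.

```python
# Hypothetical basic prosody control rules, one per predetermined mental state,
# each defined as a modification of pitch, power and duration (claim 8, step (b)).
MENTAL_STATE_RULES = {
    "anger":   {"pitch_scale": 1.3, "power_offset": 6.0,  "duration_scale": 0.9},
    "sadness": {"pitch_scale": 0.8, "power_offset": -4.0, "duration_scale": 1.3},
}

def correct_segment(params, rule):
    """Apply one basic rule to the prosodic parameters of one character's segment."""
    return {
        "pitch":    params["pitch"] * rule["pitch_scale"],
        "power":    params["power"] + rule["power_offset"],
        "duration": params["duration"] * rule["duration_scale"],
    }

seg = {"pitch": 100.0, "power": 60.0, "duration": 100.0}
sad = correct_segment(seg, MENTAL_STATE_RULES["sadness"])
```

Lower, softer and slower speech for "sadness" is a plausible rule shape, but any concrete mapping would come from the stored prosody control rule database.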
9. A method for editing non-verbal information by adding information of mental states to a speech message synthesized by rules in correspondence to a text, said method comprising the steps of:
(a) analyzing said text to extract therefrom a prosodic parameter string based on synthesis-by-rule speech;
(b) correcting that one of the prosodic parameters of said prosodic parameter string corresponding to the character or character string to be added with said non-verbal information, through the use of basic prosody control rules defined by modifications of at least one of pitch patterns, power patterns and durations based on prosodic parameters characteristic of said non-verbal information, said basic prosody control rules being stored in a memory in correspondence to predetermined mental states, respectively;
wherein said correcting step is performed following a prosodic feature control command described in said text in correspondence to said character or character string to be added with said non-verbal information;
(c) synthesizing speech by said corrected prosodic parameter;
(d) converting said modification information of said prosodic parameter to character conversion information, such as the position, size, typeface and display color of each character in said text, based on experimentally obtained relationships: one between each combination of values of different prosodies that effects a desired mental state and the at least one of character size and position best matched to the visual impression of that mental state, and another between each variation pattern of at least one prosody that effects another desired mental state and the at least one of character size and position best matched to the visual impression of that other mental state; and
(e) converting the characters of said text based on said character conversion information and displaying them accordingly;
wherein a multi-layered description language is defined which comprises a semantic layer (hereinafter referred to as an S layer) composed of prosodic feature control commands each represented by a word or phrase indicative of an intended meaning of predetermined non-verbal information, an interpretation layer (hereinafter referred to as an I layer) composed of prosodic feature control commands each defining a physical meaning of control of prosodic parameters by one prosodic feature control command of said S layer and a parameter layer composed of a cluster of prosodic parameters of a control object, said method further comprising a step of describing in said multi-layered description language, the prosodic feature control command corresponding to said non-verbal information in said text at the position of said character or character string to be added with said non-verbal information.
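Steps (d) and (e) of claim 9 map prosodic modifications onto character display attributes. A toy version of such a mapping follows; the thresholds, attribute choices and rendering format are hypothetical, not the experimentally obtained relationships the claim refers to.

```python
# Step (d): hypothetical conversion from prosodic modification information to
# character conversion information (size and display color here).
def to_display(pitch_scale, power_offset, base_size=12):
    size = round(base_size * max(pitch_scale, 0.5))  # character size tracks pitch change
    color = "red" if power_offset > 0 else "blue" if power_offset < 0 else "black"
    return {"size": size, "color": color}

# Step (e): toy textual rendering of the converted characters.
def render(text, display):
    return [f"{ch}[{display['size']}pt,{display['color']}]" for ch in text]

d = to_display(pitch_scale=1.3, power_offset=6.0)
chars = render("no", d)
```

The idea is that an emphatic, louder rendering of a word is echoed visually (larger, warmer characters), so the editor can judge the non-verbal effect from the displayed text alone.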
10. A synthetic speech message editing/creating apparatus comprising:
a text/prosodic feature control command input part into which a prosodic feature control command to be inserted in an input text is input, said prosodic feature control command being described in a multi-layered description language composed of semantic, interpretation and parameter layers (hereinafter referred to simply as an S, an I and a P layer, respectively), said P layer being a group of controllable prosodic parameters including at least pitch and power, said I layer being a group of prosodic feature control commands for specifying details of control of said prosodic parameters of said P layer, and said S layer being a group of prosodic feature control commands each represented by a phrase or word indicative of an intended meaning of non-verbal information, for executing command sets each composed of at least one prosodic feature control command of said I layer;
a text/prosodic feature control command separating part for separating said prosodic feature control command from said text;
a speech synthesis information converting part for generating a prosodic parameter string from said separated text based on a "synthesis-by-rule" method;
a prosodic feature control command analysis part for extracting, from said separated prosodic feature control command, information about its position in said text;
a prosodic feature control part for controlling and correcting said prosodic parameter string based on said extracted position information and said separated prosodic feature control command; and
a speech synthesis part for generating synthetic speech based on said corrected prosodic parameter string from said prosodic feature control part. - View Dependent Claims (11, 12)
11. The apparatus of claim 10, which further comprises: an input speech analysis part for analyzing input speech containing non-verbal information to obtain prosodic parameters;
a prosodic feature/prosodic feature control command conversion part for converting said prosodic parameters in said input speech to a set of prosodic feature control commands; and
a prosody control rule database for storing said set of prosodic feature control commands in correspondence to said non-verbal information.
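The data flow through the apparatus parts of claims 10 and 11 can be sketched as a chain of functions, one per claimed part. The marker syntax (a '*' before a character marks it for "emphasis") and the rule database are invented stand-ins for the claimed text/command separation and prosody control rule database.

```python
# Toy separating part: '*' before a character marks it with the "emphasis" command.
def separate_commands(tagged_text):
    text, commands, mark = "", [], False
    for ch in tagged_text:
        if ch == "*":
            mark = True
        else:
            if mark:
                commands.append((len(text), "emphasis"))  # position info (analysis part)
                mark = False
            text += ch
    return text, commands

# One function per claimed part, chained in order.
def editing_apparatus(tagged_text, rule_db):
    text, commands = separate_commands(tagged_text)   # separating + analysis parts
    params = [{"pitch": 100.0} for _ in text]         # synthesis info converting part
    for pos, cmd in commands:                         # prosodic feature control part
        params[pos] = rule_db[cmd](params[pos])
    return params                                     # handed to the speech synthesis part

db = {"emphasis": lambda p: {**p, "pitch": p["pitch"] * 1.5}}
out = editing_apparatus("a*bc", db)
```

Claim 11's analysis part would additionally derive such rule entries from recorded speech containing the desired non-verbal information, rather than hand-writing them as above.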
12. The apparatus of claim 11, which further comprises a display type synthetic speech editing part provided with a display screen and GUI means, and wherein said display type synthetic speech editing part reads out a set of prosodic feature control commands corresponding to desired non-verbal information from said prosody control rule database into said prosodic feature/prosodic feature control command conversion part, displays said read-out set of prosodic feature control commands on said display screen, and corrects said set of prosodic feature control commands by said GUI means, thereby updating the corresponding prosodic feature control command set in said prosody control rule database.
13. A recording medium having recorded thereon a procedure for editing/creating non-verbal information of a synthetic speech message by rules, said procedure comprising the steps of:
(a) describing a prosodic feature control command corresponding to said non-verbal information in a multi-layered description language in an input text at the position of a character or character string to be added with said non-verbal information, said multi-layered description language being composed of a semantic layer (hereinafter referred to as an S layer), an interpretation layer (hereinafter referred to as an I layer) and a parameter layer (hereinafter referred to as a P layer);
(b) extracting from said text a prosodic parameter string of speech synthesized by rules;
(c) controlling that one of the prosodic parameters of said prosodic parameter string corresponding to said character or character string to be added with said non-verbal information, by said prosodic feature control command; and
(d) synthesizing speech from said prosodic parameter string containing said controlled prosodic parameter and outputting a synthetic speech message.
Specification