Emotional-speech synthesizing device, method of operating the same and mobile terminal including the same
First Claim
1. A method for emotional speech synthesizing of a mobile terminal, the method comprising:
receiving, via a controller, a control command for outputting of emotional speech;
recognizing, via the controller, an input sentence comprising words;
calculating, via the controller, a probability vector of multiple pre-defined emotions for each of the words that make up the recognized sentence, wherein the probability vector represents a frequency of usage of each of the multiple pre-defined emotions for each of the words in a database (DB) environment;
applying, via the controller, a weight of the probability vector of the multiple pre-defined emotions of each of the words that are used in a real environment;
adjusting a final value of the probability vector based on context information on the recognized sentence;
estimating, via the controller, an emotion and a rhythm of each of the words;
generating, via the controller, one integration emotion rhythm model based on the estimated rhythm and the context information, wherein the one integration emotion rhythm model estimates one integration rhythm based on the context information on the recognized sentence without estimating a separate rhythm for the emotion of each word;
calculating, via the controller, in stages degrees of similarity in an emotion and a rhythm between adjacent words of the recognized sentence based on the estimated emotion and the generated integration emotion rhythm model, wherein the probability vector of the multiple pre-defined emotions is updated to reflect the result of learning that is obtained through calculations of the probability vector;
applying, via the controller, a different weight to all phoneme candidates corresponding to each of the words based on the degrees of the similarity in the estimated emotion and the estimated rhythm and the final value of the probability vector;
selecting, via the controller, one phoneme candidate having a pitch contour that has a minimum distance value from a target pitch contour, among all the phoneme candidates to which the different weight is applied, through a Viterbi search that is based on a cost function; and
synthesizing, via the controller, an emotional speech that corresponds to the recognized sentence in optimal units by connecting the selected phoneme candidate for each of the words;
outputting the emotional speech that is synthesized from the input text sentence; and
displaying the input text sentence at the same speed as the speaker outputs the emotional speech.
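The per-word emotion steps above (DB frequencies normalized into a probability vector, a real-environment weight, then a context-based adjustment) can be sketched as follows. This is a toy illustration only: the emotion set, the sample dictionary `DB_FREQUENCIES`, the weighting rule, and the context `boost` factor are all invented placeholders, not the patent's actual data or formulas.

```python
# Hypothetical sketch of the claim's per-word emotion probability vector
# pipeline. All lexicon entries and weights below are invented examples.

EMOTIONS = ["neutral", "happy", "sad", "angry"]

# Toy "emotion word dictionary": usage frequency of each emotion per word
# as observed in a database (DB) environment.
DB_FREQUENCIES = {
    "great": [2, 10, 1, 1],
    "loss":  [3, 0, 9, 2],
}

def probability_vector(word):
    """Normalize DB usage frequencies into a probability vector."""
    freqs = DB_FREQUENCIES.get(word, [1] * len(EMOTIONS))
    total = sum(freqs)
    return [f / total for f in freqs]

def apply_real_environment_weight(vector, weights):
    """Re-weight the DB-derived vector by real-environment usage weights."""
    weighted = [v * w for v, w in zip(vector, weights)]
    total = sum(weighted)
    return [v / total for v in weighted]

def adjust_for_context(vector, context_emotion, boost=1.5):
    """Adjust the final value of the vector using sentence-context information
    (here, simply boosting the emotion suggested by context)."""
    idx = EMOTIONS.index(context_emotion)
    adjusted = vector[:]
    adjusted[idx] *= boost
    total = sum(adjusted)
    return [v / total for v in adjusted]
```

With this sketch, `probability_vector("great")` peaks at "happy" because the toy dictionary records that emotion as most frequent for the word.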
1 Assignment
0 Petitions
Abstract
Provided is an emotional-speech synthesizing device including: a sentence recognition unit that recognizes a sentence that is input; a word emotion determination unit that calculates a probability vector of pre-defined emotions for each word that makes up the recognized sentence and estimates the emotion and a rhythm based on the probability vector; and an emotional-speech synthesizing unit. The emotional-speech synthesizing unit calculates in stages degrees of similarity in the emotion and the rhythm between adjacent words based on context information on the recognized sentence, applies a weight to the phoneme candidates corresponding to each word based on the degrees of similarity and the probability vector, selects the phoneme candidate that has a minimum distance value from a target pitch, a target duration time, and a target pitch contour, and thus synthesizes an emotional speech that corresponds to the recognized sentence in optimal units.
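The claim's candidate selection, a Viterbi search over weighted phoneme candidates driven by a cost function, can be sketched as below. The cost structure is an assumption for illustration: a target cost (Euclidean distance of a candidate's pitch contour from the target contour) plus a user-supplied concatenation cost between adjacent candidates; the patent's actual cost function is not reproduced here.

```python
# Minimal unit-selection sketch: pick one phoneme candidate per word by a
# Viterbi search minimizing target cost + concatenation cost. The specific
# costs are assumed, not taken from the patent.

def contour_distance(contour, target):
    """Euclidean distance between a candidate pitch contour and the target."""
    return sum((a - b) ** 2 for a, b in zip(contour, target)) ** 0.5

def viterbi_select(candidates_per_word, target_contours, concat_cost):
    """Return, per word, the index of the selected candidate.

    candidates_per_word: list (one entry per word) of candidate pitch contours.
    target_contours: target pitch contour per word.
    concat_cost: function(prev_contour, next_contour) -> float.
    """
    n = len(candidates_per_word)
    # best[i][j]: minimal accumulated cost ending at candidate j of word i.
    best = [[contour_distance(c, target_contours[0])
             for c in candidates_per_word[0]]]
    back = []
    for i in range(1, n):
        row, brow = [], []
        for cand in candidates_per_word[i]:
            tcost = contour_distance(cand, target_contours[i])
            costs = [best[i - 1][k]
                     + concat_cost(candidates_per_word[i - 1][k], cand)
                     for k in range(len(candidates_per_word[i - 1]))]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_best] + tcost)
            brow.append(k_best)
        best.append(row)
        back.append(brow)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for brow in reversed(back):
        j = brow[j]
        path.append(j)
    return list(reversed(path))
```

With two candidates per word, the search favors the candidates whose contours match the targets while keeping junctions between adjacent selections smooth.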
5 Claims
3. A mobile terminal comprising:
a key configured to input a control command for synthesizing an emotional speech;
a memory configured to store an emotion word dictionary in which each of the words is classified as an entry having multiple pre-defined emotions;
a controller configured to:
receive at least one sentence comprising words that is input as text, based on the control command,
calculate a probability vector of the multiple pre-defined emotions for each of the words that make up the recognized sentence, wherein the probability vector represents a frequency of usage of each of the multiple pre-defined emotions for each of the words in a database (DB) environment, and wherein the probability vector of the multiple pre-defined emotions is updated to reflect the result of learning that is obtained through calculations of the probability vector,
apply a weight of the probability vector of the multiple pre-defined emotions of each of the words that are used in a real environment,
adjust a final value of the probability vector based on context information on the recognized sentence,
estimate an emotion and a rhythm of each of the words,
generate one integration emotion rhythm model based on the estimated rhythm and the context information, wherein the one integration emotion rhythm model estimates one integration rhythm based on the context information on the recognized sentence without estimating a separate rhythm for the emotion of each word,
calculate in stages degrees of similarity in an emotion and a rhythm between adjacent words of the recognized sentence based on the estimated emotion and the generated integration emotion rhythm model,
apply a different weight to all phoneme candidates corresponding to each of the words based on the degrees of the similarity in the estimated emotion and the estimated rhythm and the final value of the probability vector,
select one phoneme candidate having a pitch contour that has a minimum distance value from a target pitch contour, among all the phoneme candidates to which the different weight is applied, through a Viterbi search that is based on a cost function, and
synthesize the emotional speech that corresponds to the recognized sentence in optimal units by connecting the selected phoneme candidate for each of the words;
a speaker configured to output the emotional speech that is synthesized from the input text sentence; and
a display configured to display the input text sentence at the same speed as the speaker outputs the emotional speech.
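The claim's staged similarity computation between adjacent words, and the use of that similarity to re-weight phoneme candidates, can be sketched as follows. The similarity measure (cosine similarity of the per-word emotion probability vectors) and the scaling rule are assumptions chosen for illustration, not the patent's actual formulas.

```python
# Toy sketch: degrees of similarity in emotion between adjacent words,
# computed as cosine similarity of their emotion probability vectors, then
# applied as a weight to candidate costs. Both rules are assumed.

def cosine_similarity(u, v):
    """Cosine similarity between two emotion probability vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def adjacent_similarities(word_vectors):
    """Stage-wise similarity between each pair of adjacent words."""
    return [cosine_similarity(word_vectors[i], word_vectors[i + 1])
            for i in range(len(word_vectors) - 1)]

def weight_candidates(base_costs, similarity):
    """Scale candidate costs by neighbor similarity: higher emotional
    similarity lowers the effective cost, favoring smooth transitions
    (an assumed rule for illustration)."""
    return [c * (2.0 - similarity) for c in base_costs]
```

Identical adjacent vectors give similarity 1.0 and leave candidate costs unchanged, while dissimilar neighbors inflate costs, penalizing abrupt emotional jumps.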
Specification