Method and system for achieving emotional text to speech utilizing emotion tags expressed as a set of emotion vectors
First Claim
1. A method for achieving emotional Text To Speech (TTS), the method comprising:
receiving a set of text data;
organizing each of a plurality of words in the set of text data into a plurality of rhythm pieces;
generating an emotion tag for each of the plurality of rhythm pieces, wherein each emotion tag is expressed as a set of emotion vectors, each emotion vector comprising a plurality of emotion scores, where each of the plurality of emotion scores is assigned to a different emotion category in a plurality of emotion categories;
determining, for each of the plurality of rhythm pieces, a final emotion score for the rhythm piece based on at least each of the plurality of emotion scores;
determining, for each of the plurality of rhythm pieces, a final emotional category for the rhythm piece based on at least each of the plurality of emotion categories; and
performing, by at least one processor of at least one computing device, TTS of the set of text data utilizing each of the emotion tags, where performing TTS comprises
decomposing at least one rhythm piece in the plurality of rhythm pieces into a set of phones; and
synthesizing the at least one rhythm piece into audio comprising at least one emotion characteristic based on at least one speech feature of each phone in the set of phones, where the at least one speech feature is calculated as a function of at least the final emotion score and the final emotion category, and wherein calculating the at least one speech feature is based on:
$F_i = (1 - P_{\text{emotion}}) \cdot F_{i\text{-neutral}} + P_{\text{emotion}} \cdot F_{i\text{-emotion}}$
wherein:
$F_i$ is a value of an ith speech feature of one of the set of phones,
$P_{\text{emotion}}$ is the final emotion score of the rhythm piece where one of plurality of phones lies,
$F_{i\text{-neutral}}$ is a first speech feature value of an ith speech feature in a neutral emotion category, and
$F_{i\text{-emotion}}$ is a second speech feature value of an ith speech feature in the final emotion category.
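The formula recited in the claim is a linear blend between a neutral feature value and an emotional one. The following is a minimal Python sketch of that blend, not the patentee's implementation. The function name, the feature names (`f0_hz`, `duration_ms`), the numeric values, and the restriction of the final emotion score to the range 0 to 1 are all assumptions made for illustration; the claim itself does not state them.

```python
def blend_speech_feature(p_emotion: float, f_neutral: float, f_emotion: float) -> float:
    """Compute F_i = (1 - P_emotion) * F_i-neutral + P_emotion * F_i-emotion."""
    # Assumption: the final emotion score behaves like a weight in [0, 1].
    if not 0.0 <= p_emotion <= 1.0:
        raise ValueError("p_emotion is assumed to lie in [0, 1]")
    return (1.0 - p_emotion) * f_neutral + p_emotion * f_emotion


# Hypothetical per-phone feature values; names and numbers are illustrative only.
neutral = {"f0_hz": 200.0, "duration_ms": 80.0}
emotional = {"f0_hz": 260.0, "duration_ms": 60.0}
p = 0.5  # hypothetical final emotion score of the rhythm piece containing the phone

# Apply the same blend to every speech feature of the phone.
blended = {name: blend_speech_feature(p, neutral[name], emotional[name]) for name in neutral}
print(blended)
```

Under these assumptions, a score of 0 leaves the phone at its neutral feature values and a score of 1 moves it fully to the values of the final emotion category.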
Abstract
A method and system for achieving emotional text to speech (TTS). The method includes: receiving text data; generating an emotion tag for the text data by rhythm piece; and performing TTS on the text data according to the emotion tag, where each emotion tag is expressed as a set of emotion vectors, and where each emotion vector includes a plurality of emotion scores given based on a plurality of emotion categories. A system for the same includes: a text data receiving module; an emotion tag generating module; and a TTS module for performing TTS, wherein the emotion tag is expressed as a set of emotion vectors, and wherein each emotion vector includes a plurality of emotion scores given based on a plurality of emotion categories.
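The emotion tag described in the abstract can be pictured as a small data structure: a list of emotion vectors, each mapping an emotion category to a score. The Python sketch below is illustrative only. The category names, the scores, and in particular the rule for deriving a final category and score (average the vectors per category, then take the highest average) are assumptions; this excerpt says only that the final values are "based on" the scores and categories.

```python
from typing import Dict, List, Tuple

# One emotion vector: emotion category -> emotion score.
EmotionVector = Dict[str, float]


def final_emotion(tag: List[EmotionVector]) -> Tuple[str, float]:
    """Return (final category, final score) for one rhythm piece.

    Assumed rule, not taken from the source: average the scores of each
    category across the vectors in the tag, then pick the largest average.
    """
    if not tag:
        raise ValueError("an emotion tag needs at least one emotion vector")
    categories = tag[0].keys()
    averaged = {c: sum(vec[c] for vec in tag) / len(tag) for c in categories}
    best = max(averaged, key=averaged.get)
    return best, averaged[best]


# Hypothetical emotion tag for one rhythm piece (two emotion vectors).
tag = [
    {"neutral": 0.2, "happy": 0.6, "sad": 0.2},
    {"neutral": 0.4, "happy": 0.4, "sad": 0.2},
]
category, score = final_emotion(tag)
print(category, score)
```

Other selection rules (for example, taking the single highest score across all vectors) would fit the same data structure; the choice here is only one plausible reading.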
19 Claims
1. A method for achieving emotional Text To Speech (TTS), the method comprising:
receiving a set of text data;
organizing each of a plurality of words in the set of text data into a plurality of rhythm pieces;
generating an emotion tag for each of the plurality of rhythm pieces, wherein each emotion tag is expressed as a set of emotion vectors, each emotion vector comprising a plurality of emotion scores, where each of the plurality of emotion scores is assigned to a different emotion category in a plurality of emotion categories;
determining, for each of the plurality of rhythm pieces, a final emotion score for the rhythm piece based on at least each of the plurality of emotion scores;
determining, for each of the plurality of rhythm pieces, a final emotional category for the rhythm piece based on at least each of the plurality of emotion categories; and
performing, by at least one processor of at least one computing device, TTS of the set of text data utilizing each of the emotion tags, where performing TTS comprises
decomposing at least one rhythm piece in the plurality of rhythm pieces into a set of phones; and
synthesizing the at least one rhythm piece into audio comprising at least one emotion characteristic based on at least one speech feature of each phone in the set of phones, where the at least one speech feature is calculated as a function of at least the final emotion score and the final emotion category, and wherein calculating the at least one speech feature is based on:
$F_i = (1 - P_{\text{emotion}}) \cdot F_{i\text{-neutral}} + P_{\text{emotion}} \cdot F_{i\text{-emotion}}$
wherein:
$F_i$ is a value of an ith speech feature of one of the set of phones,
$P_{\text{emotion}}$ is the final emotion score of the rhythm piece where one of plurality of phones lies,
$F_{i\text{-neutral}}$ is a first speech feature value of an ith speech feature in a neutral emotion category, and
$F_{i\text{-emotion}}$ is a second speech feature value of an ith speech feature in the final emotion category.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
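As a reading aid, the formula in claim 1 can be rearranged to show that it moves the feature value from its neutral setting toward its emotional setting in proportion to the final emotion score. The endpoint cases below assume that the score ranges from 0 to 1, which the claim does not state explicitly.

```latex
\begin{align*}
F_i &= (1 - P_{\text{emotion}})\,F_{i\text{-neutral}} + P_{\text{emotion}}\,F_{i\text{-emotion}} \\
    &= F_{i\text{-neutral}} + P_{\text{emotion}}\,\bigl(F_{i\text{-emotion}} - F_{i\text{-neutral}}\bigr) \\[4pt]
P_{\text{emotion}} = 0 &\;\Rightarrow\; F_i = F_{i\text{-neutral}} \\
P_{\text{emotion}} = 1 &\;\Rightarrow\; F_i = F_{i\text{-emotion}}
\end{align*}
```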
14. A system for achieving emotional Text To Speech (TTS), comprising:
at least one memory; and
at least one processor communicatively coupled to the at least one memory, the at least one processor configured to perform a method comprising:
organizing each of a plurality of words in a set of text data into a plurality of rhythm pieces;
generating an emotion tag for each of the plurality of rhythm pieces, wherein each emotion tag is expressed as a set of emotion vectors, each emotion vector comprising a plurality of emotion scores, where each of the plurality of emotion scores is assigned to a different emotion category in a plurality of emotion categories;
determining, for each of the plurality of rhythm pieces, a final emotion score for the rhythm piece based on at least each of the plurality of emotion scores;
determining, for each of the plurality of rhythm pieces, a final emotional category for the rhythm piece based on at least each of the plurality of emotion categories; and
performing, by at least one processor of at least one computing device, TTS of the set of text data utilizing each of the emotion tags, where performing TTS comprises
decomposing at least one rhythm piece in the plurality of rhythm pieces into a set of phones; and
synthesizing the at least one rhythm piece into audio comprising at least one emotion characteristic based on at least one speech feature of each phone in the set of phones, where the at least one speech feature is calculated as a function of at least the final emotion score and the final emotion category, and wherein calculating the at least one speech feature is based on:
$F_i = (1 - P_{\text{emotion}}) \cdot F_{i\text{-neutral}} + P_{\text{emotion}} \cdot F_{i\text{-emotion}}$
wherein:
$F_i$ is a value of an ith speech feature of one of the set of phones,
$P_{\text{emotion}}$ is the final emotion score of the rhythm piece where one of plurality of phones lies,
$F_{i\text{-neutral}}$ is a first speech feature value of an ith speech feature in a neutral emotion category, and
$F_{i\text{-emotion}}$ is a second speech feature value of an ith speech feature in the final emotion category.
Dependent claims: 15, 16, 17, 18
19. A computer program product for achieving emotional Text To Speech (TTS), the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to:
receive a set of text data;
organize each of a plurality of words in the set of text data into a plurality of rhythm pieces;
generate an emotion tag for each of the plurality of rhythm pieces, wherein each emotion tag is expressed as a set of emotion vectors, each emotion vector comprising a plurality of emotion scores, where each of the plurality of emotion scores is assigned to a different emotion category in a plurality of emotion categories;
determine, for each of the plurality of rhythm pieces, a final emotion score for the rhythm piece based on at least each of the plurality of emotion scores;
determine, for each of the plurality of rhythm pieces, a final emotional category for the rhythm piece based on at least each of the plurality of emotion categories; and
perform, by at least one processor of at least one computing device, TTS of the set of text data utilizing each of the emotion tags, where performing TTS comprises
decomposing at least one rhythm piece in the plurality of rhythm pieces into a set of phones; and
synthesizing the at least one rhythm piece into audio comprising at least one emotion characteristic based on at least one speech feature of each phone in the set of phones, where the at least one speech feature is calculated as a function of at least the final emotion score and the final emotion category, and wherein calculating the at least one speech feature is based on:
$F_i = (1 - P_{\text{emotion}}) \cdot F_{i\text{-neutral}} + P_{\text{emotion}} \cdot F_{i\text{-emotion}}$
wherein:
$F_i$ is a value of an ith speech feature of one of the set of phones,
$P_{\text{emotion}}$ is the final emotion score of the rhythm piece where one of plurality of phones lies,
$F_{i\text{-neutral}}$ is a first speech feature value of an ith speech feature in a neutral emotion category, and
$F_{i\text{-emotion}}$ is a second speech feature value of an ith speech feature in the final emotion category.
Specification