Method and system for achieving emotional text to speech utilizing emotion tags expressed as a set of emotion vectors
First Claim
1. A method for achieving emotional Text To Speech (TTS), the method comprising:
receiving a set of text data;
organizing each of a plurality of words in the set of text data into a plurality of rhythm pieces;
generating an emotion tag for each of the plurality of rhythm pieces, wherein each emotion tag is expressed as a set of emotion vectors, each emotion vector comprising a plurality of emotion scores, where each of the plurality of emotion scores is assigned to a different emotion category in a plurality of emotion categories;
determining, for each of the plurality of rhythm pieces, a final emotion score for the rhythm piece based on at least each of the plurality of emotion scores;
determining, for each of the plurality of rhythm pieces, a final emotional category for the rhythm piece based on at least each of the plurality of emotion categories;
applying emotion smoothing to the set of text data based on the emotion tags generated for the plurality of rhythm pieces, wherein applying emotion smoothing comprises determining a plurality of emotion paths based on adjacent probabilities between the final emotional categories determined for the plurality of rhythm pieces;
determining a final emotion path from the plurality of emotion paths based on a sum of adjacent probability and a sum of emotion score for each emotion path in the plurality of emotion paths; and
updating the final emotional category for each rhythm piece based on the final emotion path; and
performing, by at least one processor of at least one computing device, TTS of the set of text data utilizing each of the emotion tags, where performing TTS comprises decomposing at least one rhythm piece in the plurality of rhythm pieces into a set of phones; and
synthesizing the at least one rhythm piece into audio comprising at least one emotion characteristic based on at least one speech feature of each phone in the set of phones, where the at least one speech feature is calculated as a function of at least the final emotion score, the updated final emotion category, a speech feature value of a given speech feature in a neutral emotion category, and a speech feature value of a given speech feature in the updated final emotion category.
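The tagging and category-selection steps recited above can be sketched in code. This is a hypothetical illustration only: the category names, scores, and the rule of taking the maximum-score entry as the final emotion are assumptions for illustration, not details fixed by the claim.

```python
# Hypothetical sketch of the claimed tagging step: each rhythm piece carries
# an emotion vector (one score per emotion category), and the final emotion
# category/score pair is taken here as the maximum-score entry.
CATEGORIES = ["neutral", "happy", "sad", "angry"]  # assumed category set

def final_emotion(vector):
    """Pick the final (category, score) pair from one emotion vector."""
    scores = dict(zip(CATEGORIES, vector))
    category = max(scores, key=scores.get)
    return category, scores[category]

# One emotion vector per rhythm piece of the phrase "What a lovely day"
# (scores invented for illustration).
tags = {
    "What a": [0.2, 0.6, 0.1, 0.1],
    "lovely day": [0.1, 0.8, 0.05, 0.05],
}
finals = {piece: final_emotion(v) for piece, v in tags.items()}
print(finals)  # both rhythm pieces resolve to the "happy" category
```

The per-piece result feeds the smoothing step, which may then revise each piece's final category so adjacent pieces form a coherent emotion path.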
Abstract
A method and system for achieving emotional text to speech. The method includes: receiving text data; generating an emotion tag for the text data by rhythm piece; and achieving TTS for the text data corresponding to the emotion tag, where each emotion tag is expressed as a set of emotion vectors and each emotion vector includes a plurality of emotion scores given based on a plurality of emotion categories. A system for the same includes: a text data receiving module; an emotion tag generating module; and a TTS module for achieving TTS, wherein the emotion tag is expressed as a set of emotion vectors and wherein each emotion vector includes a plurality of emotion scores given based on a plurality of emotion categories.
16 Claims
1. (Set forth in full above as the First Claim.) Dependent claims: 2-12.
13. A system for achieving emotional Text To Speech (TTS), comprising:
at least one memory; and
at least one processor communicatively coupled to the at least one memory, the at least one processor configured to perform a method comprising:
organizing each of a plurality of words in a set of text data into a plurality of rhythm pieces;
generating an emotion tag for each of the plurality of rhythm pieces, wherein each emotion tag is expressed as a set of emotion vectors, each emotion vector comprising a plurality of emotion scores, where each of the plurality of emotion scores is assigned to a different emotion category in a plurality of emotion categories;
determining, for each of the plurality of rhythm pieces, a final emotion score for the rhythm piece based on at least each of the plurality of emotion scores;
determining, for each of the plurality of rhythm pieces, a final emotional category for the rhythm piece based on at least each of the plurality of emotion categories;
applying emotion smoothing to the set of text data based on the emotion tags generated for the plurality of rhythm pieces, wherein applying emotion smoothing comprises determining a plurality of emotion paths based on adjacent probabilities between the final emotional categories determined for the plurality of rhythm pieces;
determining a final emotion path from the plurality of emotion paths based on a sum of adjacent probability and a sum of emotion score for each emotion path in the plurality of emotion paths; and
updating the final emotional category for each rhythm piece based on the final emotion path; and
performing, by at least one processor of at least one computing device, TTS of the set of text data utilizing each of the emotion tags, where performing TTS comprises decomposing at least one rhythm piece in the plurality of rhythm pieces into a set of phones; and
synthesizing the at least one rhythm piece into audio comprising at least one emotion characteristic based on at least one speech feature of each phone in the set of phones, where the at least one speech feature is calculated as a function of at least the final emotion score and the final emotion category, and where the at least one speech feature is calculated as a function of at least the final emotion score, the updated final emotion category, a speech feature value of a given speech feature in a neutral emotion category, and a speech feature value of a given speech feature in the updated final emotion category.
Dependent claims: 14, 15.
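The smoothing step recited in the claims can be sketched as a best-path search: enumerate candidate emotion paths across consecutive rhythm pieces and keep the path whose summed adjacent-transition probability plus summed emotion score is largest. The adjacency table, candidate categories, and scores below are invented for illustration; the claims do not fix these values, and a real system would likely use dynamic programming rather than brute-force enumeration.

```python
# Minimal sketch of emotion smoothing as a path search, assuming a small
# hypothetical table of adjacent-category probabilities.
from itertools import product

ADJACENT = {  # P(next category | current category), hypothetical values
    ("happy", "happy"): 0.7, ("happy", "neutral"): 0.3,
    ("neutral", "happy"): 0.4, ("neutral", "neutral"): 0.6,
}

def best_path(candidates):
    """candidates: one {category: score} dict per rhythm piece.
    Returns the emotion path maximizing (sum of adjacent probabilities
    + sum of emotion scores), as the claim's final-emotion-path step."""
    best, best_total = None, float("-inf")
    for path in product(*(c.keys() for c in candidates)):
        score_sum = sum(c[cat] for c, cat in zip(candidates, path))
        adj_sum = sum(ADJACENT.get(pair, 0.0) for pair in zip(path, path[1:]))
        if score_sum + adj_sum > best_total:
            best, best_total = path, score_sum + adj_sum
    return best

# Two rhythm pieces: the second piece's locally best category ("neutral")
# is overridden because the happy->happy transition is more probable.
pieces = [{"happy": 0.6, "neutral": 0.4}, {"neutral": 0.55, "happy": 0.45}]
print(best_path(pieces))  # ('happy', 'happy')
```

This shows why smoothing can update a piece's final emotional category: the path total (1.05 + 0.7 = 1.75 for happy/happy) beats the locally greedy choice (1.15 + 0.3 = 1.45 for happy/neutral).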
16. A computer program product for achieving emotional Text To Speech (TTS), the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to:
receive a set of text data;
organize each of a plurality of words in the set of text data into a plurality of rhythm pieces;
generate an emotion tag for each of the plurality of rhythm pieces, wherein each emotion tag is expressed as a set of emotion vectors, each emotion vector comprising a plurality of emotion scores, where each of the plurality of emotion scores is assigned to a different emotion category in a plurality of emotion categories;
determine, for each of the plurality of rhythm pieces, a final emotion score for the rhythm piece based on at least each of the plurality of emotion scores;
determine, for each of the plurality of rhythm pieces, a final emotional category for the rhythm piece based on at least each of the plurality of emotion categories;
apply emotion smoothing to the set of text data based on the emotion tags generated for the plurality of rhythm pieces, wherein applying emotion smoothing comprises determining a plurality of emotion paths based on adjacent probabilities between the final emotional categories determined for the plurality of rhythm pieces;
determining a final emotion path from the plurality of emotion paths based on a sum of adjacent probability and a sum of emotion score for each emotion path in the plurality of emotion paths; and
updating the final emotional category for each rhythm piece based on the final emotion path; and
perform, by at least one processor of at least one computing device, TTS of the set of text data utilizing each of the emotion tags, where performing TTS comprises decomposing at least one rhythm piece in the plurality of rhythm pieces into a set of phones; and
synthesizing the at least one rhythm piece into audio comprising at least one emotion characteristic based on at least one speech feature of each phone in the set of phones, where the at least one speech feature is calculated as a function of at least the final emotion score and the final emotion category, and where the at least one speech feature is calculated as a function of at least the final emotion score, the updated final emotion category, a speech feature value of a given speech feature in a neutral emotion category, and a speech feature value of a given speech feature in the updated final emotion category.
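The claims require that each speech feature be computed from the final emotion score, the updated final emotion category, the feature's value in the neutral category, and its value in the updated category, without fixing an exact formula. One plausible reading, sketched here purely as an assumption, is a linear interpolation from the neutral value toward the target-emotion value, weighted by the emotion score.

```python
# Hedged sketch of the claimed feature calculation: the blend rule and the
# example F0 values are assumptions for illustration, not the patent's formula.
def emotional_feature(neutral_value, emotion_value, emotion_score):
    """Blend a speech feature (e.g. F0 in Hz) for one phone from its
    neutral-category value toward its value in the updated final
    emotion category, weighted by the final emotion score in [0, 1]."""
    return neutral_value + emotion_score * (emotion_value - neutral_value)

# F0 for one phone: 200 Hz in the neutral category, 260 Hz in the "happy"
# category, with a final emotion score of 0.5.
print(emotional_feature(200.0, 260.0, 0.5))  # 230.0
```

Under this reading, a score of 0 reproduces the neutral feature value and a score of 1 reproduces the target-category value, so weak emotions yield correspondingly subtle acoustic changes.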
Specification