Method and system for achieving emotional text to speech utilizing emotion tags expressed as a set of emotion vectors

US 9,570,063 B2
Filed: 07/23/2015
Issued: 02/14/2017
Est. Priority Date: 08/31/2010
Status: Active Grant

Abstract

A method and system for achieving emotional text to speech (TTS). The method includes: receiving text data; generating an emotion tag for the text data, rhythm piece by rhythm piece; and performing TTS on the text data according to the emotion tag, where the emotion tag is expressed as a set of emotion vectors and each emotion vector includes a plurality of emotion scores given based on a plurality of emotion categories. A system for the same includes: a text data receiving module; an emotion tag generating module; and a TTS module for performing TTS, wherein the emotion tag is expressed as a set of emotion vectors and each emotion vector includes a plurality of emotion scores given based on a plurality of emotion categories.
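
As a rough illustration of the data involved, the sketch below models an emotion tag as a list of emotion vectors, one per rhythm piece, with one score per category. The category names, the `RhythmPiece` class, the `final_emotion` helper, and all score values are assumptions made for this example and do not come from the patent; picking the highest score as the final score and category is only one possible rule.

```python
from dataclasses import dataclass

# Hypothetical category set; the source only requires "a plurality of emotion categories".
CATEGORIES = ("neutral", "happy", "sad", "angry")


@dataclass
class RhythmPiece:
    text: str
    scores: dict  # one emotion vector: category name -> emotion score


def final_emotion(piece):
    """Pick the highest-scoring category and its score for one rhythm piece."""
    category = max(CATEGORIES, key=lambda c: piece.scores[c])
    return category, piece.scores[category]


# Illustrative, made-up scores; the emotion tag is the list of per-piece vectors.
pieces = [
    RhythmPiece("I passed", {"neutral": 0.2, "happy": 0.7, "sad": 0.05, "angry": 0.05}),
    RhythmPiece("the exam", {"neutral": 0.6, "happy": 0.3, "sad": 0.05, "angry": 0.05}),
]
emotion_tag = [p.scores for p in pieces]
finals = [final_emotion(p) for p in pieces]
```

Keeping the full vector for each piece, rather than a single label, is what allows later steps such as context adjustment or smoothing to reconsider the lower-scoring categories.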

19 Claims

1. A method for achieving emotional Text To Speech (TTS), the method comprising:
- receiving a set of text data;
  
  organizing each of a plurality of words in the set of text data into a plurality of rhythm pieces;
  
  generating an emotion tag for each of the plurality of rhythm pieces, wherein each emotion tag is expressed as a set of emotion vectors, each emotion vector comprising a plurality of emotion scores, where each of the plurality of emotion scores is assigned to a different emotion category in a plurality of emotion categories;
  
  determining, for each of the plurality of rhythm pieces, a final emotion score for the rhythm piece based on at least each of the plurality of emotion scores;
  
  determining, for each of the plurality of rhythm pieces, a final emotional category for the rhythm piece based on at least each of the plurality of emotion categories; and
  
  performing, by at least one processor of at least one computing device, TTS of the set of text data utilizing each of the emotion tags, where performing TTS comprises decomposing at least one rhythm piece in the plurality of rhythm pieces into a set of phones; and
  
  synthesizing the at least one rhythm piece into audio comprising at least one emotion characteristic based on at least one speech feature of each phone in the set of phones, where the at least one speech feature is calculated as a function of at least the final emotion score and the final emotion category, and wherein calculating the at least one speech feature is based on:
  
  $F_i = (1 - P_{emotion}) \cdot F_{i\text{-neutral}} + P_{emotion} \cdot F_{i\text{-emotion}}$
  
  wherein $F_i$ is a value of an $i^{\text{th}}$ speech feature of one of the set of phones, $P_{emotion}$ is the final emotion score of the rhythm piece where one of plurality of phones lies, $F_{i\text{-neutral}}$ is a first speech feature value of an $i^{\text{th}}$ speech feature in a neutral emotion category, and $F_{i\text{-emotion}}$ is a second speech feature value of an $i^{\text{th}}$ speech feature in the final emotion category.
- - 2. The method according to claim 1, wherein determining the final emotion score comprises:
    - designating the final emotion score as an emotion score in the plurality of emotion scores comprising a highest value.
  - 3. The method according to claim 1, further comprising:
    - adjusting, for at least one of the plurality of rhythm pieces, at least one emotion score in the plurality of emotion scores according to a context of the rhythm piece; and
      
      determining the final emotion score and the final emotion category of the rhythm piece based on the plurality of emotion scores comprising the at least one emotion score that has been adjusted.
  - 4. The method according to claim 3, wherein adjusting the at least one emotion score further comprises:
    - adjusting the at least one emotion score based on an emotion vector adjustment decision tree, wherein the emotion vector adjustment decision tree is established based on emotion vector adjustment training data.
  - 5. The method according to claim 1, further comprising:
    - applying emotion smoothing to the set of text data based on the emotion tags generated for the plurality of rhythm pieces.
  - 6. The method according to claim 5, wherein applying emotion smoothing comprises:
    - obtaining an adjacent probability that a first emotion category associated with a first of the plurality of rhythm pieces is connected to a second emotion category of a second of the plurality of rhythm pieces that is adjacent to the first of the plurality of rhythm pieces;
      
      determining a final emotion path of the set of text data based on the adjacent probability and a plurality of emotion scores of corresponding emotion categories; and
      
      determining the final emotion category of each of the plurality of rhythm pieces based on the final emotion path.
  - 7. The method according to claim 6, further comprising:
    - determining the final emotion score from the final emotion category, wherein the final emotion score has a highest value in the plurality of emotion scores.
  - 8. The method according to claim 6, wherein obtaining an adjacent probability further comprises:
    - performing a statistical analysis on emotion adjacent training data, wherein the statistical analysis records a number of times where at least two of the plurality of emotion categories had been adjacent in the emotion adjacent training data.
  - 9. The method according to claim 8, further comprising:
    - expanding the emotion adjacent training data based on the formed final emotion path.
  - 10. The method according to claim 8, further comprising:
    - expanding the emotion adjacent training data by connecting at least one of the plurality of emotion categories with a highest value in the plurality of emotion scores.
  - 11. The method according to claim 1, wherein calculating the at least one speech feature of each phone further comprises:
    - determining if the final emotion score of the rhythm piece where the phone lies is greater than a certain threshold, based on:
      
      $F_i = F_{i\text{-emotion}}$.
  - 12. The method according to claim 1, wherein calculating the at least one speech feature of each phone further comprises:
    - determining if the final emotion score of the rhythm piece where one the phone lies is smaller than a certain threshold, based on:
      
      $F_i = F_{i\text{-neutral}}$.
  - 13. The method according to claim 1, wherein the speech feature comprises at least one of:
    - a basic frequency feature, a frequency spectrum feature, a time length feature, and a combination thereof.
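
The speech feature calculation in claim 1, together with the threshold cases in claims 11 and 12, can be sketched as a small function. This is a minimal illustration only: the function name `blend_feature`, the optional `lower` and `upper` parameters, and the sample feature values are assumptions, since the claims say only "a certain threshold" and give no numbers.

```python
def blend_feature(p_emotion, f_neutral, f_emotion, lower=None, upper=None):
    """Compute one speech feature value F_i for a phone.

    p_emotion: final emotion score of the rhythm piece containing the phone.
    f_neutral: feature value in the neutral emotion category.
    f_emotion: feature value in the final emotion category.
    lower, upper: hypothetical thresholds (not specified in the claims).
    """
    if upper is not None and p_emotion > upper:
        return f_emotion  # score above the threshold: use the emotional value
    if lower is not None and p_emotion < lower:
        return f_neutral  # score below the threshold: use the neutral value
    # Otherwise interpolate linearly between the neutral and emotional values.
    return (1.0 - p_emotion) * f_neutral + p_emotion * f_emotion


# Made-up example values, e.g. a basic frequency feature in Hz.
halfway = blend_feature(0.5, f_neutral=200.0, f_emotion=260.0)
clamped_high = blend_feature(0.95, 200.0, 260.0, lower=0.1, upper=0.9)
clamped_low = blend_feature(0.05, 200.0, 260.0, lower=0.1, upper=0.9)
```

With a score of 0 the result equals the neutral value and with a score of 1 it equals the emotional value, so the score acts as a mixing weight between the two.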

14. A system for achieving emotional Text To Speech (TTS), comprising:
- at least one memory; and
  
  at least one processor communicatively coupled to the at least one memory, the at least one processor configured to perform a method comprising;
  
  organizing each of a plurality of words in a set of text data into a plurality of rhythm pieces;
  
  generating an emotion tag for each of the plurality of rhythm pieces, wherein each emotion tag is expressed as a set of emotion vectors, each emotion vector comprising a plurality of emotion scores, where each of the plurality of emotion scores is assigned to a different emotion category in a plurality of emotion categories;
  
  determining, for each of the plurality of rhythm pieces, a final emotion score for the rhythm piece based on at least each of the plurality of emotion scores;
  
  determining, for each of the plurality of rhythm pieces, a final emotional category for the rhythm piece based on at least each of the plurality of emotion categories; and
  
  performing, by at least one processor of at least one computing device, TTS of the set of text data utilizing each of the emotion tags, where performing TTS comprises decomposing at least one rhythm piece in the plurality of rhythm pieces into a set of phones; and
  
  synthesizing the at least one rhythm piece into audio comprising at least one emotion characteristic based on at least one speech feature of each phone in the set of phones, where the at least one speech feature is calculated as a function of at least the final emotion score and the final emotion category, and wherein calculating the at least one speech feature is based on:
  
  $F_i = (1 - P_{emotion}) \cdot F_{i\text{-neutral}} + P_{emotion} \cdot F_{i\text{-emotion}}$
  
  wherein $F_i$ is a value of an $i^{\text{th}}$ speech feature of one of the set of phones, $P_{emotion}$ is the final emotion score of the rhythm piece where one of plurality of phones lies, $F_{i\text{-neutral}}$ is a first speech feature value of an $i^{\text{th}}$ speech feature in a neutral emotion category, and $F_{i\text{-emotion}}$ is a second speech feature value of an $i^{\text{th}}$ speech feature in the final emotion category.
- - 15. The system of claim 14, wherein determining the final emotion score comprises:
    - designating the final emotion score as an emotion score in the plurality of emotion scores comprising a highest value.
  - 16. The system of claim 14, wherein the method further comprises:
    - adjusting, for at least one of the plurality of rhythm pieces, at least one emotion score in the plurality of emotion scores according to a context of the rhythm piece; and
      
      determining the final emotion score and the final emotion category of the rhythm piece based on the plurality of emotion scores comprising the at least one emotion score that has been adjusted.
  - 17. The system of claim 14, wherein the method further comprises:
    - applying emotion smoothing to the set of text data based on the emotion tags generated for the plurality of rhythm pieces.
  - 18. The system of claim 17, wherein applying emotion smoothing further comprises:
    - obtaining an adjacent probability that a first emotion category associated with a first of the plurality of rhythm pieces is connected to a second emotion category of a second of the plurality of rhythm pieces that is adjacent to the first of the plurality of rhythm pieces;
      
      determining a final emotion path of the set of text data based on the adjacent probability and a plurality of emotion scores of corresponding emotion categories; and
      
      determining the final emotion category of each of the plurality of rhythm pieces based on the final emotion path.
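
The emotion smoothing described in the claims (an adjacent probability between categories of neighbouring rhythm pieces, then a final emotion path) can be sketched as follows. The claims do not name a search method; this example uses a Viterbi-style dynamic program as one plausible choice. The function names, the add-`alpha` count smoothing, the category names, and all data are assumptions made for illustration.

```python
import math
from collections import Counter


def adjacency_probs(tagged_sequences, categories, alpha=1.0):
    """Estimate P(next | prev) from counts of adjacent category pairs.

    alpha is add-alpha smoothing (an assumption) so unseen pairs stay non-zero.
    """
    counts = Counter()
    for seq in tagged_sequences:
        counts.update(zip(seq, seq[1:]))
    probs = {}
    for prev in categories:
        total = sum(counts[(prev, nxt)] for nxt in categories) + alpha * len(categories)
        for nxt in categories:
            probs[(prev, nxt)] = (counts[(prev, nxt)] + alpha) / total
    return probs


def smooth_path(emotion_vectors, probs, categories):
    """Return the category path maximizing the product of scores and adjacency probabilities."""
    eps = 1e-12  # avoids log(0) for zero scores
    best = {c: math.log(emotion_vectors[0][c] + eps) for c in categories}
    back = []
    for vec in emotion_vectors[1:]:
        step, pointers = {}, {}
        for c in categories:
            prev = max(categories, key=lambda p: best[p] + math.log(probs[(p, c)]))
            step[c] = best[prev] + math.log(probs[(prev, c)]) + math.log(vec[c] + eps)
            pointers[c] = prev
        best = step
        back.append(pointers)
    path = [max(categories, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


categories = ("neutral", "happy", "sad")
# Made-up emotion adjacent training data: category sequences of adjacent rhythm pieces.
training = [["happy", "happy", "neutral"], ["sad", "sad", "sad"], ["happy", "happy", "happy"]]
vectors = [
    {"neutral": 0.1, "happy": 0.8, "sad": 0.1},
    {"neutral": 0.3, "happy": 0.3, "sad": 0.4},
    {"neutral": 0.1, "happy": 0.8, "sad": 0.1},
]
probs = adjacency_probs(training, categories)
path = smooth_path(vectors, probs, categories)
```

In this made-up data the middle rhythm piece scores slightly higher for "sad" on its own, but because "happy" rarely neighbours "sad" in the training sequences, the smoothed path keeps "happy" throughout. That is the kind of isolated jump smoothing is meant to remove.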

19. A computer program product for achieving emotional Text To Speech (TTS), the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to:
- receive a set of text data;
  
  organize each of a plurality of words in the set of text data into a plurality of rhythm pieces;
  
  generate an emotion tag for each of the plurality of rhythm pieces, wherein each emotion tag is expressed as a set of emotion vectors, each emotion vector comprising a plurality of emotion scores, where each of the plurality of emotion scores is assigned to a different emotion category in a plurality of emotion categories;
  
  determine, for each of the plurality of rhythm pieces, a final emotion score for the rhythm piece based on at least each of the plurality of emotion scores;
  
  determine, for each of the plurality of rhythm pieces, a final emotional category for the rhythm piece based on at least each of the plurality of emotion categories; and
  
  perform, by at least one processor of at least one computing device, TTS of the set of text data utilizing each of the emotion tags, where performing TTS comprises decomposing at least one rhythm piece in the plurality of rhythm pieces into a set of phones; and
  
  synthesizing the at least one rhythm piece into audio comprising at least one emotion characteristic based on at least one speech feature of each phone in the set of phones, where the at least one speech feature is calculated as a function of at least the final emotion score and the final emotion category, and wherein calculating the at least one speech feature is based on:
  
  $F_i = (1 - P_{emotion}) \cdot F_{i\text{-neutral}} + P_{emotion} \cdot F_{i\text{-emotion}}$
  
  wherein $F_i$ is a value of an $i^{\text{th}}$ speech feature of one of the set of phones, $P_{emotion}$ is the final emotion score of the rhythm piece where one of plurality of phones lies, $F_{i\text{-neutral}}$ is a first speech feature value of an $i^{\text{th}}$ speech feature in a neutral emotion category, and $F_{i\text{-emotion}}$ is a second speech feature value of an $i^{\text{th}}$ speech feature in the final emotion category.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Bao, Shenghua; Chen, Jian; Qin, Yong; Shi, Qin; Shuang, Zhiwei; Su, Zhong; Wen, Liu; Zhang, Shi Lei
Primary Examiner(s): WOZNIAK, JAMES S
Application Number: US14/807,052
Publication Number: US 20150325233A1
Time in Patent Office: 572 Days
Field of Search: 704/9, 704/258, 704/260, 704/266, 704/268
US Class Current: 1/1
CPC Class Codes: G10L 13/02 (Methods for producing synth...), G10L 13/08 (Text analysis or generation...), G10L 13/10 (Prosody rules derived from ...)
