Using emoticons for contextual text-to-speech expressivity
First Claim
1. A computer-implemented method comprising:
receiving, by a computing system, data comprising text and a plurality of emoticons;
performing, by the computing system, a text-to-speech conversion of the data, wherein the text-to-speech conversion of the data further comprises:
determining, by the computing system, a local expressivity corresponding to a group of emoticons of the plurality of emoticons based on a calculation of boundaries of the text, wherein each emoticon of the group of emoticons is located in proximity to a phrase associated with the text within the boundaries that each emoticon is associated with and wherein the local expressivity is associated with a first audio intensity level;
determining, by the computing system, a global expressivity for the data, wherein the global expressivity corresponds to a global multiplier determined after parsing an entire text without the boundaries and the global multiplier modifies the first audio intensity level;
determining, by the computing system, a second audio intensity level associated with the global expressivity; and
generating, by the computing system and based on the modified first audio intensity level and the second audio intensity level, an audible signal representative of the text-to-speech conversion of the data.
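The interplay of local and global expressivity in claim 1 is easier to see in code. The Python sketch below is only one plausible reading of the claim, not the patent's implementation: phrase boundaries are computed naively at sentence punctuation, a first audio intensity level comes from emoticons found within each phrase, and a single global multiplier parsed from the entire text modifies that level. The emoticon lexicon, the intensity values, and every function name here are hypothetical.

```python
import re

# Hypothetical emoticon-to-intensity lexicon; values are illustrative only.
EMOTICON_INTENSITY = {":)": 0.3, ":D": 0.6, ":(": -0.3, ":'(": -0.5}

def split_boundaries(text):
    """Calculate text boundaries (here, naively at sentence punctuation)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def local_intensity(phrase):
    """First audio intensity level: mean intensity of emoticons in the phrase."""
    found = [v for emo, v in EMOTICON_INTENSITY.items() if emo in phrase]
    return sum(found) / len(found) if found else 0.0

def global_multiplier(text):
    """Global expressivity: one multiplier from parsing the entire text, ignoring boundaries."""
    count = sum(text.count(emo) for emo in EMOTICON_INTENSITY)
    return 1.0 + min(count, 5) * 0.1  # more emoticons overall -> stronger expressivity

def intensity_levels(text):
    """Per-phrase modified first intensity plus the second (global) intensity level."""
    g = global_multiplier(text)
    second_level = g - 1.0  # illustrative mapping of the multiplier to a level
    return [(p, local_intensity(p) * g, second_level) for p in split_boundaries(text)]

if __name__ == "__main__":
    for phrase, first, second in intensity_levels("Great news :D We won! The flight is delayed :("):
        print(f"{phrase!r}: local={first:+.2f}, global={second:+.2f}")
```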
Abstract
Techniques disclosed herein include systems and methods that improve the audible emotional characteristics used when synthesizing speech from a text source. Systems and methods herein use emoticons identified in a source text to provide contextual text-to-speech (TTS) expressivity. In general, techniques herein analyze text and identify emoticons included within it. The source text is then tagged with corresponding mood indicators. For example, if the system identifies an emoticon at the end of a sentence, the system can infer that the sentence has a specific tone or mood associated with it. Depending on whether the emoticon is a smiley face, angry face, sad face, laughing face, etc., the system can infer the intended tone or mood and then modify the expressivity of the TTS output, such as by changing intonation, prosody, speed, pauses, and other expressivity characteristics.
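As a concrete illustration of the tagging step described in the abstract, the sketch below maps emoticons to mood tags and mood tags to prosody settings. The MOOD_OF and PROSODY tables, the parameter names, and all values are invented for illustration; the patent does not specify them.

```python
# Hypothetical mapping from emoticons to mood tags and prosody tweaks.
# Longer emoticons come first so ">:(" wins over its substring ":(".
MOOD_OF = {">:(": "angry", ":D": "laughing", ":)": "happy", ":(": "sad"}
PROSODY = {
    "angry":    {"pitch_shift": +1, "rate": 1.15, "pause_ms": 120},
    "laughing": {"pitch_shift": +3, "rate": 1.10, "pause_ms": 100},
    "happy":    {"pitch_shift": +2, "rate": 1.05, "pause_ms": 150},
    "sad":      {"pitch_shift": -2, "rate": 0.90, "pause_ms": 300},
    "neutral":  {"pitch_shift":  0, "rate": 1.00, "pause_ms": 200},
}

def tag_sentence(sentence):
    """Tag a sentence with the mood of the first emoticon found in it (if any)."""
    for emo, mood in MOOD_OF.items():
        if emo in sentence:
            return mood
    return "neutral"

def expressivity_settings(sentence):
    """Return prosody settings a TTS engine could apply when rendering the sentence."""
    return PROSODY[tag_sentence(sentence)]

print(expressivity_settings("See you tomorrow :)"))  # {'pitch_shift': 2, ...}
```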
20 Claims
1. A computer-implemented method comprising:
receiving, by a computing system, data comprising text and a plurality of emoticons;
performing, by the computing system, a text-to-speech conversion of the data, wherein the text-to-speech conversion of the data further comprises:
determining, by the computing system, a local expressivity corresponding to a group of emoticons of the plurality of emoticons based on a calculation of boundaries of the text, wherein each emoticon of the group of emoticons is located in proximity to a phrase associated with the text within the boundaries that each emoticon is associated with and wherein the local expressivity is associated with a first audio intensity level;
determining, by the computing system, a global expressivity for the data, wherein the global expressivity corresponds to a global multiplier determined after parsing an entire text without the boundaries and the global multiplier modifies the first audio intensity level;
determining, by the computing system, a second audio intensity level associated with the global expressivity; and
generating, by the computing system and based on the modified first audio intensity level and the second audio intensity level, an audible signal representative of the text-to-speech conversion of the data.
View Dependent Claims (2, 3, 4, 5, 6, 7)
8. A system comprising:
at least one processor; and
a memory storing instructions that when executed by the at least one processor cause the system to convert text to speech by configuring the system to:
receive data comprising text and a plurality of emoticons;
determine a local expressivity corresponding to a group of emoticons of the plurality of emoticons based on a calculation of boundaries of the text, wherein the group of emoticons is located in proximity to a phrase of the text within the boundaries;
determine, based on the local expressivity, a first audio intensity level;
determine a global expressivity for the data, wherein the global expressivity corresponds to a global multiplier determined after parsing an entire text without the boundaries and the global multiplier modifies the first audio intensity level;
determine a second audio intensity level associated with the global expressivity; and
generate, based on the modified first audio intensity level and the second audio intensity level, an audible signal representing a text-to-speech conversion of the data.
View Dependent Claims (9, 10, 11, 12, 13, 14)
15. One or more non-transitory computer-readable media having instructions stored thereon that when executed by one or more computers cause the one or more computers to convert text to speech by configuring the one or more computers to:
receive data comprising text and a plurality of emoticons;
determine a local expressivity corresponding to a group of emoticons of the plurality of emoticons based on a calculation of boundaries of the text, wherein each emoticon of the group of emoticons is located in proximity to a phrase of the text within the boundaries;
determine, based on the local expressivity, a first audio intensity level;
determine a global expressivity for the data, wherein the global expressivity corresponds to a global multiplier determined after parsing an entire text without the boundaries and the global multiplier modifies the first audio intensity level;
determine a second audio intensity level associated with the global expressivity; and
generate, based on the modified first audio intensity level and the second audio intensity level, an audible signal representative of text-to-speech conversion of the data.
View Dependent Claims (16, 17, 18, 19, 20)
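Independent claims 1, 8, and 15 all end with the same generating step: an audible signal produced from the modified first intensity level and the second intensity level. The fragment below, again only a sketch with a stand-in synthesizer rather than a real TTS engine, shows one way those levels could be applied as per-phrase gain on synthesized audio; synthesize and generate_signal are hypothetical names.

```python
import numpy as np

SAMPLE_RATE = 16_000

def synthesize(phrase, seconds=0.5):
    """Stand-in for a real TTS engine: a 220 Hz tone in place of spoken audio."""
    t = np.linspace(0, seconds, int(SAMPLE_RATE * seconds), endpoint=False)
    return 0.2 * np.sin(2 * np.pi * 220 * t)

def generate_signal(phrases_with_levels, second_level):
    """Apply each phrase's modified first level, plus the global second level, as gain."""
    chunks = [(1.0 + first + second_level) * synthesize(p)
              for p, first in phrases_with_levels]
    return np.concatenate(chunks)  # the claims' 'audible signal' as raw samples

# Levels as produced by a local/global analysis like the earlier sketch.
audio = generate_signal([("Great news :D", 0.66), ("Flight delayed :(", -0.33)], 0.2)
print(audio.shape)
```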
Specification