Using emoticons for contextual text-to-speech expressivity
First Claim
1. A computer-implemented method comprising:
receiving, by a computing system, data comprising text and a plurality of emoticons;
performing, by the computing system, a text-to-speech conversion of the data, wherein the text-to-speech conversion of the data further comprises:
determining, by the computing system, a local expressivity corresponding to a group of emoticons of the plurality of emoticons based on a calculation of boundaries of the text, wherein each emoticon of the group of emoticons is located in proximity to a phrase associated with the text within the boundaries that each emoticon is associated with and wherein the local expressivity is associated with a first audio intensity level;
determining, by the computing system, a global expressivity for the data, wherein the global expressivity corresponds to a global multiplier determined after parsing an entire text without the boundaries and the global multiplier modifies the first audio intensity level;
determining, by the computing system, a second audio intensity level associated with the global expressivity; and
generating, by the computing system and based on the modified first audio intensity level and the second audio intensity level, an audible signal representative of the text-to-speech conversion of the data.
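The interplay of local and global expressivity in claim 1 is easier to see in code. The Python sketch below is only one plausible reading of the claim, not the patent's implementation: phrase boundaries are computed naively at sentence punctuation, a first audio intensity level comes from emoticons found within each phrase, and a single global multiplier parsed from the entire text modifies that level. The emoticon lexicon, the intensity values, and every function name here are hypothetical.

```python
import re

# Hypothetical emoticon-to-intensity lexicon; values are illustrative only.
EMOTICON_INTENSITY = {":)": 0.3, ":D": 0.6, ":(": -0.3, ":'(": -0.5}

def split_boundaries(text):
    """Calculate text boundaries (here, naively at sentence punctuation)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def local_intensity(phrase):
    """First audio intensity level: mean intensity of emoticons in the phrase."""
    found = [v for emo, v in EMOTICON_INTENSITY.items() if emo in phrase]
    return sum(found) / len(found) if found else 0.0

def global_multiplier(text):
    """Global expressivity: one multiplier from parsing the entire text, ignoring boundaries."""
    count = sum(text.count(emo) for emo in EMOTICON_INTENSITY)
    return 1.0 + min(count, 5) * 0.1  # more emoticons overall -> stronger expressivity

def intensity_levels(text):
    """Per-phrase modified first intensity plus the second (global) intensity level."""
    g = global_multiplier(text)
    second_level = g - 1.0  # illustrative mapping of the multiplier to a level
    return [(p, local_intensity(p) * g, second_level) for p in split_boundaries(text)]

if __name__ == "__main__":
    for phrase, first, second in intensity_levels("Great news :D We won! The flight is delayed :("):
        print(f"{phrase!r}: local={first:+.2f}, global={second:+.2f}")
```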
Abstract
Techniques disclosed herein include systems and methods that improve the audible emotional characteristics used when synthesizing speech from a text source. Systems and methods herein use emoticons identified in a source text to provide contextual text-to-speech (TTS) expressivity. In general, techniques herein analyze text and identify emoticons included within it. The source text is then tagged with corresponding mood indicators. For example, if the system identifies an emoticon at the end of a sentence, the system can infer that the sentence has a specific tone or mood associated with it. Depending on whether the emoticon is a smiley face, angry face, sad face, laughing face, etc., the system can infer the intended tone or mood and then modify the expressivity of the TTS output, such as by changing intonation, prosody, speed, pauses, and other expressivity characteristics.
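As a concrete illustration of the tagging step described in the abstract, the sketch below maps emoticons to mood tags and mood tags to prosody settings. The MOOD_OF and PROSODY tables, the parameter names, and all values are invented for illustration; the patent does not specify them.

```python
# Hypothetical mapping from emoticons to mood tags and prosody tweaks.
# Longer emoticons come first so ">:(" wins over its substring ":(".
MOOD_OF = {">:(": "angry", ":D": "laughing", ":)": "happy", ":(": "sad"}
PROSODY = {
    "angry":    {"pitch_shift": +1, "rate": 1.15, "pause_ms": 120},
    "laughing": {"pitch_shift": +3, "rate": 1.10, "pause_ms": 100},
    "happy":    {"pitch_shift": +2, "rate": 1.05, "pause_ms": 150},
    "sad":      {"pitch_shift": -2, "rate": 0.90, "pause_ms": 300},
    "neutral":  {"pitch_shift":  0, "rate": 1.00, "pause_ms": 200},
}

def tag_sentence(sentence):
    """Tag a sentence with the mood of the first emoticon found in it (if any)."""
    for emo, mood in MOOD_OF.items():
        if emo in sentence:
            return mood
    return "neutral"

def expressivity_settings(sentence):
    """Return prosody settings a TTS engine could apply when rendering the sentence."""
    return PROSODY[tag_sentence(sentence)]

print(expressivity_settings("See you tomorrow :)"))  # {'pitch_shift': 2, ...}
```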
20 Claims
1. A computer-implemented method comprising:
receiving, by a computing system, data comprising text and a plurality of emoticons;
performing, by the computing system, a text-to-speech conversion of the data, wherein the text-to-speech conversion of the data further comprises:
determining, by the computing system, a local expressivity corresponding to a group of emoticons of the plurality of emoticons based on a calculation of boundaries of the text, wherein each emoticon of the group of emoticons is located in proximity to a phrase associated with the text within the boundaries that each emoticon is associated with and wherein the local expressivity is associated with a first audio intensity level;
determining, by the computing system, a global expressivity for the data, wherein the global expressivity corresponds to a global multiplier determined after parsing an entire text without the boundaries and the global multiplier modifies the first audio intensity level;
determining, by the computing system, a second audio intensity level associated with the global expressivity; and
generating, by the computing system and based on the modified first audio intensity level and the second audio intensity level, an audible signal representative of the text-to-speech conversion of the data.
View Dependent Claims (2, 3, 4, 5, 6, 7)
8. A system comprising:
at least one processor; and
a memory storing instructions that when executed by the at least one processor cause the system to convert text to speech by configuring the system to:
receive data comprising text and a plurality of emoticons;
determine a local expressivity corresponding to a group of emoticons of the plurality of emoticons based on a calculation of boundaries of the text, wherein the group of emoticons is located in proximity to a phrase of the text within the boundaries;
determine, based on the local expressivity, a first audio intensity level;
determine a global expressivity for the data, wherein the global expressivity corresponds to a global multiplier determined after parsing an entire text without the boundaries and the global multiplier modifies the first audio intensity level;
determine a second audio intensity level associated with the global expressivity; and
generate, based on the modified first audio intensity level and the second audio intensity level, an audible signal representing a text-to-speech conversion of the data.
View Dependent Claims (9, 10, 11, 12, 13, 14)
15. One or more non-transitory computer-readable media having instructions stored thereon that when executed by one or more computers cause the one or more computers to convert text to speech by configuring the one or more computers to:
receive data comprising text and a plurality of emoticons;
determine a local expressivity corresponding to a group of emoticons of the plurality of emoticons based on a calculation of boundaries of the text, wherein each emoticon of the group of emoticons is located in proximity to a phrase of the text within the boundaries;
determine, based on the local expressivity, a first audio intensity level;
determine a global expressivity for the data, wherein the global expressivity corresponds to a global multiplier determined after parsing an entire text without the boundaries and the global multiplier modifies the first audio intensity level;
determine a second audio intensity level associated with the global expressivity; and
generate, based on the modified first audio intensity level and the second audio intensity level, an audible signal representative of text-to-speech conversion of the data.
View Dependent Claims (16, 17, 18, 19, 20)
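Independent claims 1, 8, and 15 all end with the same generating step: an audible signal produced from the modified first intensity level and the second intensity level. The fragment below, again only a sketch with a stand-in synthesizer rather than a real TTS engine, shows one way those levels could be applied as per-phrase gain on synthesized audio; synthesize and generate_signal are hypothetical names.

```python
import numpy as np

SAMPLE_RATE = 16_000

def synthesize(phrase, seconds=0.5):
    """Stand-in for a real TTS engine: a 220 Hz tone in place of spoken audio."""
    t = np.linspace(0, seconds, int(SAMPLE_RATE * seconds), endpoint=False)
    return 0.2 * np.sin(2 * np.pi * 220 * t)

def generate_signal(phrases_with_levels, second_level):
    """Apply each phrase's modified first level, plus the global second level, as gain."""
    chunks = [(1.0 + first + second_level) * synthesize(p)
              for p, first in phrases_with_levels]
    return np.concatenate(chunks)  # the claims' 'audible signal' as raw samples

# Levels as produced by a local/global analysis like the earlier sketch.
audio = generate_signal([("Great news :D", 0.66), ("Flight delayed :(", -0.33)], 0.2)
print(audio.shape)
```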
Specification