Parametric adaptation of voice synthesis
Abstract
Software-based systems perform parametric speech synthesis. TTS voice parameters determine the generated speech audio. Voice parameters include gender, age, dialect, donor, arousal, authoritativeness, pitch, range, speech rate, volume, flutter, roughness, breath, frequencies, bandwidths, and relative amplitudes of formants and nasal sounds. The system chooses TTS parameters based on one or more of: user profile attributes including gender, age, and dialect; situational attributes such as location, noise level, and mood; natural language semantic attributes such as domain of conversation, expression type, dimensions of affect, word emphasis and sentence structure; and analysis of target speaker voices. The system chooses TTS parameters to improve listener satisfaction or other desired listener behavior. Choices may be made by specified algorithms defined by code developers, or by machine learning algorithms trained on labeled samples of system performance.
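As an illustration of the developer-specified path the abstract mentions, the following Python sketch maps two situational attributes to a few voice parameters. The `VoiceParams` fields, their value ranges, the thresholds, and the rules themselves are all assumptions made for this example; the abstract names the attributes and parameters but gives no concrete rules.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoiceParams:
    """Hypothetical subset of the TTS voice parameters named in the abstract."""
    arousal: float = 0.5            # assumed scale: 0.0 calm .. 1.0 excited
    authoritativeness: float = 0.5  # assumed scale: 0.0 .. 1.0
    speech_rate: float = 1.0        # assumed multiplier of a nominal rate
    volume: float = 0.5             # assumed scale: 0.0 .. 1.0


def choose_params(noise_level_db: float, mood: str,
                  base: VoiceParams = VoiceParams()) -> VoiceParams:
    """Apply hand-written rules to situational attributes (noise level, mood)."""
    params = base
    if noise_level_db > 70.0:
        # Assumed rule: speak louder and slightly slower in a noisy place.
        params = replace(params,
                         volume=min(1.0, params.volume + 0.3),
                         speech_rate=0.9)
    if mood == "stressed":
        # Assumed rule: use a calmer voice for a stressed listener.
        params = replace(params, arousal=0.2)
    return params


if __name__ == "__main__":
    print(choose_params(noise_level_db=80.0, mood="stressed"))
```

A frozen dataclass keeps each parameter set immutable, so every rule produces a new value rather than changing a shared default.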
25 Claims
1. A non-transitory computer readable medium storing code effective to cause one or more processors to:
- training a machine learning model according to first behaviors of a listener with respect to first synthesized audio signals including speech synthesized according to a plurality of values for a plurality of TTS parameters, the first behaviors including at least one of opening of an application, clicking on an advertisement, reaction time, and purchasing activity;
- apply the machine learning model to determine one or more selected values for the plurality of TTS parameters of a computer-based synthesized voice, the one or more parameters including at least one of level of arousal, authoritativeness, range, flutter, roughness, and breath;
- receive input text;
- synthesize speech audio for the input text, the synthesis depending on the one or more values for the one or more parameters; and
- output the speech audio to the listener.
Dependent claims: 2, 3, 4, 5
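The steps recited in claim 1 (train on listener behavior, select parameter values, synthesize, output) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the claim does not specify a model type, so a table of observed click rates per parameter setting stands in for the machine learning model, the parameter pair `(arousal, breath)` is an arbitrary choice, and `synthesize` is a hypothetical placeholder that returns a tagged string instead of audio.

```python
from collections import defaultdict


def train(observations):
    """Estimate the ad click rate for each parameter setting.

    observations: iterable of (params, clicked) pairs, where params is a
    hashable tuple (arousal, breath) and clicked is a bool.
    Returns a dict mapping params to the observed click rate.
    """
    counts = defaultdict(lambda: [0, 0])  # params -> [clicks, trials]
    for params, clicked in observations:
        counts[params][0] += int(clicked)
        counts[params][1] += 1
    return {p: clicks / trials for p, (clicks, trials) in counts.items()}


def select_params(model):
    """Pick the parameter setting with the highest estimated click rate."""
    return max(model, key=model.get)


def synthesize(text, params):
    """Hypothetical stand-in for a TTS engine; returns a tagged string."""
    arousal, breath = params
    return f"<speech arousal={arousal} breath={breath}>{text}</speech>"


if __name__ == "__main__":
    log = [((0.2, 0.1), False), ((0.2, 0.1), True),
           ((0.8, 0.3), True), ((0.8, 0.3), True)]
    model = train(log)
    print(synthesize("Your order has shipped.", select_params(model)))
```

A production system would more likely use a model that generalizes across parameter values it has not tried; the table is used here only because it makes the train, apply, synthesize sequence easy to follow.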
6. A method for configuring a parameter of a computer-based synthesized voice, the method comprising, by a computing device:
- measuring first behavior of a listener with respect to a plurality of text-to-speech voice parameters by evaluating behavior of the listener in response to first synthesized audio signals that are synthesized according to a plurality of values for the plurality of TTS voice parameters, the plurality of TTS voice parameters including at least one of level of arousal, authoritativeness, range, flutter, roughness, and breath and the first behavior including at least one of opening of an application, clicking on an advertisement, reaction time, and purchasing activity;
- training a machine learning model according to the first behavior of the listener with respect to the first synthesized audio signals;
- applying the machine learning model to determine selected values for the plurality of TTS voice parameters of a computer-based synthesized voice;
- receiving input text;
- synthesizing speech audio for the input text, the synthesizing depending on the selected values for the plurality of TTS voice parameters; and
- output the speech audio to the listener.
Dependent claims: 7, 8, 9, 10
11. A non-transitory computer readable medium storing code effective to cause one or more processors to:
- train a machine learning model to relate a plurality of text-to-speech (TTS) voice parameters and listener situations to observed listener behavior;
- detect a current value of a situational attribute of a listener, the situational attribute being at least one of age of people present, and music that is playing;
- responsive to the value of the situational attribute, determine selected values of the plurality of TTS voice parameters for a computer-based synthesized voice according to the machine learning model, wherein the TTS voice parameter is one of: TTS voice donor, level of arousal, authoritativeness, range, flutter, roughness, and breath;
- synthesize text using the computer-based synthesized voice according to the values of the plurality of TTS voice parameters to obtain speech audio; and
- output the speech audio to the listener.
12. A method of configuring a parameter of a computer-based synthesized voice, the method comprising, by a computing device:
- training a machine learning model to relate a plurality of text-to-speech (TTS) voice parameters and listener situations to observed listener behavior;
- detecting a current value of a situational attribute of a listener, the situational attribute being at least one of age of people present, and music that is playing; and
- responsive to the value of the situational attribute, automatically configuring, according to the machine learning model applied to the value of the situational attribute, a value of a TTS voice parameter of the plurality of TTS voice parameters for the computer-based synthesized voice, wherein the TTS voice parameter is one of: TTS voice donor, level of arousal, authoritativeness, range, flutter, roughness, and breath;
- synthesizing text using the computer-based synthesized voice according to the value of the TTS voice parameter to obtain speech audio; and
- outputting the speech audio to the listener.
Dependent claims: 13, 14, 15, 16, 17, 18
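Claims 11 and 12 condition the parameter choice on a detected situational attribute. One way this could look, assuming the attribute is the age of the person present and the parameter being configured is the TTS voice donor, is a nearest-neighbour lookup over past observations. The donor names, the behavior score, the training rows, and the choice of a k-nearest-neighbour predictor are all assumptions for illustration; the claims do not name a model type or a scoring scheme.

```python
def predict(rows, age, donor, k=2):
    """Predict a behavior score for `donor` at listener age `age`.

    rows: list of (age, donor, score) observations.
    Returns the mean score of the k rows for this donor whose age is
    closest to `age`, or -inf if the donor was never observed.
    """
    nearest = sorted((r for r in rows if r[1] == donor),
                     key=lambda r: abs(r[0] - age))[:k]
    if not nearest:
        return float("-inf")
    return sum(r[2] for r in nearest) / len(nearest)


def configure_donor(rows, age, donors):
    """Choose the voice donor with the highest predicted score for this age."""
    return max(donors, key=lambda d: predict(rows, age, d))


if __name__ == "__main__":
    # Made-up observations: (age of person present, donor, behavior score).
    rows = [
        (6, "storyteller", 0.9), (8, "storyteller", 0.8),
        (35, "storyteller", 0.3), (40, "storyteller", 0.2),
        (6, "newsreader", 0.2), (9, "newsreader", 0.3),
        (35, "newsreader", 0.8), (42, "newsreader", 0.7),
    ]
    donors = ["storyteller", "newsreader"]
    print(configure_donor(rows, 7, donors))
    print(configure_donor(rows, 38, donors))
```

The selected donor would then be passed to the synthesizer in the synthesizing step; that step is omitted here because it depends on the TTS engine in use.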
19. A method of configuring a parameter of a computer-based synthesized voice, the method comprising:
- training a machine learning model to relate a plurality of text-to-speech (TTS) parameters to user behavior and semantic attributes of text synthesized according to first values for the plurality of TTS parameters, the semantic attributes including at least one of a proportion of proper nouns, number of dependent clauses, number of modifiers, emotional charge;
- determining a semantic attribute of a natural language expression; and
- responsive to the semantic attribute, automatically configuring, according to the machine learning model applied to the semantic attribute, a selected value of a TTS voice parameter of the plurality of TTS parameters for the computer-based synthesized voice, wherein the TTS voice parameter is one of: TTS voice donor, level of arousal, authoritativeness, range, flutter, roughness, and breath;
- converting the natural language expression to speech audio according to the value of the TTS voice parameter; and
- outputting the speech audio.
Dependent claims: 20, 21, 22, 23, 24, 25
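The "determining a semantic attribute" step of claim 19 could be approximated as in the Python sketch below, which computes two of the listed attributes with deliberately crude heuristics and then applies a model to configure the `range` parameter. Everything here is an assumption for illustration: treating capitalized non-initial words as proper nouns (this miscounts words such as "I"), counting subordinating words as dependent clauses, the word list itself, and the clamping of the result to 0..1. The model is any callable; the lambda in the demo is a stand-in with invented coefficients, not a trained model. A real system would likely use a part-of-speech tagger or parser.

```python
import re

# Assumed, incomplete list of words that often introduce a dependent clause.
CLAUSE_MARKERS = {"because", "although", "which", "that", "when", "if", "while"}


def semantic_attributes(expression: str) -> dict:
    """Compute rough semantic attributes of a natural language expression."""
    words = re.findall(r"[A-Za-z']+", expression)
    # Heuristic: a capitalized word that is not the first word is a proper noun.
    proper = [w for i, w in enumerate(words) if i > 0 and w[0].isupper()]
    dependent = sum(1 for w in words if w.lower() in CLAUSE_MARKERS)
    return {
        "proper_noun_proportion": len(proper) / len(words) if words else 0.0,
        "dependent_clauses": dependent,
    }


def configure_range(attrs: dict, model) -> float:
    """Apply a model (any callable) to the attributes; clamp to 0.0 .. 1.0."""
    return max(0.0, min(1.0, model(attrs)))


if __name__ == "__main__":
    attrs = semantic_attributes("Call Maria because Paris is far")
    # Stand-in model with invented coefficients, for demonstration only.
    stand_in = lambda a: 0.4 + 0.5 * a["proper_noun_proportion"]
    print(attrs, configure_range(attrs, stand_in))
```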
Specification