Parametric adaptation of voice synthesis

US 10,586,079 B2
Filed: 01/13/2017
Issued: 03/10/2020
Est. Priority Date: 12/23/2016
Status: Active Grant

Abstract

Software-based systems perform parametric speech synthesis. TTS voice parameters determine the generated speech audio. Voice parameters include gender, age, dialect, donor, arousal, authoritativeness, pitch, range, speech rate, volume, flutter, roughness, breath, frequencies, bandwidths, and relative amplitudes of formants and nasal sounds. The system chooses TTS parameters based on one or more of: user profile attributes including gender, age, and dialect; situational attributes such as location, noise level, and mood; natural language semantic attributes such as domain of conversation, expression type, dimensions of affect, word emphasis and sentence structure; and analysis of target speaker voices. The system chooses TTS parameters to improve listener satisfaction or other desired listener behavior. Choices may be made by specified algorithms defined by code developers, or by machine learning algorithms trained on labeled samples of system performance.
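
The abstract notes that choices may be made by specified algorithms defined by code developers. A minimal Python sketch of such a rule-based chooser follows. The `VoiceParams` fields, the normalized 0 to 1 scales, the thresholds, and the `choose_params` name are hypothetical illustrations and are not taken from the patent.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoiceParams:
    # Hypothetical normalized scales in [0, 1]; the patent does not fix units.
    arousal: float = 0.5
    authoritativeness: float = 0.5
    speech_rate: float = 0.5
    volume: float = 0.5


def choose_params(noise_level: float, mood: str,
                  base: VoiceParams = VoiceParams()) -> VoiceParams:
    """Apply developer-specified rules to situational attributes."""
    params = base
    if noise_level > 0.7:      # assumed threshold for a noisy environment
        params = replace(params, volume=0.9, speech_rate=0.4)
    if mood == "stressed":     # assumed mood label
        params = replace(params, arousal=0.3)
    return params
```

A machine-learned chooser, as recited in the claims, would replace the fixed rules with a model trained on observed listener behavior.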

25 Claims

1. A non-transitory computer readable medium storing code effective to cause one or more processors to:
- training a machine learning model according to first behaviors of a listener with respect to first synthesized audio signals including speech synthesized according to a plurality of values for a plurality of TTS parameters, the first behaviors including at least one of opening of an application, clicking on an advertisement, reaction time, and purchasing activity;
  
  apply the machine learning model to determine one or more selected values for the plurality of TTS parameters of a computer-based synthesized voice, the one or more parameters including at least one of level of arousal, authoritativeness, range, flutter, roughness, and breath;
  
  receive input text;
  
  synthesize speech audio for the input text, the synthesis depending on the one or more values for the one or more parameters; and
  
  output the speech audio to the listener.
- - 2. The non-transitory computer readable medium of claim 1, wherein the non-transitory computer readable medium further stores code effective to cause one or more processors to:
    - measure a second behavior responsive to speech synthesized according to the selected values of the plurality of TTS voice parameters;
      
      determine a difference between the first behaviors and the second behavior; and
      
      responsive to the difference, update the machine learning model.
  - 3. The non-transitory computer readable medium of claim 1, wherein the plurality of TTS voice parameters further include at least one of:
    - gender, age, and dialect.
  - 4. The non-transitory computer readable medium of claim 1, wherein the plurality of TTS voice parameters further include at least one of:
    - pitch, speech rate, and volume.
  - 5. The non-transitory computer readable medium of claim 1, wherein the plurality of TTS voice parameters further include at least one of:
    - formant frequency, formant bandwidth, formant amplitude, nasal pole frequency, nasal pole bandwidth, nasal zero frequency, and nasal zero bandwidth.
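
The flow recited in claim 1 (train on listener behaviors, apply the model to select parameter values, synthesize, output) can be sketched roughly as below. This is an illustration under stated assumptions, not the patented implementation: the model here is a simple per-setting mean of a behavior score, and `BehaviorModel`, `synthesize`, and the score encoding (1.0 for a click, 0.0 otherwise) are hypothetical.

```python
from collections import defaultdict


class BehaviorModel:
    """Tracks the mean observed behavior score per candidate parameter setting."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def train(self, observations):
        # observations: iterable of (setting, score). A setting is a hashable
        # tuple of (parameter_name, value) pairs; score encodes the behavior.
        for setting, score in observations:
            self._sum[setting] += score
            self._count[setting] += 1

    def select(self):
        # Pick the setting with the highest mean score seen in training.
        return max(self._count, key=lambda s: self._sum[s] / self._count[s])


def synthesize(text, setting):
    # Placeholder standing in for a real parametric synthesizer (assumption).
    return {"text": text, "params": dict(setting)}
```

For example, after training on click observations for a calm and an excited setting, `model.select()` returns whichever setting drew the higher average score, and that setting is passed to the synthesizer along with the input text.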

6. A method for configuring a parameter of a computer-based synthesized voice, the method comprising, by a computing device:
- measuring first behavior of a listener with respect to a plurality of text-to-speech voice parameters by evaluating behavior of the listener in response to first synthesized audio signals that are synthesized according to a plurality of values for the plurality of TTS voice parameters, the plurality of TTS voice parameters including at least one of level of arousal, authoritativeness, range, flutter, roughness, and breath and the first behavior including at least one of opening of an application, clicking on an advertisement, reaction time, and purchasing activity;
  
  training a machine learning model according to the first behavior of the listener with respect to the first synthesized audio signals;
  
  applying the machine learning model to determine selected values for the plurality of TTS voice parameters of a computer-based synthesized voice;
  
  receiving input text;
  
  synthesizing speech audio for the input text, the synthesizing depending on the selected values for the plurality of TTS voice parameters; and
  
  output the speech audio to the listener.
- - 7. The method of claim 6 further comprising:
    - measuring a second behavior responsive to speech synthesized according to the selected values of the plurality of TTS voice parameters;
      
      determining a difference between the first behavior and the second behavior; and
      
      responsive to the difference, updating the machine learning model.
  - 8. The method of claim 6 wherein the plurality of TTS voice parameters further include at least one of:
    - gender, age, and dialect.
  - 9. The method of claim 6 wherein the plurality of TTS voice parameters further include at least one of:
    - pitch, speech rate, and volume.
  - 10. The method of claim 6 wherein the plurality of TTS voice parameters further include at least one of:
    - formant frequency, formant bandwidth, formant amplitude, nasal pole frequency, nasal pole bandwidth, nasal zero frequency, and nasal zero bandwidth.
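
Claims 2 and 7 recite updating the model in response to a difference between the first and second behaviors, without specifying an update rule. One possible reading, sketched here as an assumption, moves a stored estimate for the selected setting by a fraction of that difference. The function name, the scalar behavior scores, and the `learning_rate` parameter are hypothetical.

```python
def update_estimate(estimates, setting, first_behavior, second_behavior,
                    learning_rate=0.5):
    """Shift the stored estimate for `setting` by a fraction of the difference.

    Returns the difference (second minus first) so callers can log it.
    """
    difference = second_behavior - first_behavior
    # Start from the first behavior if no estimate is stored for this setting.
    current = estimates.get(setting, first_behavior)
    estimates[setting] = current + learning_rate * difference
    return difference
```

A positive difference raises the estimate for the selected values and a negative one lowers it, so later selections drift toward settings that improve the measured behavior.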

11. A non-transitory computer readable medium storing code effective to cause one or more processors to:
- train a machine learning model to relate a plurality of text-to-speech (TTS) voice parameters and listener situations to observed listener behavior;
  
  detect a current value of a situational attribute of a listener, the situational attribute being at least one of age of people present, and music that is playing;
  
  responsive to the value of the situational attribute, determine selected values of the plurality of TTS voice parameters for a computer-based synthesized voice according to the machine learning model, wherein the TTS voice parameter is one of;
  
  TTS voice donor, level of arousal, authoritativeness, range, flutter, roughness, and breath;
  
  synthesize text using the computer-based synthesized voice according to the values of the plurality of TTS voice parameters to obtain speech audio; and
  
  output the speech audio to the listener.

12. A method of configuring a parameter of a computer-based synthesized voice, the method comprising, by a computing device:
- training a machine learning model to relate a plurality of text-to-speech (TTS) voice parameters and listener situations to observed listener behavior;
  
  detecting a current value of a situational attribute of a listener, the situational attribute being at least one of age of people present, and music that is playing; and
  
  responsive to the value of the situational attribute, automatically configuring, according to the machine learning model applied to the value of the situational attribute, a value of a TTS voice parameter of the plurality of TTS voice parameters for the computer-based synthesized voice, wherein the TTS voice parameter is one of;
  
  TTS voice donor, level of arousal, authoritativeness, range, flutter, roughness, and breath;
  
  synthesizing text using the computer-based synthesized voice according to the value of the TTS voice parameter to obtain speech audio; and
  
  outputting the speech audio to the listener.
- - 13. The method of claim 12 further comprising:
    - measuring a behavior of a listener responsive to speech synthesized according to the value of the TTS voice parameter; and
      
      responsive to the behavior of the listener, updating the model.
  - 14. The method of claim 12 further comprising:
    - measuring a first behavior responsive to speech synthesized according to the value of the TTS voice parameter;
      
      automatically configuring a second value of the TTS voice parameter, according to the situational attribute and the machine learning model, the second value of the TTS voice parameter being different from the value of the TTS voice parameter;
      
      measuring a second behavior responsive to speech synthesized according to the second value of the TTS voice parameter;
      
      determining a difference between the first behavior and the second behavior; and
      
      responsive to the difference, updating the machine learning model.
  - 15. The method of claim 12 wherein the situational attribute is one of:
    - location, noise, mood.
  - 16. The method of claim 12, wherein the plurality of TTS voice parameters further include at least one of:
    - gender, age, and dialect.
  - 17. The method of claim 12, wherein the plurality of TTS voice parameters further include at least one of:
    - pitch, speech rate, and volume.
  - 18. The method of claim 12, wherein the plurality of TTS voice parameters further include at least one of:
    - formant frequency, formant bandwidth, formant amplitude, nasal pole frequency, nasal pole bandwidth, nasal zero frequency, and nasal zero bandwidth.
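
Claims 11 and 12 condition the parameter choice on a detected situational attribute, such as music that is playing. A rough sketch, assuming a lookup of mean behavior scores keyed by situation, is shown below. The `SituationalModel` class, the situation labels, and the parameter value labels are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class SituationalModel:
    """Relates (situation, parameter value) pairs to observed behavior scores."""

    def __init__(self):
        # stats[situation][value] = [total_score, sample_count]
        self._stats = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))

    def train(self, samples):
        # samples: iterable of (situation, parameter_value, behavior_score).
        for situation, value, score in samples:
            entry = self._stats[situation][value]
            entry[0] += score
            entry[1] += 1

    def configure(self, situation, default):
        # Fall back to the default when the situation was never observed.
        candidates = self._stats.get(situation)
        if not candidates:
            return default
        return max(candidates,
                   key=lambda v: candidates[v][0] / candidates[v][1])
```

The fallback to a default value is a design choice of this sketch; the claims do not say how an unseen situation should be handled.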

19. A method of configuring a parameter of a computer-based synthesized voice, the method comprising:
- training a machine learning model to relate a plurality of text-to-speech (TTS) parameters to user behavior and semantic attributes of text synthesized according to first values for the plurality of TTS parameters, the semantic attributes including at least one of a proportion of proper nouns, number of dependent clauses, number of modifiers, emotional charge;
  
  determining a semantic attribute of a natural language expression; and
  
  responsive to the semantic attribute, automatically configuring, according to the machine learning model applied to the semantic attribute, a selected value of a TTS voice parameter of the plurality of TTS parameters for the computer-based synthesized voice, wherein the TTS voice parameter is one of;
  
  TTS voice donor, level of arousal, authoritativeness, range, flutter, roughness, and breath;
  
  converting the natural language expression to speech audio according to the value of the TTS voice parameter; and
  
  outputting the speech audio.
- - 20. The method of claim 19 further comprising:
    - measuring a behavior of a listener responsive to speech synthesized according to the value of the TTS voice parameter; and
      
      responsive to the behavior of the listener, updating the machine learning model.
  - 21. The method of claim 19 further comprising:
    - measuring a first behavior responsive to speech synthesized according to the value of the TTS voice parameter;
      
      automatically configuring a second value of the TTS voice parameter, according to the machine learning model applied to the semantic attribute, the second value of the TTS voice parameter being different from the value of the TTS voice parameter;
      
      measuring a second behavior responsive to speech synthesized according to the second value of the TTS voice parameter;
      
      determining a difference between the first behavior and the second behavior; and
      
      responsive to the difference, updating the machine learning model.
  - 22. The method of claim 19 wherein the semantic attribute indicates a domain of discourse.
  - 23. The method of claim 19, wherein configuring the plurality of TTS voice parameters further including at least one of:
    - gender, age, and dialect.
  - 24. The method of claim 19, wherein the plurality of TTS voice parameters further include at least one of:
    - pitch, speech rate, volume.
  - 25. The method of claim 19 wherein the plurality of TTS voice parameters further include at least one of:
    - formant frequency, formant bandwidth, formant amplitude, nasal pole frequency, nasal pole bandwidth, nasal zero frequency, and nasal zero bandwidth.
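
Claim 19 lists semantic attributes such as the proportion of proper nouns and the number of dependent clauses. The patent text quoted here does not say how these are computed; a production system would presumably use a natural language parser. The sketch below uses deliberately crude heuristics as stand-ins, and the function name, the `SUBORDINATORS` word list, and the capitalization rule are all assumptions.

```python
import re

# Assumed list of words that often introduce a dependent clause.
SUBORDINATORS = {"because", "although", "while", "if", "when",
                 "since", "unless", "that", "which"}


def semantic_attributes(expression: str) -> dict:
    """Estimate two semantic attributes of a natural language expression."""
    tokens = re.findall(r"[A-Za-z']+", expression)
    if not tokens:
        return {"proper_noun_proportion": 0.0, "dependent_clauses": 0}
    # Crude heuristic: capitalized tokens after the first count as proper nouns.
    proper = sum(1 for t in tokens[1:] if t[0].isupper())
    # Crude heuristic: each subordinator marks one dependent clause.
    clauses = sum(1 for t in tokens if t.lower() in SUBORDINATORS)
    return {"proper_noun_proportion": proper / len(tokens),
            "dependent_clauses": clauses}
```

The resulting attribute values would then serve as model inputs when configuring a TTS voice parameter for the expression.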

Current Assignee: Soundhound AI IP Holding LLC (SoundHound AI, Inc. (f/k/a Archimedes Tech SPAC Partners Co.)), Soundhound AI IP LLC (SoundHound AI, Inc. (f/k/a Archimedes Tech SPAC Partners Co.))
Original Assignee: SoundHound, Inc. (SoundHound AI, Inc. (f/k/a Archimedes Tech SPAC Partners Co.))
Inventors: Almudafar-Depeyrot, Monika; Mont-Reynaud, Bernard
Primary Examiner(s): Armstrong, Angela A
Application Number: US15/406,213
Publication Number: US 20180182373A1
Time in Patent Office: 1,152 Days

CPC Class Codes
G06F 40/30   Semantic analysis
G10L 13/00   Speech synthesis; Text to s...
G10L 13/0335   Pitch control
G10L 13/04   Details of speech synthesis...
G10L 13/10   Prosody rules derived from ...
