Dynamic text-to-speech provisioning

US 10,074,359 B2
Filed: 11/01/2016
Issued: 09/11/2018
Est. Priority Date: 11/01/2016
Status: Active Grant

Abstract

A dynamic text-to-speech (TTS) process and system are described. In response to receiving a command to provide information to a user, a device retrieves information and determines user and environment attributes including: (i) a distance between the device and the user when the user uttered the query; and (ii) voice features of the user. Based on the user and environment attributes, the device determines a likely mood of the user, and a likely environment in which the user and user device are located. An audio output template matching the likely mood and voice features of the user is selected. The audio output template is also compatible with the environment in which the user and device are located. The retrieved information is converted into an audio signal using the selected audio output template and output by the device.
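
The template selection described in the abstract can be pictured with a short Python sketch. It is illustrative only and not taken from the patent: the `AudioOutputTemplate` fields, the `TEMPLATES` entries, the mood and environment labels, and the ranking rule (environment compatibility first, then mood match, then closest pitch) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioOutputTemplate:
    """Hypothetical template: the field names are assumptions, not the patent's."""
    name: str
    mood: str                # mood the template is meant to match
    environments: frozenset  # environments the template is compatible with
    volume_db: float         # nominal output level
    pitch_hz: float          # nominal pitch of the synthesized voice


# Hypothetical catalogue of templates, invented for this example.
TEMPLATES = [
    AudioOutputTemplate("whisper", "calm", frozenset({"library", "bedroom"}), 40.0, 180.0),
    AudioOutputTemplate("neutral", "calm", frozenset({"home", "office"}), 60.0, 200.0),
    AudioOutputTemplate("upbeat", "happy", frozenset({"home", "car"}), 65.0, 220.0),
    AudioOutputTemplate("loud", "calm", frozenset({"street", "car"}), 75.0, 200.0),
]


def select_template(mood, environment, user_pitch_hz, templates=TEMPLATES):
    """Pick a template compatible with the environment, preferring a mood
    match and then the pitch closest to the user's voice."""
    compatible = [t for t in templates if environment in t.environments]
    if not compatible:
        return None  # nothing is compatible with this environment
    # Tuples sort element by element: a mood mismatch (True) ranks after a
    # mood match (False); ties are broken by pitch distance.
    return min(
        compatible,
        key=lambda t: (t.mood != mood, abs(t.pitch_hz - user_pitch_hz)),
    )
```

For instance, `select_template("happy", "home", 210.0)` returns the "upbeat" template in this toy catalogue, because both "neutral" and "upbeat" are compatible with "home" but only "upbeat" matches the mood.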

18 Claims

1. A computer-implemented method comprising:
- receiving, using one or more microphones, an audio signal from a user associated with a user device;
  
  in response to receiving the audio signal from the user, determining whether an application, from among a plurality of applications in the user device, is configured to provide a text-to-speech response;
  
  training one or more classifiers using training data to identify one or more likely voice features of the user in the audio signal, the one or more likely voice features of the user including a likely tone of the voice of the user;
  
  determining, by one or more processors and based on the audio signal received using the one or more microphones, a proximity indicator indicative of a distance between the user and the user device after determining that the application is configured to provide a text-to-speech response;
  
  obtaining, by the one or more processors, data to be audibly output using a computer-synthesized voice;
  
  selecting, by the one or more processors, a tone of voice of the computer-synthesized voice that corresponds to the likely tone of voice of the user identified by the training of the one or more classifiers, and a volume level of the computer-synthesized voice based on the likely tone of voice of the user and the distance between the user and the user device indicated by the proximity indicator;
  
  generating, by the one or more processors, an audio signal based on (i) the data to be audibly output, (ii) the selected tone of voice that corresponds to the likely tone of voice of the user, and (iii) the selected volume level of the computer-synthesized voice; and
  
  providing, by the one or more processors, the generated audio signal for output by one or more speakers.
  - 2. The method of claim 1, wherein the likely voice features of the user include a likely pitch or frequency of the voice of the user.
  - 3. The method of claim 1, comprising:
    - determining environment attributes; and
      
      determining a type of environment based on the determined environment attributes, wherein the tone of voice of the computer-synthesized voice or the volume level of the computer-synthesized voice is selected further based on the determined type of environment.
  - 4. The method of claim 1, wherein the tone of voice of the computer-synthesized voice is selected to match the likely tone of voice of the user and the volume level of the computer-synthesized voice is selected to match a volume of the user and the distance between the user and the user device indicated by the proximity indicator.
  - 5. The method of claim 1, wherein the tone of voice of the computer-synthesized voice or the volume level of the computer-synthesized voice is further selected based on one or more of:
    - (I) a type of the data to be audibly output, and (II) a type of application used to provide the data to be audibly output.
  - 6. The method of claim 1, wherein determining (i) the voice volume of the user associated with the user device, and (ii) the proximity indicator indicative of the distance between the user and the user device comprises:
    - obtaining audio signal data from a first microphone;
      
      obtaining audio signal data from a second microphone;
      
      obtaining sensor data from one or more sensors; and
      
      determining a likely location and a likely distance of the user based on the sensor data, audio signal data from the first microphone, and the audio signal data from the second microphone.
  - 7. The method of claim 1, wherein:
    - the one or more voice features include word enunciation and oratory style of the user; and
      
      the training data includes one or more of a pitch, tone, range of frequency, amplitude values, and voice samples associated with particular voice models.
  - 8. The method of claim 1, further comprising:
    - receiving a second audio signal;
      
      identifying one or more voice features in the second audio signal;
      
      determining that the identified one or more voice features in the second audio signal do not match one or more voice features associated with the user; and
      
      providing, for output by the one or more speakers, a query message requesting the user to confirm an instruction included in the second audio signal.
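
The volume selection recited in claims 1 and 4 (a level based on the user's likely tone of voice and the distance indicated by the proximity indicator) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the tone labels, the base levels in `TONE_BASE_DB`, the clamping limits, and the use of free-field attenuation (about 6 dB per doubling of distance) to compensate for distance are all hypothetical choices.

```python
import math

# Hypothetical base output levels (dB) per detected user tone.
TONE_BASE_DB = {"whisper": 40.0, "normal": 60.0, "shout": 70.0}


def select_volume_db(tone, distance_m, ref_distance_m=1.0,
                     min_db=30.0, max_db=85.0):
    """Return an output level that follows the user's tone and compensates
    for distance, clamped to a safe range.

    Assumes free-field propagation, where the level drops by
    20*log10(d / d_ref) dB relative to the reference distance.
    """
    base = TONE_BASE_DB.get(tone, TONE_BASE_DB["normal"])
    # Guard against zero or negative distances before taking the logarithm.
    compensation = 20.0 * math.log10(max(distance_m, 0.1) / ref_distance_m)
    return min(max_db, max(min_db, base + compensation))
```

Under these assumptions, a user speaking normally at 2 m gets roughly 66 dB (60 dB plus about 6 dB of compensation), while a whispering user very close to the device is held at the 30 dB floor.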

9. One or more non-transitory computer-readable storage media comprising instructions, which, when executed by one or more processors, cause the one or more processors to perform operations comprising:
- receiving, using one or more microphones, an audio signal from a user associated with a user device;
  
  in response to receiving the audio signal from the user, determining whether an application, from among a plurality of applications in the user device, is configured to provide a text-to-speech response;
  
  training one or more classifiers using training data to identify one or more likely voice features of the user in the audio signal, the one or more likely voice features of the user including a likely tone of the voice of the user;
  
  determining, by one or more processors and based on the audio signal received using the one or more microphones, a proximity indicator indicative of a distance between the user and the user device after determining that the application is configured to provide a text-to-speech response;
  
  obtaining, by the one or more processors, data to be audibly output using a computer-synthesized voice;
  
  selecting, by the one or more processors, a tone of voice of the computer-synthesized voice that corresponds to the likely tone of voice of the user identified by the training of the one or more classifiers, and a volume level of the computer-synthesized voice based on the likely tone of voice of the user and the distance between the user and the user device indicated by the proximity indicator;
  
  generating, by the one or more processors, an audio signal based on (i) the data to be audibly output, (ii) the selected tone of voice that corresponds to the likely tone of voice of the user, and (iii) the selected volume level of the computer-synthesized voice; and
  
  providing, by the one or more processors, the generated audio signal for output by one or more speakers.
  - 10. The media of claim 9, wherein the likely voice features of the user include a likely pitch or frequency of the voice of the user.
  - 11. The media of claim 9, wherein the operations comprise:
    - determining environment attributes; and
      
      determining a type of environment based on the determined environment attributes, wherein the tone of voice of the computer-synthesized voice or the volume level of the computer-synthesized voice is selected further based on the determined type of environment.
  - 12. The media of claim 9, wherein the tone of voice of the computer-synthesized voice is selected to match the likely tone of voice of the user and the volume level of the computer-synthesized voice is selected to match a volume of the user and the distance between the user and the user device indicated by the proximity indicator.
  - 13. The media of claim 9, wherein the tone of voice of the computer-synthesized voice or the volume level of the computer-synthesized voice is further selected based on one or more of:
    - (I) a type of the data to be audibly output, and (II) a type of application used to provide the data to be audibly output.
  - 14. The media of claim 9, wherein determining (i) the voice volume of the user associated with the user device, and (ii) the proximity indicator indicative of the distance between the user and the user device comprises:
    - obtaining audio signal data from a first microphone;
      
      obtaining audio signal data from a second microphone;
      
      obtaining sensor data from one or more sensors; and
      
      determining a likely location and a likely distance of the user based on the sensor data, audio signal data from the first microphone, and the audio signal data from the second microphone.

15. A system comprising:
- one or more processors and one or more storage devices storing instructions which, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
  
  receiving, using one or more microphones, an audio signal from a user associated with a user device;
  
  in response to receiving the audio signal from the user, determining whether an application, from among a plurality of applications in the user device, is configured to provide a text-to-speech response;
  
  training one or more classifiers using training data to identify one or more likely voice features of the user in the audio signal, the one or more likely voice features of the user including a likely tone of the voice of the user;
  
  determining, based on the audio signal received using the one or more microphones, a proximity indicator indicative of a distance between the user and the user device, after determining that the application is configured to provide a text-to-speech response;
  
  obtaining data to be audibly output using a computer-synthesized voice;
  
  selecting a tone of voice of the computer-synthesized voice that corresponds to the likely tone of voice of the user identified by the training of the one or more classifiers, and a volume level of the computer-synthesized voice based on the likely tone of voice of the user and the distance between the user and the user device indicated by the proximity indicator;
  
  generating an audio signal based on (i) the data to be audibly output, (ii) the selected tone of voice that corresponds to the likely tone of voice of the user, and (iii) the selected volume level of the computer-synthesized voice; and
  
  providing the generated audio signal for output by one or more speakers.
  - 16. The system of claim 15, wherein the likely voice features of the user include a likely pitch or frequency of the voice of the user.
  - 17. The system of claim 15, wherein the tone of voice of the computer-synthesized voice or the volume level of the computer-synthesized voice is selected based on one or more of:
    - (I) a type of the data to be output, and (II) a type of application used to provide the data to be audibly output.
  - 18. The system of claim 15, wherein determining (i) the voice volume of the user associated with the user device, and (ii) the proximity indicator indicative of the distance between the user and the user device comprises:
    - obtaining audio signal data from a first microphone;
      
      obtaining audio signal data from a second microphone;
      
      obtaining sensor data from one or more sensors; and
      
      determining a likely location and a likely distance of the user based on the sensor data, audio signal data from the first microphone, and the audio signal data from the second microphone.
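
Claims 6, 14, and 18 recite deriving a likely distance from two microphones plus sensor data without fixing a particular technique. One simple possibility, shown purely as a sketch, compares signal levels at the two microphones. It assumes a point source in a free field, microphones lying on the line toward the speaker, and amplitude falling off as 1/distance; the function names and the `mic_spacing_m` parameter are hypothetical, and the sensor data mentioned in the claims is left out.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_distance_m(near_mic, far_mic, mic_spacing_m):
    """Estimate the source distance to the nearer microphone.

    With amplitude proportional to 1/distance, the level ratio is
    r = (d + spacing) / d, which gives d = spacing / (r - 1).
    Returns None when the ratio cannot be resolved.
    """
    far_level = rms(far_mic)
    if far_level == 0.0:
        return None  # no signal at the far microphone
    ratio = rms(near_mic) / far_level
    if ratio <= 1.0:
        return None  # source is not measurably closer to the near microphone
    return mic_spacing_m / (ratio - 1.0)
```

In practice reverberation and noise make a pure level ratio unreliable, so this would be only one input among several (for example time-of-arrival differences or the sensor data the claims mention).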

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google LLC (Alphabet Inc.)
Inventors: Silveira Ocampo, Juan Jose
Primary Examiner(s): JACKSON, JAKIEDA R
Application Number: US15/340,319
Publication Number: US 20180122361A1
Time in Patent Office: 679 Days
Field of Search: 740260, 740207, 740205, 740226, 740233, 740270

CPC Class Codes:
G10L 13/033   Voice editing, e.g. manipul...
G10L 13/0335   Pitch control
G10L 15/02   Feature extraction for spee...
G10L 15/22   Procedures used during a sp...
G10L 21/0364   for improving intelligibility
G10L 25/48   specially adapted for parti...
G10L 25/63   for estimating an emotional...