Generating prosodic contours for synthesized speech

US 9,093,067 B1
Filed: 11/26/2012
Issued: 07/28/2015
Est. Priority Date: 11/14/2008
Status: Active Grant

First Claim

Patent Images

1. A method implemented by a system of one or more computers, comprising:

receiving, by the system of one or more computers, speech utterances encoded in audio data and a transcript having text that represents the speech utterances;

extracting, by the system of one or more computers, prosodic contours from the utterances;

extracting, by the system of one or more computers and from the transcript, attributes of text associated with the utterances;

for pairs of utterances from the speech utterances, determining, by the system of one or more computers, distances between attributes of text associated with the pairs of utterances;

for the pairs of utterances from the speech utterances, determining, by the system of one or more computers, distances between prosodic contours for the pairs of utterances;

generating, by the system of one or more computers, a model based on the determined distances for the attributes and the prosodic contours, the model adapted to estimate a distance between a determined prosodic contour for a received utterance and a prosodic contour for a synthesized utterance when given a distance between an attribute of text associated with the received utterance and an attribute of text associated with the synthesized utterance; and

storing, by the system of one or more computers, the model in a computer-readable memory device.

View all claims

2 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

The subject matter of this specification can be implemented in a computer-implemented method that includes receiving utterances and transcripts thereof. The method includes analyzing the utterances and transcripts to determine certain attributes, such as distances between prosodic contours for pairs of utterances. A model can be generated that can be used to estimate a distance between a determined prosodic contour for a received utterance and an unknown prosodic contour for a synthesized utterance when given a distance between attributes for text associated with the received utterance and the synthesized utterance.

Citations

21 Claims

1. A method implemented by a system of one or more computers, comprising:
- receiving, by the system of one or more computers, speech utterances encoded in audio data and a transcript having text that represents the speech utterances;
  
  extracting, by the system of one or more computers, prosodic contours from the utterances;
  
  extracting, by the system of one or more computers and from the transcript, attributes of text associated with the utterances;
  
  for pairs of utterances from the speech utterances, determining, by the system of one or more computers, distances between attributes of text associated with the pairs of utterances;
  
  for the pairs of utterances from the speech utterances, determining, by the system of one or more computers, distances between prosodic contours for the pairs of utterances;
  
  generating, by the system of one or more computers, a model based on the determined distances for the attributes and the prosodic contours, the model adapted to estimate a distance between a determined prosodic contour for a received utterance and a prosodic contour for a synthesized utterance when given a distance between an attribute of text associated with the received utterance and an attribute of text associated with the synthesized utterance; and
  
  storing, by the system of one or more computers, the model in a computer-readable memory device.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
- - 2. The method of claim 1, further comprising modifying the extracted prosodic contours at a time previous to determining the distances between the extracted prosodic contours.
  - 3. The method of claim 1, wherein extracting the prosodic contours from the utterances comprises generating for each prosodic contour time-value pairs that comprise a prosodic contour value and a time at which the prosodic contour value occurs.
  - 4. The method of claim 1, wherein the extracted prosodic contours comprise fundamental frequencies, pitches, energy measurements, gain measurements, duration measurements, intensity measurements, measurements of rate of speech, or spectral tilt measurements.
  - 5. The method of claim 1, wherein the extracted attributes comprise exact stress patterns, canonical stress patterns, parts of speech, phone representations, phoneme representations, or indications of declaration versus question versus exclamation.
  - 6. The method of claim 1, further comprising aligning the utterances in the audio data with text, from the transcripts, that represents the utterances to determine which speech utterances are associated with which text.
  - 7. The method of claim 1, wherein generating the model comprises mapping the distances between the attributes of text associated with the pairs of utterances to the distances between the prosodic contours for the pairs of utterances in order to determine a relationship between the distances associated with the attributes of the text and the distances associated with the prosodic contours for pairs of utterances.
  - 8. The method of claim 1, wherein the distances between the prosodic contours are calculated using a root mean square difference calculation.
  - 9. The method of claim 1, wherein the model is created using a linear regression of the distances between the prosodic contours and the distances between the transcripts.
  - 10. The method of claim 1, further comprising selecting pairs of utterances for use in determining distances based on whether the utterances have canonical stress patterns that match.
  - 11. The method of claim 1, comprising creating multiple models, including the model, where each of the models has a different canonical stress pattern.
  - 12. The method of claim 1, further comprising selecting, based on estimated distances between a plurality of determined prosodic contours and a prosodic contour of text to be synthesized, a final determined prosodic contour associated with a smallest distance.
  - 13. The method of claim 12, further comprising generating a prosodic contour for the text to be synthesized using the final determined prosodic contour.
  - 14. The method of claim 13, further comprising outputting the generated prosodic contour and the text to be synthesized to a speech-to-text engine for speech synthesis.

15. A computer-implemented system, comprising:
- one or more computers having;
  
  an interface to receive speech utterances encoded in audio data and a transcript having text that represents the speech utterances;
  
  a prosodic contour extractor to extract prosodic contours from the utterances;
  
  a transcript analyzer to extract attributes of text associated with the utterances;
  
  an attribute comparer to determine, for pairs of utterances from the speech utterances, distances between attributes of text associated with the pairs of utterances;
  
  a prosodic contour comparer to determine, for the pairs of utterances from the speech utterances, distances between prosodic contours for the pairs of utterances;
  
  a model generator programmed to generate a model based on the determined distances for the attributes and the prosodic contours, the model adapted to estimate a distance between a determined prosodic contour for a received utterance and a prosodic contour for a synthesized utterance when given a distance between an attribute of text associated with the received utterance and an attribute of text associated with the synthesized utterance; and
  
  a computer-readable memory device associated with the one or more computers to store the model.
- View Dependent Claims (16, 17, 18, 19, 20, 21)
- - 16. The system of claim 15, wherein the system is further programmed to modify the extracted prosodic contours at a time previous to determining the distances between the extracted prosodic contours.
  - 17. The system of claim 15, wherein extracting the prosodic contours from the utterances comprises generating for each prosodic contour time-value pairs that comprise a prosodic contour value and a time at which the prosodic contour value occurs.
  - 18. The system of claim 15, wherein the extracted prosodic contours comprise fundamental frequencies, pitches, energy measurements, gain measurements, duration measurements, intensity measurements, measurements of rate of speech, or spectral tilt measurements.
  - 19. The system of claim 15, wherein the extracted attributes comprise exact stress patterns, canonical stress patterns, parts of speech, phone representations, phoneme representations, or indications of declaration versus question versus exclamation.
  - 20. The system of claim 15, wherein the system is further programmed to align the utterances in the audio data with text, from the transcripts, that represents the utterances to determine which speech utterances are associated with which text.
  - 21. The system of claim 15, wherein generating the model comprises mapping the distances between the attributes of text associated with the pairs of utterances to the distances between the prosodic contours for the pairs of utterances in order to determine a relationship between the distances associated with the attributes of the text and the distances associated with the prosodic contours for pairs of utterances.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Google LLC (Alphabet Inc.)
Original Assignee
Google Inc. (Alphabet Inc.)
Inventors
Jansche, Martin, Riley, Michael D., Rosenberg, Andrew M., Tai, Terry
Primary Examiner(s)
Han, Qi

Application Number

US13/685,228
Time in Patent Office

974 Days
Field of Search

704/260, 704/258, 704/261, 704/263, 704/264
US Class Current

1/1
CPC Class Codes

G10L 13/027 Concept to speech synthesis...

G10L 13/10 Prosody rules derived from ...

Generating prosodic contours for synthesized speech

First Claim

2 Assignments

0 Petitions

Accused Products

Abstract

Citations

21 Claims

Specification

Solutions

Use Cases

Quick Links

Generating prosodic contours for synthesized speech

First Claim

2 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

21 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links