Generating prosodic contours for synthesized speech

US 8,321,225 B1
Filed: 11/14/2008
Issued: 11/27/2012
Est. Priority Date: 11/14/2008
Status: Expired due to Fees

Abstract

The subject matter of this specification can be implemented in, among other things, a computer-implemented method including receiving text to be synthesized as a spoken utterance. The method includes analyzing the received text to determine attributes of the received text and selecting one or more utterances from a database based on a comparison between the attributes of the received text and attributes of text representing the stored utterances. The method includes determining, for each utterance, a distance between a contour of the utterance and a hypothetical contour of the spoken utterance, the determination based on a model that relates distances between pairs of contours of the utterances to relationships between attributes of text for the pairs. The method includes selecting a final utterance having a contour with a closest distance to the hypothetical contour and generating a contour for the received text based on the contour of the final utterance.
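
The steps in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the patented implementation: text attributes are reduced to a lexical stress string ("1" = stressed, "0" = unstressed), contours are lists of pitch values already resampled to a common length, the text relationship is an edit distance, the contour distance is a root mean square difference, and the model is a simple least-squares line fitted over all pairs of stored utterances. The names `stress_edit_distance`, `rms_difference`, `fit_model`, and `pick_contour` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def stress_edit_distance(a, b):
    """Levenshtein distance between two stress-pattern strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def rms_difference(c1, c2):
    """Root mean square difference between two equal-length contours."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(c1, c2)) / len(c1))


def fit_model(stored):
    """Fit contour_distance ~ slope * text_distance + intercept over all pairs."""
    pairs = [(stress_edit_distance(a["stress"], b["stress"]),
              rms_difference(a["contour"], b["contour"]))
             for a, b in combinations(stored, 2)]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = cov / var_x if var_x else 0.0
    return slope, mean_y - slope * mean_x


def pick_contour(stress, stored, model):
    """Return the contour of the utterance with the smallest estimated distance."""
    slope, intercept = model

    def estimated_distance(utterance):
        text_distance = stress_edit_distance(stress, utterance["stress"])
        return slope * text_distance + intercept

    return min(stored, key=estimated_distance)["contour"]


# Hypothetical stored utterances (assumed data, for illustration only).
stored = [
    {"stress": "10", "contour": [120, 110, 100, 90]},
    {"stress": "101", "contour": [120, 100, 120, 100]},
    {"stress": "0101", "contour": [90, 120, 90, 120]},
]
model = fit_model(stored)
contour = pick_contour("1010", stored, model)
```

Because the contour of the input text does not exist yet, the sketch estimates its distance to each stored contour from the text distance alone, which is the role the abstract gives to the model.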

34 Claims

1. A method implemented by a system of one or more computers, comprising:
- receiving, at the system, text to be synthesized as a spoken utterance;
  
  analyzing, by the system, the received text to determine attributes of the received text;
  
  selecting, by the system, one or more candidate utterances from a database of stored utterances based on a comparison between the determined attributes of the received text and corresponding attributes of text representing the stored utterances;
  
  determining, by the system for each candidate utterance, a distance between a prosodic contour of the candidate utterance and a hypothetical prosodic contour of the spoken utterance to be synthesized, the determination based on a model that relates a) distances between prosodic contours of pairs of the stored utterances to b) relationships between attributes of text of each of the respective pairs, wherein the model is embodied by information including, for each of the stored utterances:
  
  a prosodic contour of the respective stored utterance, one or more attributes of text of the respective stored utterance, and first data relating a difference between the prosodic contour of the respective stored utterance to the prosodic contour of a second stored utterance to a difference between a first attribute of the text of the respective stored utterance and the first attribute of the text of the second stored utterance, second data relating a difference between the prosodic contour of the respective stored utterance to the prosodic contour of a third stored utterance to a difference between the first attribute of the text of the respective stored utterance and the first attribute of the text of the third stored utterance, wherein the second stored utterance and the third stored utterance are in the stored utterances, and wherein prosodic contours represent prosodic characteristics of speech at different times;
  
  selecting, by the system, a final candidate utterance having a prosodic contour with a closest distance to the hypothetical prosodic contour; and
  
  generating, by the system, a prosodic contour for the text to be synthesized based on the contour of the final candidate utterance.
  - 2. The method of claim 1, wherein the relationships between attributes of text for the pairs include an edit distance between each of the pairs.
  - 3. The method of claim 1, further comprising selecting, by the system, a plurality of final candidate utterances having distances that satisfy a threshold and generating the prosodic contour for the text to be synthesized based on a combination of the prosodic contours of the plurality of final candidate utterances.
  - 4. The method of claim 1, further comprising selecting, by the system, k final candidate utterances having the closest distances and generating the prosodic contour for the text to be synthesized based on a combination of the prosodic contours of the k final candidate utterances, wherein k represents a positive integer.
  - 5. The method of claim 4, wherein the k final candidate utterances are combined by averaging the prosodic contours of the k final candidate utterances.
  - 6. The method of claim 4, further comprising rescaling and warping, by the system, the prosodic contour generated from the combination to match the received text to be synthesized as the spoken utterance.
  - 7. The method of claim 1, wherein the determined attributes of the received text include an aggregate attribute.
  - 8. The method of claim 7, wherein the aggregate attribute includes a number of stressed syllables in the received text.
  - 9. The method of claim 1, further comprising aligning, by the system, the generated prosodic contour with the received text to be synthesized.
  - 10. The method of claim 9, further comprising outputting, from the system, the received text to be synthesized with the aligned generated prosodic contour to a text-to-speech engine for speech synthesis.
  - 11. The method of claim 9, wherein aligning the generated prosodic contour includes rescaling an unstressed portion of the generated prosodic contour to a longer or a shorter length.
  - 12. The method of claim 9, wherein aligning the generated prosodic contour includes removing an unstressed portion from the generated prosodic contour.
  - 13. The method of claim 9, wherein aligning the generated prosodic contour includes adding an unstressed portion to the generated prosodic contour.
  - 14. The method of claim 1, wherein the determined attributes of the received text include an indication of whether or not the received text begins with a stressed portion.
  - 15. The method of claim 1, wherein the determined attributes of the received text include an indication of whether or not the received text ends with a stressed portion.
  - 16. The method of claim 1, wherein selecting the one or more candidate utterances includes selecting utterances from the database that have lexical stress patterns that substantially match lexical stress patterns of the received text.
  - 17. The method of claim 16, wherein the lexical stress patterns comprise exact lexical stress patterns or canonical lexical stress patterns.
  - 18. The method of claim 1, wherein the model embodies relationships of a) root mean square differences between prosodic contours of pairs of the stored utterances to b) the relationships between the attributes of text for the respective pairs.
  - 19. The method of claim 1, wherein the model embodies relationships of a) root mean square differences between pitch values of prosodic contours of pairs of the stored utterances to b) the relationships between the attributes of text for the respective pairs.
  - 20. The method of claim 1, wherein the model embodies relationships between all prosodic contours in the database of stored utterances and the relationships between the attributes of text of the respective pairs.
  - 21. The method of claim 1, wherein the model embodies relationships between a random sample of prosodic contours in the database of stored utterances and the relationships between the attributes of text of the respective pairs in the random sample.
  - 22. The method of claim 1, wherein the model embodies relationships between a sample of the most frequently used prosodic contours in the database of stored utterances and the relationships between the attributes of text of the respective pairs in the sample.
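
Claims 3 to 5 describe combining several final candidates instead of using a single one. The following is a minimal sketch under stated assumptions: each candidate arrives as a hypothetical `(estimated_distance, contour)` pair, all contours have the same length, "satisfy a threshold" is taken to mean "at or below a maximum distance", and the combination is the pointwise average named in claim 5. The function names are illustrative only.

```python
def average_contours(contours):
    """Pointwise mean of equal-length contours."""
    return [sum(values) / len(values) for values in zip(*contours)]


def combine_k_closest(scored, k):
    """Average the contours of the k candidates with the smallest distances."""
    closest = sorted(scored, key=lambda item: item[0])[:k]
    return average_contours([contour for _, contour in closest])


def combine_within_threshold(scored, max_distance):
    """Average every contour whose distance is at or below max_distance.

    Returns None when no candidate qualifies (an assumed convention).
    """
    kept = [contour for distance, contour in scored if distance <= max_distance]
    return average_contours(kept) if kept else None


# Hypothetical scored candidates: (estimated distance, contour).
scored = [(3.0, [100, 120]), (1.0, [110, 100]), (2.0, [130, 140])]
two_closest = combine_k_closest(scored, 2)
within = combine_within_threshold(scored, 1.5)
```

With `k = 1` the first function reduces to the single final candidate of claim 1.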

23. A computer-implemented system comprising:
- one or more computers having:
  
  an interface to receive text to be synthesized as a spoken utterance;
  
  a text analyzer to analyze the received text to determine attributes of the received text;
  
  a candidate identifier to select one or more candidate utterances from a database of stored utterances based on a comparison between the determined attributes of the received text and corresponding attributes of text representing the stored utterances;
  
  means for determining a distance between a prosodic contour of a candidate utterance and a hypothetical prosodic contour of the spoken utterance to be synthesized, the determination based on a model that relates a) distances between prosodic contours of pairs of the stored utterances to b) distances between attributes of text of each of the respective pairs and for selecting a final candidate utterance having a prosodic contour with a closest distance to the hypothetical prosodic contour, wherein prosodic contours represent prosodic characteristics of speech at different times; and
  
  a prosodic contour aligner to generate a prosodic contour for the text to be synthesized based on the prosodic contour of the final candidate utterance;
  
  wherein the system further comprises a memory for storing data for access by the means for determining the distance, the memory comprising information embodying the model used by the means for determining the distance, the information including, for each of the stored utterances:
  
  a prosodic contour of the respective stored utterance, one or more attributes of text of the respective stored utterance, and first data relating a difference between the prosodic contour of the respective stored utterance to the prosodic contour of a second stored utterance to a difference between a first attribute of the text of the respective stored utterance and the first attribute of the text of the second stored utterance, and second data relating a difference between the prosodic contour of the respective stored utterance to the prosodic contour of a third stored utterance to a difference between the first attribute of the text of the respective stored utterance and the first attribute of the text of the third stored utterance, wherein the second stored utterance and the third stored utterance are in the stored utterances.
  - 24. The system of claim 23, wherein the system is programmed to select a plurality of final candidate utterances that have distances that satisfy a threshold and to generate the prosodic contour for the text to be synthesized based on a combination of the prosodic contours of the plurality of final candidate utterances.
  - 25. The system of claim 23, wherein the system is programmed to select k final candidate utterances that have the closest distances and to generate the prosodic contour for the text to be synthesized based on a combination of the prosodic contours of the k final candidate utterances, wherein k represents a positive integer.
  - 26. The system of claim 23, wherein the system is further programmed to align the generated prosodic contour with the received text to be synthesized.
  - 27. The system of claim 26, wherein aligning the generated prosodic contour includes rescaling an unstressed portion of the generated prosodic contour to a longer or a shorter length.
  - 28. The system of claim 23, wherein selecting the one or more candidate utterances includes selecting utterances from the database that have lexical stress patterns that substantially match lexical stress patterns of the received text.
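
Claim 27 covers aligning the generated contour by rescaling an unstressed portion to a longer or shorter length. One way this could look is sketched below. The segment representation (a list of `(is_stressed, points)` pairs), the use of linear interpolation, and the names `resample` and `align` are all assumptions; the claims do not specify an interpolation method.

```python
def resample(segment, new_len):
    """Linearly interpolate a non-empty contour segment to new_len points."""
    if new_len <= 0:
        return []
    if new_len == 1 or len(segment) == 1:
        return [segment[0]] * new_len
    step = (len(segment) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(segment) - 1)
        frac = pos - lo
        out.append(segment[lo] * (1 - frac) + segment[hi] * frac)
    return out


def align(segments, unstressed_lengths):
    """Rescale each unstressed segment to its target length, in order.

    Stressed segments are copied unchanged.
    """
    targets = iter(unstressed_lengths)
    out = []
    for is_stressed, points in segments:
        out.extend(points if is_stressed else resample(points, next(targets)))
    return out


# Hypothetical contour: stressed, unstressed, stressed.
segments = [(True, [200, 210]), (False, [100, 110, 120]), (True, [190])]
stretched = align(segments, [5])  # lengthen the unstressed portion
squeezed = align(segments, [2])   # shorten the unstressed portion
```

Only the unstressed portion changes length here, so the stressed pitch values of the selected contour are preserved.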

29. A computer-implemented system comprising:
- a computer interface arranged to receive text to be synthesized as a spoken utterance;
  
  a text analyzer to analyze the received text to determine attributes of the received text;
  
  a candidate identifier to select one or more candidate utterances from a database of stored utterances based on a comparison between the determined attributes of the received text and corresponding attributes of text representing the stored utterances;
  
  a candidate selector to determine distances between respective prosodic contours of a candidate utterance and the spoken utterance using a model that relates a) distances between respective prosodic contours of pairs of the stored utterances to b) distances between attributes of text of each of the respective pairs, and to select a final candidate utterance based on the determined distances; and
  
  a memory for storing data for access by the candidate selector, the memory comprising information embodying the model used by the candidate selector, the information including, for each of the stored utterances:
  
  a prosodic contour of the respective stored utterance, one or more attributes of text of the respective stored utterance, and first data relating a difference between the prosodic contour of the respective stored utterance to the prosodic contour of a second stored utterance to a difference between a first attribute of the text of the respective stored utterance and the first attribute of the text of the second stored utterance, second data relating a difference between the prosodic contour of the respective stored utterance to the prosodic contour of a third stored utterance to a difference between the first attribute of the text of the respective stored utterance and the first attribute of the text of the third stored utterance, wherein the second stored utterance and the third stored utterance are in the stored utterances, wherein prosodic contours represent prosodic characteristics of speech at different times.
  - 30. The system of claim 29, further comprising a prosodic contour aligner to generate a prosodic contour for the text to be synthesized based on the prosodic contour of the final candidate utterance.
  - 31. The system of claim 30, wherein aligning the generated prosodic contour includes rescaling an unstressed portion of the generated prosodic contour to a longer or a shorter length.
  - 32. The system of claim 29, wherein the candidate selector is programmed to (a) select a plurality of final candidate utterances that have distances that satisfy a threshold, and (b) generate the prosodic contour for the text to be synthesized based on a combination of the prosodic contours of the plurality of final candidate utterances.
  - 33. The system of claim 29, wherein the candidate selector is programmed to select k final candidate utterances that have the closest distances and to generate the prosodic contour for the text to be synthesized based on a combination of the prosodic contours of the k final candidate utterances, wherein k represents a positive integer.
  - 34. The system of claim 29, wherein selecting the one or more candidate utterances includes selecting utterances from the database that have lexical stress patterns that substantially match lexical stress patterns of the received text.
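
Claim 34 has the candidate identifier select stored utterances whose lexical stress patterns substantially match those of the received text. The claims do not define "substantially match", so the sketch below makes a loud assumption: two patterns match when they share the aggregate attributes named in claims 8, 14, and 15 (number of stressed syllables, stressed start, stressed end). The stress-string encoding and the names `text_attributes` and `select_candidates` are hypothetical.

```python
def text_attributes(stress):
    """Aggregate attributes of a stress pattern such as "1001" (1 = stressed)."""
    return {
        "stressed_count": stress.count("1"),
        "starts_stressed": stress.startswith("1"),
        "ends_stressed": stress.endswith("1"),
    }


def select_candidates(stress, stored):
    """Keep stored utterances whose aggregate attributes equal the input's.

    Equality of these attributes is an assumed stand-in for the
    "substantially match" test in the claims.
    """
    wanted = text_attributes(stress)
    return [u for u in stored if text_attributes(u["stress"]) == wanted]


# Hypothetical database entries (contours omitted for brevity).
stored = [{"stress": "101"}, {"stress": "1001"}, {"stress": "110"}]
candidates = select_candidates("10001", stored)
```

A looser test, for example an edit distance below some tolerance, would fit the same interface.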

Current Assignee
Google LLC (Alphabet Inc.)
Original Assignee
Google Inc. (Alphabet Inc.)
Inventors
Jansche, Martin; Riley, Michael D.; Rosenberg, Andrew M.; Tai, Terry
Primary Examiner(s)
Han, Qi

Application Number

US12/271,568
Time in Patent Office

1,474 Days
Field of Search

704/263, 704/258, 704/261, 704/264, 794/263, 794/258, 794/261, 794/264, 794/260
US Class Current

704/263
CPC Class Codes

G10L 13/027 Concept to speech synthesis...

G10L 13/10 Prosody rules derived from ...
