Generating prosodic contours for synthesized speech
First Claim
1. A method implemented by a system of one or more computers, comprising:
receiving, at the system, text to be synthesized as a spoken utterance;
analyzing, by the system, the received text to determine attributes of the received text;
selecting, by the system, one or more candidate utterances from a database of stored utterances based on a comparison between the determined attributes of the received text and corresponding attributes of text representing the stored utterances;
determining, by the system for each candidate utterance, a distance between a prosodic contour of the candidate utterance and a hypothetical prosodic contour of the spoken utterance to be synthesized, the determination based on a model that relates a) distances between prosodic contours of pairs of the stored utterances to b) relationships between attributes of text of each of the respective pairs, wherein the model is embodied by information including, for each of the stored utterances:
a prosodic contour of the respective stored utterance,
one or more attributes of text of the respective stored utterance, and
first data relating a difference between the prosodic contour of the respective stored utterance to the prosodic contour of a second stored utterance to a difference between a first attribute of the text of the respective stored utterance and the first attribute of the text of the second stored utterance,
second data relating a difference between the prosodic contour of the respective stored utterance to the prosodic contour of a third stored utterance to a difference between the first attribute of the text of the respective stored utterance and the first attribute of the text of the third stored utterance,
wherein the second stored utterance and the third stored utterance are in the stored utterances, and
wherein prosodic contours represent prosodic characteristics of speech at different times;
selecting, by the system, a final candidate utterance having a prosodic contour with a closest distance to the hypothetical prosodic contour; and
generating, by the system, a prosodic contour for the text to be synthesized based on the contour of the final candidate utterance.
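The model element of the claim can be illustrated with a minimal sketch. It is not the patented implementation. It assumes a single numeric text attribute (a hypothetical stressed-syllable count), a root-mean-square contour distance, and an ordinary least-squares line as the relation between attribute differences and contour distances. The names `STORED`, `contour_distance`, `pairwise_data`, `fit_line`, and `estimate_distance` are all invented for illustration.

```python
from itertools import combinations
from math import sqrt

# Hypothetical stored utterances: one numeric text attribute (assumed here to
# be a stressed-syllable count) and a contour sampled at four points in time.
STORED = [
    {"attr": 2, "contour": [100.0, 120.0, 110.0, 90.0]},
    {"attr": 3, "contour": [105.0, 125.0, 112.0, 92.0]},
    {"attr": 5, "contour": [120.0, 140.0, 125.0, 100.0]},
]


def contour_distance(a, b):
    """Root-mean-square difference between two equal-length contours."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def pairwise_data(stored):
    """For each pair of stored utterances, pair the attribute difference
    with the distance between the two contours."""
    return [
        (abs(u["attr"] - v["attr"]), contour_distance(u["contour"], v["contour"]))
        for u, v in combinations(stored, 2)
    ]


def fit_line(points):
    """Least-squares fit of distance = slope * attr_diff + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def estimate_distance(model, attr_diff):
    """Estimate a contour distance from a text-attribute difference alone."""
    slope, intercept = model
    return slope * attr_diff + intercept


model = fit_line(pairwise_data(STORED))
print(estimate_distance(model, 1))
```

Because the hypothetical contour of the new text is never observed, the fitted relation lets a distance be estimated from text attributes alone, which is the role the claim assigns to the model.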
2 Assignments
0 Petitions
Abstract
The subject matter of this specification can be implemented in, among other things, a computer-implemented method including receiving text to be synthesized as a spoken utterance. The method includes analyzing the received text to determine attributes of the received text and selecting one or more utterances from a database based on a comparison between the attributes of the received text and attributes of text representing the stored utterances. The method includes determining, for each utterance, a distance between a contour of the utterance and a hypothetical contour of the spoken utterance, the determination based on a model that relates distances between pairs of contours of the utterances to relationships between attributes of text for the pairs. The method includes selecting a final utterance having a contour with a closest distance to the hypothetical contour and generating a contour for the received text based on the contour of the final utterance.
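The sequence of steps in the abstract can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not the method as implemented in the specification. The text attributes (word count and a question flag), the `DATABASE` contents, the `estimated_distance` stand-in for the model, and the linear-interpolation `resample` step are all hypothetical.

```python
def analyze(text):
    """Toy text analysis: word count and whether the text is a question."""
    return {"n_words": len(text.split()), "question": text.strip().endswith("?")}


# Hypothetical database: stored text plus one assumed pitch value (Hz) per word.
DATABASE = [
    {"text": "Where are you going?", "contour": [180.0, 170.0, 175.0, 220.0]},
    {"text": "Is it raining?", "contour": [175.0, 170.0, 215.0]},
    {"text": "The train is late.", "contour": [190.0, 180.0, 170.0, 150.0]},
]


def select_candidates(attrs, database):
    """Keep stored utterances whose question flag matches the input text."""
    return [u for u in database if analyze(u["text"])["question"] == attrs["question"]]


def estimated_distance(attrs, utterance, weight=10.0):
    """Stand-in for the model: distance grows with the word-count difference."""
    return weight * abs(attrs["n_words"] - analyze(utterance["text"])["n_words"])


def resample(contour, n):
    """Linearly interpolate a contour to n points."""
    if n == 1 or len(contour) == 1:
        return [contour[0]] * n
    out = []
    for i in range(n):
        pos = i * (len(contour) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out


def generate_contour(text, database=DATABASE):
    """Analyze, select candidates, pick the closest, and fit its contour."""
    attrs = analyze(text)
    candidates = select_candidates(attrs, database)
    if not candidates:
        raise ValueError("no candidate utterance matches the text attributes")
    final = min(candidates, key=lambda u: estimated_distance(attrs, u))
    return final["text"], resample(final["contour"], attrs["n_words"])


source_text, contour = generate_contour("Are you coming home today?")
print(source_text, contour)
```

The candidate filter and the distance estimate are kept as separate functions so that either could be swapped for a richer attribute comparison or a fitted model without changing the selection logic.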
59 Citations
34 Claims
1. A method implemented by a system of one or more computers, as recited in full under First Claim above.
Dependent claims: 2-22
23. A computer-implemented system comprising:
one or more computers having:
an interface to receive text to be synthesized as a spoken utterance;
a text analyzer to analyze the received text to determine attributes of the received text;
a candidate identifier to select one or more candidate utterances from a database of stored utterances based on a comparison between the determined attributes of the received text and corresponding attributes of text representing the stored utterances;
means for determining a distance between a prosodic contour of a candidate utterance and a hypothetical prosodic contour of the spoken utterance to be synthesized, the determination based on a model that relates a) distances between prosodic contours of pairs of the stored utterances to b) distances between attributes of text of each of the respective pairs and for selecting a final candidate utterance having a prosodic contour with a closest distance to the hypothetical prosodic contour, wherein prosodic contours represent prosodic characteristics of speech at different times; and
a prosodic contour aligner to generate a prosodic contour for the text to be synthesized based on the prosodic contour of the final candidate utterance;
wherein the system further comprises a memory for storing data for access by the means for determining the distance, the memory comprising information embodying the model used by the means for determining the distance, the information including, for each of the stored utterances:
a prosodic contour of the respective stored utterance,
one or more attributes of text of the respective stored utterance, and
first data relating a difference between the prosodic contour of the respective stored utterance to the prosodic contour of a second stored utterance to a difference between a first attribute of the text of the respective stored utterance and the first attribute of the text of the second stored utterance, and
second data relating a difference between the prosodic contour of the respective stored utterance to the prosodic contour of a third stored utterance to a difference between the first attribute of the text of the respective stored utterance and the first attribute of the text of the third stored utterance,
wherein the second stored utterance and the third stored utterance are in the stored utterances.
Dependent claims: 24-28
29. A computer-implemented system comprising:
a computer interface arranged to receive text to be synthesized as a spoken utterance;
a text analyzer to analyze the received text to determine attributes of the received text;
a candidate identifier to select one or more candidate utterances from a database of stored utterances based on a comparison between the determined attributes of the received text and corresponding attributes of text representing the stored utterances;
a candidate selector to determine distances between respective prosodic contours of a candidate utterance and the spoken utterance using a model that relates a) distances between respective prosodic contours of pairs of the stored utterances to b) distances between attributes of text of each of the respective pairs, and to select a final candidate utterance based on the determined distances; and
a memory for storing data for access by the candidate selector, the memory comprising information embodying the model used by the candidate selector, the information including, for each of the stored utterances:
a prosodic contour of the respective stored utterance,
one or more attributes of text of the respective stored utterance, and
first data relating a difference between the prosodic contour of the respective stored utterance to the prosodic contour of a second stored utterance to a difference between a first attribute of the text of the respective stored utterance and the first attribute of the text of the second stored utterance,
second data relating a difference between the prosodic contour of the respective stored utterance to the prosodic contour of a third stored utterance to a difference between the first attribute of the text of the respective stored utterance and the first attribute of the text of the third stored utterance,
wherein the second stored utterance and the third stored utterance are in the stored utterances,
wherein prosodic contours represent prosodic characteristics of speech at different times.
Dependent claims: 30-34
Specification