Predicting pronunciations with word stress
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating word pronunciations. One of the methods includes determining, by one or more computers, spelling data that indicates the spelling of a word, providing the spelling data as input to a trained recurrent neural network, the trained recurrent neural network being trained to indicate characteristics of word pronunciations based at least on data indicating the spelling of words, receiving output indicating a stress pattern for pronunciation of the word generated by the trained recurrent neural network in response to providing the spelling data as input, using the output of the trained recurrent neural network to generate pronunciation data indicating the stress pattern for a pronunciation of the word, and providing, by the one or more computers, the pronunciation data to a text-to-speech system or an automatic speech recognition system.
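The data flow described in the abstract (spelling data in, a stress pattern out) can be sketched as follows. This is a minimal illustration only, not the patented implementation: the Elman-style network, its size, the per-letter one-hot encoding, and the three-way label set are all assumptions, and the weights are random rather than trained, so the predicted labels carry no linguistic meaning.

```python
import math
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
STRESS_LABELS = ("none", "primary", "secondary")  # hypothetical label set
HIDDEN = 8  # hypothetical hidden-state size


def one_hot(ch):
    """Spelling data for one letter: a 26-dimensional one-hot vector."""
    vec = [0.0] * len(ALPHABET)
    vec[ALPHABET.index(ch)] = 1.0  # raises ValueError for non a-z input
    return vec


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyRNN:
    """Elman-style recurrent network. Weights are random, NOT trained."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w_in = rand_matrix(n_hidden, n_in, rng)
        self.w_rec = rand_matrix(n_hidden, n_hidden, rng)
        self.w_out = rand_matrix(n_out, n_hidden, rng)
        self.n_hidden = n_hidden

    def run(self, inputs):
        """Return one probability distribution over labels per input step."""
        hidden = [0.0] * self.n_hidden
        outputs = []
        for x in inputs:
            pre = [a + b for a, b in zip(matvec(self.w_in, x),
                                         matvec(self.w_rec, hidden))]
            hidden = [math.tanh(p) for p in pre]  # state carried to next letter
            outputs.append(softmax(matvec(self.w_out, hidden)))
        return outputs


def predict_stress(word, rnn):
    """Map a spelling to one stress label per letter."""
    spelling = [one_hot(ch) for ch in word.lower()]
    probs = rnn.run(spelling)
    return [STRESS_LABELS[p.index(max(p))] for p in probs]


rnn = TinyRNN(len(ALPHABET), HIDDEN, len(STRESS_LABELS))
labels = predict_stress("hire", rnn)
print(list(zip("hire", labels)))
```

In a real system the weights would come from training on words with known pronunciations, and the resulting pronunciation data would be handed to a text-to-speech or speech recognition component rather than printed.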
19 Claims
1. A method performed by one or more computers of a text-to-speech synthesis system, the method comprising:
determining, by the one or more computers of the text-to-speech synthesis system, spelling data that indicates the spelling of a word;
determining, by the one or more computers of the text-to-speech synthesis system, first pronunciation data that indicates at least one stress location for the word;
providing, by the one or more computers of the text-to-speech synthesis system, the spelling data and the first pronunciation data as input to a trained recurrent neural network, the trained recurrent neural network being trained to indicate characteristics of word pronunciations based at least on data indicating the spelling of words;
receiving, by the one or more computers of the text-to-speech synthesis system, output representing a stress pattern for pronunciation of the word, the output being generated by the trained recurrent neural network in response to providing the spelling data and the first pronunciation data as input;
using, by the one or more computers of the text-to-speech synthesis system, the output of the trained recurrent neural network to generate second pronunciation data indicating a stress pattern for a pronunciation of the word, wherein the second pronunciation data is different from the first pronunciation data that indicates at least one stress location for the word;
generating, using the second pronunciation data, audio data that includes a synthesized utterance of the word and applies stress to the word based on the stress pattern indicated by the second pronunciation data; and
providing, by the one or more computers of the text-to-speech synthesis system, the audio data to a system that includes at least one speaker for audible presentation of the synthesized utterance of the word using the audio data.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A text-to-speech synthesis system comprising:
one or more data processing apparatus; and
one or more non-transitory computer readable storage media in data communication with the one or more data processing apparatus and storing instructions executable by the one or more data processing apparatus and upon such execution cause the one or more data processing apparatus to perform operations comprising:
determining spelling data that indicates the spelling of a word;
determining first pronunciation data that indicates at least one stress location for the word;
providing the spelling data and the first pronunciation data as input to a trained recurrent neural network, the trained recurrent neural network being trained to indicate characteristics of word pronunciations based at least on data indicating the spelling of words;
receiving output representing a stress pattern for pronunciation of the word, the output being generated by the trained recurrent neural network in response to providing the spelling data and the first pronunciation data as input;
using the output of the trained recurrent neural network to generate second pronunciation data indicating a stress pattern for a pronunciation of the word, wherein the second pronunciation data is different from the first pronunciation data that indicates at least one stress location for the word;
generating, using the second pronunciation data, audio data that includes a synthesized utterance of the word and applies stress to the word based on the stress pattern indicated by the second pronunciation data; and
providing, as output of the text-to-speech synthesis system, the audio data to a system that includes at least one speaker for audible presentation of the synthesized utterance of the word using the audio data.
Dependent claims: 11, 12, 13, 14, 15, 16
17. One or more non-transitory computer readable storage media storing instructions executable by one or more data processing apparatus of a text-to-speech synthesis system and upon such execution cause the one or more data processing apparatus to perform operations comprising:
determining spelling data that indicates the spelling of a word;
determining first pronunciation data that indicates at least one stress location for the word;
providing the spelling data and the first pronunciation data as input to a trained recurrent neural network, the trained recurrent neural network being trained to indicate characteristics of word pronunciations based at least on data indicating the spelling of words;
receiving output representing a stress pattern for pronunciation of the word, the output being generated by the trained recurrent neural network in response to providing the spelling data and the first pronunciation data as input;
using the output of the trained recurrent neural network to generate second pronunciation data indicating a stress pattern for a pronunciation of the word, wherein the second pronunciation data is different from the first pronunciation data that indicates at least one stress location for the word;
generating, using the second pronunciation data, audio data that includes a synthesized utterance of the word and applies stress to the word based on the stress pattern indicated by the second pronunciation data; and
providing, as output of the text-to-speech synthesis system, the audio data to a system that includes at least one speaker for audible presentation of the synthesized utterance of the word using the audio data.
Dependent claims: 18, 19
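The independent claims above recite the same sequence of operations, which can be traced end to end in a short sketch. Everything concrete here is an assumption made for illustration: the word is supplied already split into syllables, `stand_in_network` is a hand-written placeholder rather than a trained recurrent neural network, the stress levels are coded 0 (unstressed), 1 (primary), and 2 (secondary), and the "utterance" is a sequence of sine tones whose loudness and pitch follow the stress pattern. The sketch returns WAV bytes and leaves playback through a speaker to the caller.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000  # hypothetical output sample rate in Hz


def stand_in_network(spelling, first_stress):
    """Placeholder for the trained recurrent network (NOT a real model).

    Returns one (none, primary, secondary) score triple per syllable.
    """
    scores = []
    for i, syllable in enumerate(spelling):
        primary = 1.0 if i in first_stress else 0.1
        secondary = 0.5 if (i not in first_stress and len(syllable) > 3) else 0.0
        scores.append((0.3, primary, secondary))
    return scores


def to_stress_pattern(network_output):
    """Second pronunciation data: one stress level per syllable."""
    return [triple.index(max(triple)) for triple in network_output]


def synthesize(pattern, ms_per_syllable=150):
    """Render one tone per syllable; stressed syllables are louder and higher."""
    gain = {0: 0.3, 1: 0.9, 2: 0.6}
    pitch = {0: 180.0, 1: 240.0, 2: 210.0}
    samples_per_syllable = SAMPLE_RATE * ms_per_syllable // 1000
    frames = bytearray()
    for level in pattern:
        for t in range(samples_per_syllable):
            value = gain[level] * math.sin(
                2 * math.pi * pitch[level] * t / SAMPLE_RATE)
            frames += struct.pack("<h", int(value * 32767))  # 16-bit PCM
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(frames))
    return buf.getvalue()


# Spelling data (hypothetical syllable split of "photograph").
spelling = ["pho", "to", "graph"]
# First pronunciation data: a single known stress location (syllable 0).
first_stress = {0}
# Network output, then second pronunciation data (a full stress pattern).
network_output = stand_in_network(spelling, first_stress)
second_pattern = to_stress_pattern(network_output)
# Audio data whose stress follows the second pronunciation data.
audio = synthesize(second_pattern)
print(second_pattern, len(audio))
```

Here the first pronunciation data names only one stressed syllable, while the second pronunciation data assigns a level to every syllable, which is one way the two could differ as the claims require.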
Specification