Methods and apparatus for predicting prosody in speech synthesis
First Claim
1. A method comprising:
- comparing an input text to a data set of text fragments to select a corresponding text fragment for at least a portion of the input text, wherein selecting the corresponding text fragment comprises identifying within the at least a portion of the input text a first sequence of words beginning with a first function word and including one or more words following the first function word, identifying a grammatical type of the first function word beginning the first sequence of words, constraining the identified first sequence of words within the at least a portion of the input text to be matched as a unit to a contiguous sequence of words in a text fragment in the data set, and selecting as the corresponding text fragment a text fragment including as the contiguous sequence of words a second sequence of words beginning with a second function word that is a different word from the first function word but is of the same grammatical type as the first function word, the corresponding text fragment being associated with spoken audio of at least the second sequence of words, wherein the second sequence of words within the corresponding text fragment includes at least one word not present in the first sequence of words within the at least a portion of the input text;
determining an alignment of the corresponding text fragment with the at least a portion of the input text; and
using a computer, synthesizing speech from the at least a portion of the input text, wherein the synthesizing comprises extracting prosody from the spoken audio of the second sequence of words, including from the at least one word not present in the first sequence of words, and applying the extracted prosody in synthesizing the speech using the alignment of the corresponding text fragment with the at least a portion of the input text.
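The selection step recited in the claim above can be illustrated with a minimal sketch. It is not the patented implementation: the function-word lexicon, the helper names `find_chunks` and `match_chunk`, and the requirement that matched sequences have equal length are all assumptions made for illustration. The sketch treats a function word plus the content words that follow it as one unit, then looks for a unit in the data set that starts with a different function word of the same grammatical type.

```python
# Hypothetical lexicon mapping function words to a grammatical type.
FUNCTION_WORD_TYPES = {
    "the": "determiner", "a": "determiner", "an": "determiner",
    "in": "preposition", "on": "preposition", "at": "preposition",
    "and": "conjunction", "but": "conjunction",
}


def find_chunks(words):
    """Return (start, end) spans: a function word plus the content words after it.

    A span is kept only if at least one word follows the function word.
    """
    spans = []
    i = 0
    while i < len(words):
        if words[i].lower() in FUNCTION_WORD_TYPES:
            j = i + 1
            while j < len(words) and words[j].lower() not in FUNCTION_WORD_TYPES:
                j += 1
            if j - i >= 2:
                spans.append((i, j))
            i = j
        else:
            i += 1
    return spans


def match_chunk(chunk, fragments):
    """Find a contiguous unit in some fragment that can stand in for `chunk`.

    The unit must begin with a different function word of the same
    grammatical type. Equal length is an illustrative simplification.
    Returns (fragment_index, start, end) or None.
    """
    first = chunk[0].lower()
    gtype = FUNCTION_WORD_TYPES[first]
    for frag_id, frag_words in enumerate(fragments):
        for start, end in find_chunks(frag_words):
            lead = frag_words[start].lower()
            same_type = FUNCTION_WORD_TYPES[lead] == gtype
            if lead != first and same_type and end - start == len(chunk):
                return frag_id, start, end
    return None
```

For example, with the hypothetical fragments `["he", "sat", "on", "the", "bench"]` and `["she", "found", "a", "shell"]`, the input unit "the beach" is matched to "a shell": both begin with a determiner, but the determiners differ, and "shell" is a word not present in the input.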
Abstract
Techniques for predicting prosody in speech synthesis may make use of a data set of example text fragments with corresponding aligned spoken audio. To predict prosody for synthesizing an input text, the input text may be compared with the data set of example text fragments to select a best matching sequence of one or more example text fragments, each example text fragment in the sequence being paired with a portion of the input text. The selected example text fragment sequence may be aligned with the input text, e.g., at the word level, such that prosody may be extracted from the audio aligned with the example text fragments, and the extracted prosody may be applied to the synthesis of the input text using the alignment between the input text and the example text fragments.
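The word-level alignment and prosody transfer described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions: per-word prosody is assumed to be already extracted from the aligned audio and is reduced here to a duration and a mean pitch, the alignment is a plain proportional position mapping, and the names `WordProsody`, `ExampleFragment`, `align`, and `transfer_prosody` are hypothetical rather than taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class WordProsody:
    duration_s: float   # spoken duration of the word in the example audio
    mean_f0_hz: float   # average pitch over the word


@dataclass
class ExampleFragment:
    words: list     # the words of the example text fragment
    prosody: list   # one WordProsody per word, assumed pre-extracted


def align(input_words, example_words):
    """Map each input word index to an example word index by relative position."""
    n, m = len(input_words), len(example_words)
    return [(i, i * m // n) for i in range(n)]


def transfer_prosody(input_words, fragment):
    """Pair each input word with the prosody of its aligned example word."""
    pairs = align(input_words, fragment.words)
    return [(input_words[i], fragment.prosody[j]) for i, j in pairs]
```

A synthesizer could then use each returned pair as a per-word duration and pitch target. A real system would likely align at a finer granularity and handle insertions and deletions explicitly, which this sketch does not attempt.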
60 Claims
1. A method comprising:
comparing an input text to a data set of text fragments to select a corresponding text fragment for at least a portion of the input text, wherein selecting the corresponding text fragment comprises identifying within the at least a portion of the input text a first sequence of words beginning with a first function word and including one or more words following the first function word, identifying a grammatical type of the first function word beginning the first sequence of words, constraining the identified first sequence of words within the at least a portion of the input text to be matched as a unit to a contiguous sequence of words in a text fragment in the data set, and selecting as the corresponding text fragment a text fragment including as the contiguous sequence of words a second sequence of words beginning with a second function word that is a different word from the first function word but is of the same grammatical type as the first function word, the corresponding text fragment being associated with spoken audio of at least the second sequence of words, wherein the second sequence of words within the corresponding text fragment includes at least one word not present in the first sequence of words within the at least a portion of the input text; determining an alignment of the corresponding text fragment with the at least a portion of the input text; and using a computer, synthesizing speech from the at least a portion of the input text, wherein the synthesizing comprises extracting prosody from the spoken audio of the second sequence of words, including from the at least one word not present in the first sequence of words, and applying the extracted prosody in synthesizing the speech using the alignment of the corresponding text fragment with the at least a portion of the input text. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
21. A system comprising:
at least one memory storing processor-executable instructions; and at least one processor operatively coupled to the at least one memory, the at least one processor being configured to execute the processor-executable instructions to perform a method comprising: comparing an input text to a data set of text fragments to select a corresponding text fragment for at least a portion of the input text, wherein selecting the corresponding text fragment comprises identifying within the at least a portion of the input text a first sequence of words beginning with a first function word and including one or more words following the first function word, identifying a grammatical type of the first function word beginning the first sequence of words, constraining the identified first sequence of words within the at least a portion of the input text to be matched as a unit to a contiguous sequence of words in a text fragment in the data set, and selecting as the corresponding text fragment a text fragment including as the contiguous sequence of words a second sequence of words beginning with a second function word that is a different word from the first function word but is of the same grammatical type as the first function word, the corresponding text fragment being associated with spoken audio of at least the second sequence of words, wherein the second sequence of words within the corresponding text fragment includes at least one word not present in the first sequence of words within the at least a portion of the input text; determining an alignment of the corresponding text fragment with the at least a portion of the input text; and synthesizing speech from the at least a portion of the input text, wherein the synthesizing comprises extracting prosody from the spoken audio of the second sequence of words, including from the at least one word not present in the first sequence of words, and applying the extracted prosody in synthesizing the speech using the alignment of the corresponding text fragment with the at least a portion of the input text. - View Dependent Claims (22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40)
41. At least one non-transitory computer-readable storage medium encoded with a plurality of computer-executable instructions that, when executed, perform a method comprising:
comparing an input text to a data set of text fragments to select a corresponding text fragment for at least a portion of the input text, wherein selecting the corresponding text fragment comprises identifying within the at least a portion of the input text a first sequence of words beginning with a first function word and including one or more words following the first function word, identifying a grammatical type of the first function word beginning the first sequence of words, constraining the identified first sequence of words within the at least a portion of the input text to be matched as a unit to a contiguous sequence of words in a text fragment in the data set, and selecting as the corresponding text fragment a text fragment including as the contiguous sequence of words a second sequence of words beginning with a second function word that is a different word from the first function word but is of the same grammatical type as the first function word, the corresponding text fragment being associated with spoken audio of at least the second sequence of words, wherein the second sequence of words within the corresponding text fragment includes at least one word not present in the first sequence of words within the at least a portion of the input text; determining an alignment of the corresponding text fragment with the at least a portion of the input text; and synthesizing speech from the at least a portion of the input text, wherein the synthesizing comprises extracting prosody from the spoken audio of the second sequence of words, including from the at least one word not present in the first sequence of words, and applying the extracted prosody in synthesizing the speech using the alignment of the corresponding text fragment with the at least a portion of the input text. - View Dependent Claims (42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60)
Specification