Context-aware unit selection
First Claim
1. A machine-implemented method of text-to-speech generation, comprising:
at a device comprising one or more processors and memory:
receiving a text input to be converted to speech, the text input including a sequence of text input units; and
for each text input unit of the sequence of text input units:
selecting, from a pool of pre-recorded segments of speech, a respective plurality of candidate speech units for the text input unit, wherein the respective plurality of candidate speech units differ from one another in regard to one or more of a plurality of characteristics;
for each of the plurality of characteristics, determining a respective degree of variation present among the respective plurality of candidate speech units selected from the pool of pre-recorded segments of speech;
determining a respective weight set for the text input unit, the respective weight set including a respective weight for each of the plurality of characteristics based on relative magnitudes of the respective degrees of variations that are present among the candidate speech units for the plurality of characteristics; and
based on the respective weight set for the text input unit, selecting a respective one of the respective plurality of candidate speech units to synthesize a respective speech output corresponding to the text input unit.
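The per-unit steps recited in the claim (measure variation per characteristic, derive a weight set from the relative magnitudes, then select one candidate) can be illustrated with a minimal sketch. The claim does not say how variation is measured, how weights follow from it, or how the final choice is scored, so everything here is an assumption made for illustration: variation is taken as population variance, each weight is that characteristic's share of the total variation, and selection minimizes a weighted absolute distance to a hypothetical target. The names `variation_per_characteristic`, `weight_set`, `select_candidate`, and the `pitch`/`duration` characteristics are not from the source.

```python
from statistics import pvariance

def variation_per_characteristic(candidates):
    """Degree of variation of each characteristic across the candidates.

    Assumption: population variance stands in for "degree of variation".
    """
    names = candidates[0].keys()
    return {n: pvariance([c[n] for c in candidates]) for n in names}

def weight_set(variation):
    """One weight per characteristic from the relative magnitudes of variation.

    Assumption: each weight is the characteristic's share of the total variation.
    """
    total = sum(variation.values())
    if total == 0:
        # All candidates are identical: fall back to uniform weights.
        return {n: 1.0 / len(variation) for n in variation}
    return {n: v / total for n, v in variation.items()}

def select_candidate(candidates, target, weights):
    """Index of the candidate with the lowest weighted distance to the target.

    Assumption: the target values and the distance measure are hypothetical.
    """
    def cost(c):
        return sum(weights[n] * abs(c[n] - target[n]) for n in weights)
    return min(range(len(candidates)), key=lambda i: cost(candidates[i]))

# Hypothetical candidates for one text input unit.
candidates = [
    {"pitch": 100.0, "duration": 0.10},
    {"pitch": 140.0, "duration": 0.10},
    {"pitch": 120.0, "duration": 0.10},
]
target = {"pitch": 118.0, "duration": 0.12}

variation = variation_per_characteristic(candidates)
weights = weight_set(variation)
best = select_candidate(candidates, target, weights)
print(weights, best)
```

In this example the candidates do not differ in duration, so all of the weight falls on pitch and the third candidate is chosen. Raw variances depend on each characteristic's units, so a practical system would presumably normalize the characteristics before comparing their variation.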
Abstract
Methods and apparatuses to perform context-aware unit selection for natural language processing are described. Streams of information associated with input units are received. The streams of information are analyzed in a context associated with first candidate units to determine a first set of weights of the streams of information. A first candidate unit is selected from the first candidate units based on the first set of weights of the streams of information. The streams of information are analyzed in the context associated with second candidate units to determine a second set of weights of the streams of information. A second candidate unit is selected from the second candidate units to concatenate with the first candidate unit based on the second set of weights of the streams of information.
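The sequence described in the abstract, where a fresh set of weights is determined for each group of candidate units before a unit is selected and appended to the output, might look like the sketch below. The abstract does not specify how the streams are analyzed or how a unit is scored, so this is only an illustration under stated assumptions: the spread of each stream among the current candidates is measured by population variance, the weights are shares of that spread, and each unit is chosen independently by weighted distance to a hypothetical target. No cost for the join between neighbouring units is modelled. The names `weights_for` and `select_sequence` and the `pitch`/`duration` streams are not from the source.

```python
from statistics import pvariance

def weights_for(candidates):
    """Weights of the streams in the context of one group of candidates.

    Assumption: a stream's weight is its share of the total variance.
    """
    streams = candidates[0].keys()
    spread = {s: pvariance([c[s] for c in candidates]) for s in streams}
    total = sum(spread.values())
    if total == 0:
        return {s: 1.0 / len(spread) for s in spread}
    return {s: v / total for s, v in spread.items()}

def select_sequence(candidate_sets, targets):
    """Select one unit per position, recomputing the weights at each position."""
    chosen, weight_sets = [], []
    for candidates, target in zip(candidate_sets, targets):
        w = weights_for(candidates)
        best = min(
            candidates,
            key=lambda c: sum(w[s] * abs(c[s] - target[s]) for s in w),
        )
        chosen.append(best)      # units are concatenated in order
        weight_sets.append(w)
    return chosen, weight_sets

# Hypothetical data: the first group varies only in pitch, the second only in duration.
candidate_sets = [
    [{"pitch": 100.0, "duration": 0.2}, {"pitch": 140.0, "duration": 0.2}],
    [{"pitch": 120.0, "duration": 0.1}, {"pitch": 120.0, "duration": 0.3}],
]
targets = [
    {"pitch": 130.0, "duration": 0.2},
    {"pitch": 120.0, "duration": 0.15},
]

chosen, weight_sets = select_sequence(candidate_sets, targets)
for unit, w in zip(chosen, weight_sets):
    print(unit, w)
```

The point of the example is that the two positions end up with different weight sets: pitch dominates the first selection and duration dominates the second, because that is where the respective candidates actually differ.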
754 Citations
21 Claims
1. A machine-implemented method of text-to-speech generation, comprising:
at a device comprising one or more processors and memory:
receiving a text input to be converted to speech, the text input including a sequence of text input units; and
for each text input unit of the sequence of text input units:
selecting, from a pool of pre-recorded segments of speech, a respective plurality of candidate speech units for the text input unit, wherein the respective plurality of candidate speech units differ from one another in regard to one or more of a plurality of characteristics;
for each of the plurality of characteristics, determining a respective degree of variation present among the respective plurality of candidate speech units selected from the pool of pre-recorded segments of speech;
determining a respective weight set for the text input unit, the respective weight set including a respective weight for each of the plurality of characteristics based on relative magnitudes of the respective degrees of variations that are present among the candidate speech units for the plurality of characteristics; and
based on the respective weight set for the text input unit, selecting a respective one of the respective plurality of candidate speech units to synthesize a respective speech output corresponding to the text input unit.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A non-transitory computer-readable medium having instructions stored thereon, the instructions, when executed by one or more processors, cause the processors to perform operations comprising:
receiving a text input to be converted to speech, the text input including a sequence of text input units; and
for each text input unit of the sequence of text input units:
selecting, from a pool of pre-recorded segments of speech, a respective plurality of candidate speech units for the text input unit, wherein the respective plurality of candidate speech units differ from one another in regard to one or more of a plurality of characteristics;
for each of the plurality of characteristics, determining a respective degree of variation present among the respective plurality of candidate speech units selected from the pool of pre-recorded segments of speech;
determining a respective weight set for the text input unit, the respective weight set including a respective weight for each of the plurality of characteristics based on relative magnitudes of the respective degrees of variations that are present among the candidate speech units for the plurality of characteristics; and
based on the respective weight set for the text input unit, selecting a respective one of the respective plurality of candidate speech units to synthesize a respective speech output corresponding to the text input unit.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A system, comprising:
one or more processors; and
memory having instructions stored thereon, the instructions, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
receiving a text input to be converted to speech, the text input including a sequence of text input units; and
for each text input unit of the sequence of text input units:
selecting, from a pool of pre-recorded segments of speech, a respective plurality of candidate speech units for the text input unit, wherein the respective plurality of candidate speech units differ from one another in regard to one or more of a plurality of characteristics;
for each of the plurality of characteristics, determining a respective degree of variation present among the respective plurality of candidate speech units selected from the pool of pre-recorded segments of speech;
determining a respective weight set for the text input unit, the respective weight set including a respective weight for each of the plurality of characteristics based on relative magnitudes of the respective degrees of variations that are present among the candidate speech units for the plurality of characteristics; and
based on the respective weight set for the text input unit, selecting a respective one of the respective plurality of candidate speech units to synthesize a respective speech output corresponding to the text input unit.
Dependent claims: 16, 17, 18, 19, 20, 21
Specification