Coarticulation method for audio-visual text-to-speech synthesis
First Claim
1. A method for generating photorealistic talking heads, comprising the steps of:
receiving an input stimulus;
reading data from a first library comprising images of phoneme sequences which correspond to the input stimulus;
reading, based on the data read from the first library, corresponding data from a second library comprising images of a talking subject; and
generating, using the data read from the second library, an animated sequence of a talking head tracking the input stimulus.
Abstract
A method for generating animated sequences of talking heads in text-to-speech applications, wherein a processor samples a plurality of frames comprising image samples. Representative parameters are extracted from the image samples and stored in an animation library. The processor also samples a plurality of multiphones comprising images together with their associated sounds. The processor extracts parameters from these images, comprising data characterizing mouth shapes, maps, rules, or equations, and stores the resulting parameters and sound information in a coarticulation library. The animated sequence begins with the processor considering an input phoneme sequence, recalling from the coarticulation library the parameters associated with that sequence, and selecting appropriate image samples from the animation library based on that sequence. The image samples are concatenated together, and the corresponding sound is output, to form the animated sequence.
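The abstract's pipeline (two libraries, phoneme-driven lookup, concatenation) can be sketched as follows. This is a minimal illustration, not the patent's implementation; the library contents, parameter names, and sound placeholder are all assumptions.

```python
# Animation library: image-sample parameters keyed by a sample id
# (illustrative names and values, not from the patent).
animation_library = {
    "mouth_closed":    {"jaw": 0.0, "lip_round": 0.0},
    "mouth_open_wide": {"jaw": 0.9, "lip_round": 0.1},
}

# Coarticulation library: for each multiphone, the image samples it maps to
# plus its associated sound (placeholder bytes here).
coarticulation_library = {
    ("p", "a"): {"samples": ["mouth_closed", "mouth_open_wide"],
                 "sound": b"\x00\x01"},
}

def synthesize(phoneme_sequence):
    """Recall coarticulation parameters for the input phoneme sequence,
    select the matching image samples from the animation library, and
    return the concatenated frames with the associated sound."""
    entry = coarticulation_library[phoneme_sequence]
    frames = [animation_library[s] for s in entry["samples"]]
    return frames, entry["sound"]

frames, sound = synthesize(("p", "a"))
```

The two-library split mirrors the claims: appearance data lives in the animation library, while the coarticulation library only records which samples (and sounds) each phoneme sequence selects.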
32 Claims
1. A method for generating photorealistic talking heads, comprising the steps of:
receiving an input stimulus;
reading data from a first library comprising images of phoneme sequences which correspond to the input stimulus;
reading, based on the data read from the first library, corresponding data from a second library comprising images of a talking subject; and
generating, using the data read from the second library, an animated sequence of a talking head tracking the input stimulus. (Dependent claims: 2–12)
13. A method for generating a photo-realistic talking head for a text-to-speech synthesis application, comprising the steps of:
sampling images of a subject;
extracting a plurality of parameters from each image sample;
storing the image sample parameters into an animation library;
sampling multiphone images of the subject;
sampling sounds associated with the multiphone images;
extracting a plurality of parameters from each multiphone image sample;
storing the multiphone image parameters and associated sound samples into a coarticulation library;
reading, based on an input stimulus comprising one or more phoneme sequences, parameters from the coarticulation library corresponding to each phoneme sequence; and
generating, using parameters from the animation library corresponding to the read parameters, a sequence of animated frames, the sequence tracking the input stimulus. (Dependent claims: 14–25)
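The library-building steps of claim 13 (sample images, extract parameters, store them) can be sketched as below. The parameter extraction used here, simple grey-level statistics over a 2-D pixel list, is a stand-in assumption; the patent does not specify this feature set.

```python
def extract_parameters(image):
    """Reduce an image (2-D list of grey values) to a small parameter set.
    Mean/max intensity are illustrative stand-ins for the patent's features."""
    pixels = [p for row in image for p in row]
    return {"mean": sum(pixels) / len(pixels), "max": max(pixels)}

animation_library = {}
coarticulation_library = {}

def store_image_sample(sample_id, image):
    # Claim 13: extract parameters from each image sample and store them
    # in the animation library.
    animation_library[sample_id] = extract_parameters(image)

def store_multiphone(phonemes, image, sound):
    # Claim 13: store multiphone image parameters together with the
    # associated sound sample in the coarticulation library.
    coarticulation_library[tuple(phonemes)] = {
        "params": extract_parameters(image),
        "sound": sound,
    }

store_image_sample("frame0", [[0, 255], [128, 128]])
store_multiphone(["a", "u"], [[10, 20], [30, 40]], sound=b"\x00\x01")
```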
26. A processor-based method for generating a photo-realistic talking head for a text-to-speech synthesis application, comprising the steps of:
sampling images of a subject;
decomposing the subject images into a hierarchy of segments;
writing for each segment a set of parameters into memory, the segment parameter sets characterizing each segment;
sampling a plurality of phoneme sequences;
writing for each phoneme sequence a set of parameters into memory, the phoneme sequence parameter sets characterizing each phoneme sequence;
reading from memory, based upon an input stimulus, specific phoneme sequence parameter sets corresponding to the stimulus;
reading from memory, based upon the specific phoneme sequence parameter sets, corresponding specific segment parameter sets; and
generating a concatenated sequence of animated frames using the corresponding specific segment parameter sets. (Dependent claims: 27–30)
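Claim 26's decomposition of the subject image into a hierarchy of segments, each with its own parameter set written to memory, can be sketched as a tree walk. Segment names and parameters below are assumptions for illustration only.

```python
# Hypothetical segment hierarchy: a head decomposed into sub-segments,
# each carrying its own parameter set (names/values are assumptions).
segment_hierarchy = {
    "head": {
        "params": {"tilt": 0.0},
        "children": {
            "mouth": {"params": {"open": 0.3}, "children": {}},
            "eyes":  {"params": {"blink": 0.0}, "children": {}},
        },
    },
}

def write_segments(node, path=""):
    """Walk the hierarchy and write each segment's parameter set into a
    flat memory table keyed by its path, per claim 26's 'writing for each
    segment a set of parameters into memory'."""
    table = {}
    for name, seg in node.items():
        key = f"{path}/{name}"
        table[key] = seg["params"]
        table.update(write_segments(seg["children"], key))
    return table

memory = write_segments(segment_hierarchy)
```

Keying segments by path lets generation later read back exactly the segment parameter sets that a phoneme sequence selects.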
31. A method for generating a photo-realistic talking head for a text-to-speech synthesis application, comprising the steps of:
sampling images of a talking head;
extracting a plurality of parameters from each image sample;
writing the image sample parameters into an animation library;
sampling multiphone images of the subject;
sampling sounds associated with the multiphone images;
converting the sound samples into digital acoustic parameters;
extracting a plurality of parameters from each multiphone image sample;
storing the multiphone image parameters and associated acoustic parameters into a coarticulation library;
reading, based on an input stimulus comprising one or more phoneme sequences, parameters from the coarticulation library associated with each phoneme sequence; and
generating, using parameters from the animation library, a sequence of animated frames corresponding to the read parameters and a sequence of associated sounds in synchrony with the animated frames sequence, the sequence of animated frames tracking the input stimulus. (Dependent claim: 32)
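Claim 31's synchronized output, animated frames paired with digital acoustic parameters, can be sketched by slicing the acoustic parameter stream per video frame. The frame rate and acoustic hop size below are assumed values, not taken from the patent.

```python
FRAME_RATE = 25       # video frames per second (assumption)
ACOUSTIC_HOP = 0.01   # seconds per acoustic parameter vector (assumption)

def synchronize(frames, acoustic_params):
    """Pair each animated frame with the slice of acoustic parameter
    vectors covering the same time span, so sound tracks the animation."""
    per_frame = round((1 / FRAME_RATE) / ACOUSTIC_HOP)  # vectors per frame
    pairs = []
    for i, frame in enumerate(frames):
        pairs.append((frame, acoustic_params[i * per_frame:(i + 1) * per_frame]))
    return pairs

frames = ["f0", "f1"]
acoustic = list(range(8))  # 8 vectors = 0.08 s of sound at the assumed hop
pairs = synchronize(frames, acoustic)
```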