Coarticulation method for audio-visual text-to-speech synthesis
Abstract
A method for generating animated sequences of talking heads in text-to-speech applications wherein a processor samples a plurality of frames comprising image samples. Representative parameters are extracted from the image samples and stored in an animation library. The processor also samples a plurality of multiphones comprising images together with their associated sounds. The processor extracts parameters from these images comprising data characterizing mouth shapes, maps, rules, or equations, and stores the resulting parameters and sound information in a coarticulation library. The animated sequence begins with the processor considering an input phoneme sequence, recalling from the coarticulation library parameters associated with that sequence, and selecting appropriate image samples from the animation library based on that sequence. The image samples are concatenated together, and the corresponding sound is output, to form the animated synthesis.
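Read as an implementation outline, the abstract describes two offline steps (building an animation library of image samples with representative parameters, and building a coarticulation library of parameters and sound keyed by phoneme sequence) followed by runtime synthesis. Below is a minimal sketch of the two library-building steps; every name and the parameter representation are hypothetical, since the patent leaves these open (maps, rules, or equations).

```python
# Minimal sketch of the two library-building steps described in the
# abstract. All names and the parameter layout are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Sample:
    image: bytes   # one sampled frame of the talking subject
    params: tuple  # representative mouth-shape parameters for that frame

@dataclass
class Libraries:
    animation: list = field(default_factory=list)       # image samples + parameters
    coarticulation: dict = field(default_factory=dict)  # phoneme sequence -> (parameters, sound)

def extract_mouth_parameters(image: bytes) -> tuple:
    # Stand-in for the extraction step; a real system might measure
    # lip opening, lip width, jaw rotation, and similar quantities.
    return (0.0,)

def build_libraries(frames, multiphones) -> Libraries:
    libs = Libraries()
    # Sample frames and store representative parameters alongside the
    # images in the animation library.
    for image in frames:
        libs.animation.append(Sample(image, extract_mouth_parameters(image)))
    # Sample multiphones (images plus their associated sound) and store
    # the extracted parameters and the sound in the coarticulation
    # library, keyed by the phoneme sequence.
    for phonemes, images, sound in multiphones:
        params = [extract_mouth_parameters(img) for img in images]
        libs.coarticulation[tuple(phonemes)] = (params, sound)
    return libs
```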
15 Claims
1. A method for generating a photorealistic talking head, comprising:
receiving an input stimulus;
reading data from a first library comprising one or more parameters associated with mouth shape images of sequences of at least three concatenated phonemes which correspond to the input stimulus;
reading, based on the data read from the first library, corresponding data from a second library comprising images of a talking subject; and
generating, using the data read from the second library, an animated sequence of a talking head tracking the input stimulus.

2. The method of claim 1, further comprising:
reading acoustic data from the second library associated with the corresponding image data read from the second library;
converting the acoustic data into sound; and
outputting the sound in synchrony with the animated sequence of the talking head.
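Claims 1 and 2 together describe the runtime path: parameters for the input phoneme sequence are read from the first (coarticulation) library, matching images are read from the second (animation) library, and the associated sound is output in synchrony with the animation. A hedged sketch of that path, reusing the hypothetical Libraries structure above; the nearest-parameter matching rule is an illustrative assumption, not a claim limitation.

```python
# Sketch of the synthesis path in claims 1 and 2, using the hypothetical
# Libraries structure from the sketch above.
def synthesize(libs, phoneme_sequence):
    # Claim 1: read parameters for the phoneme sequence from the first
    # (coarticulation) library.
    params_seq, acoustic_data = libs.coarticulation[tuple(phoneme_sequence)]
    # Claim 1: read the corresponding images from the second (animation)
    # library -- here, the sample whose stored parameters are closest.
    def closest(p):
        return min(libs.animation,
                   key=lambda s: sum((a - b) ** 2 for a, b in zip(s.params, p)))
    frames = [closest(p).image for p in params_seq]
    # Claim 2: the acoustic data is converted to sound (claim 4 names a
    # data-to-voice converter) and output in synchrony with the frames.
    return frames, acoustic_data
```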
3. The method of claim 2, wherein the data read from the first library comprises one or more equations characterizing mouth shapes.
4. The method of claim 2, wherein said converting step is performed using a data-to-voice converter.
5. The method of claim 2, wherein the data read from the second library comprises segments of sampled images of a talking subject.
6. The method of claim 5, wherein said first library comprises a coarticulation library, and wherein said second library comprises an animation library.
7. The method of claim 5, wherein said generating step is performed by overlaying the segments onto a common interface to create frames comprising the animated sequence.
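Claim 7's overlaying step can be pictured as compositing a sampled mouth segment onto a shared base image to form each output frame. A brief sketch under that reading; the use of Pillow and the fixed paste position are illustrative assumptions.

```python
# One way to read claim 7: paste a sampled mouth segment onto a shared
# base-face image (the "common interface") to form each output frame.
from PIL import Image

def compose_frame(base_face: Image.Image, segment: Image.Image,
                  box: tuple) -> Image.Image:
    frame = base_face.copy()   # start from the common base image
    frame.paste(segment, box)  # overlay the sampled segment at its position
    return frame
```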
8. The method of claim 2, wherein the data read from the first library comprises mouth parameters characterizing degree of lip opening.
9. The method of claim 2, wherein said receiving, said generating, said converting, and all said reading steps are performed on a personal computer.
10. The method of claim 2, wherein said first and second libraries reside in a memory device on a computer.
11. The method of claim 1, wherein the data read from the first library comprises one or more equations characterizing mouth shapes.
12. A method for generating a photorealistic talking entity, comprising:
receiving an input stimulus;
reading first data from a library comprising one or more parameters associated with mouth shape images of sequences of two concatenated phonemes and images of commonly-used sequences of at least three concatenated phonemes which correspond to the input stimulus;
reading, based on the first data, corresponding second data comprising stored images; and
generating, using the second data, an animated sequence of a talking entity tracking the input stimulus.
13. A method for generating a photorealistic talking entity, comprising:
receiving an input stimulus;
reading, based on at least one diphone, first data comprising one or more parameters associated with mouth shape images of sequences of concatenated phonemes which correspond to the input stimulus, the first data stored in a library comprising images of sequences associated with diphones and the most common images associated with triphones;
reading, based on the first data, corresponding second data comprising stored images; and
generating, using the second data, an animated sequence of a talking entity tracking the input stimulus.
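Claims 12 and 13 describe a library that stores entries for all diphones but only for the most common triphones, which suggests a triphone-first lookup that falls back to overlapping diphones. A short sketch of that lookup order; the dictionary keying and the fallback rule are assumptions made for illustration.

```python
# Sketch of the lookup order suggested by claims 12 and 13: a commonly-used
# triphone is stored directly; otherwise the same three phonemes are covered
# by the two overlapping diphones.
def lookup(library: dict, phonemes):
    tri = tuple(phonemes[:3])
    if tri in library:   # common triphone stored directly
        return [library[tri]]
    # Fall back to the two overlapping diphones spanning the triphone.
    return [library[tuple(phonemes[0:2])], library[tuple(phonemes[1:3])]]
```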
Specification