Photo-realistic synthesis of three dimensional animation with facial features synchronized with speech
First Claim
1. A computer-implemented method for generating photo-realistic facial animation synchronized with speech, comprising:
storing, in a computer storage device, a statistical model of audiovisual data over time, based on acoustic feature vectors from actual audio data and visual feature vectors of lips images extracted from real sample images of a head and facial features of an individual during a set of utterances by the individual;
storing, in an image library, the real sample images of the individual's head and facial features during the set of utterances, including storing for each of the stored real sample images the visual feature vectors obtained from the lips image extracted from the real sample image as used to generate the statistical model;
receiving an input set of acoustic feature vectors for the speech with which the facial animation is to be synchronized;
using a computer processor, applying the received input set of acoustic feature vectors to the statistical model, the statistical model thereby generating a visual feature vector sequence;
selecting, using a computer processor, a sequence of real sample images of the individual's head and facial features from the image library, such that the selected sequence matches the visual feature vector sequence generated using the statistical model by comparing visual feature vectors in the visual feature vector sequence with visual feature vectors associated with the real sample images in the image library; and
using a computer processor, applying the selected sequence of real sample images to the three dimensional model of a head to provide the photo-realistic facial animation synchronized with the speech.
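The image-selection step above can be illustrated with a minimal sketch. The claim says only that visual feature vectors in the generated sequence are "compared" with those stored in the image library; as an assumption for illustration, the sketch below uses Euclidean nearest-neighbor matching, and the function name `select_image_sequence` and the toy 2-D feature vectors are hypothetical, not from the patent.

```python
import numpy as np

def select_image_sequence(generated_vv, library_vv):
    """For each generated visual feature vector, pick the index of the
    library image whose stored visual feature vector is closest in
    Euclidean distance. Indices stand in for the real sample images."""
    indices = []
    for v in generated_vv:
        distances = np.linalg.norm(library_vv - v, axis=1)
        indices.append(int(np.argmin(distances)))
    return indices

# Toy library of 4 images, each tagged with a 2-D visual feature vector
library = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Visual feature vector sequence generated by the statistical model
trajectory = np.array([[0.1, 0.0], [0.9, 0.1], [0.9, 0.9]])
print(select_image_sequence(trajectory, library))  # → [0, 1, 3]
```

Concatenating the images at the returned indices yields the selected sequence of real sample images that is then applied to the three dimensional head model.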
2 Assignments
0 Petitions
Abstract
Dynamic texture mapping is used to create a photorealistic three dimensional animation of an individual with facial features synchronized with desired speech. Audiovisual data of an individual reading a known script is obtained and stored in an audio library and an image library. The audiovisual data is processed to extract feature vectors used to train a statistical model. An input audio feature vector corresponding to desired speech with which the animation will be synchronized is provided. The statistical model is used to generate a trajectory of visual feature vectors that corresponds to the input audio feature vector. These visual feature vectors are used to identify a matching image sequence from the image library. The resulting sequence of images, concatenated from the image library, provides a photorealistic image sequence with facial features, such as lip movements, synchronized with the desired speech. This image sequence is applied to the three-dimensional model.
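The abstract describes training a statistical model on paired acoustic and visual feature vectors, then using it to generate a visual trajectory from input audio. The patent does not specify the model form here, so the sketch below uses a linear least-squares map purely as a stand-in; the array shapes, the synthetic training data, and `W_true` are all assumptions for illustration.

```python
import numpy as np

# Hypothetical paired training data: 3-D acoustic and 2-D visual feature vectors
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(50, 3))                     # from the audio library
W_true = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])
visual = acoustic @ W_true                              # from the image library

# "Train" the stand-in statistical model: least-squares fit visual = acoustic @ W
W, *_ = np.linalg.lstsq(acoustic, visual, rcond=None)

# Synthesis: map an input acoustic feature sequence to a visual trajectory
input_audio = rng.normal(size=(5, 3))
visual_trajectory = input_audio @ W
print(visual_trajectory.shape)  # → (5, 2)
```

The resulting trajectory of visual feature vectors is what drives the image-library lookup described in the abstract.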
34 Citations
20 Claims
1. A computer-implemented method for generating photo-realistic facial animation synchronized with speech, comprising:
storing, in a computer storage device, a statistical model of audiovisual data over time, based on acoustic feature vectors from actual audio data and visual feature vectors of lips images extracted from real sample images of a head and facial features of an individual during a set of utterances by the individual;
storing, in an image library, the real sample images of the individual's head and facial features during the set of utterances, including storing for each of the stored real sample images the visual feature vectors obtained from the lips image extracted from the real sample image as used to generate the statistical model;
receiving an input set of acoustic feature vectors for the speech with which the facial animation is to be synchronized;
using a computer processor, applying the received input set of acoustic feature vectors to the statistical model, the statistical model thereby generating a visual feature vector sequence;
selecting, using a computer processor, a sequence of real sample images of the individual's head and facial features from the image library, such that the selected sequence matches the visual feature vector sequence generated using the statistical model by comparing visual feature vectors in the visual feature vector sequence with visual feature vectors associated with the real sample images in the image library; and
using a computer processor, applying the selected sequence of real sample images to the three dimensional model of a head to provide the photo-realistic facial animation synchronized with the speech.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
9. A computer system for generating photo-realistic facial animation synchronized with speech, comprising:
-
a computer storage device storing a statistical model of audiovisual data over time, based on acoustic feature vectors from actual audio data and visual feature vectors of lips images extracted from real sample images of a head and facial features of an individual during a set of utterances by the individual;
an image library storing real sample images of the individual's head and facial features during the set of utterances, the image library further storing for each of the stored real sample images the visual feature vectors obtained from the lips image extracted from the real sample image as used to generate the statistical model;
a synthesis module having an input for receiving an input set of feature vectors for speech with which the facial animation is to be synchronized, and providing as an output a visual feature vector sequence corresponding to the input set of feature vectors according to the statistical model;
an image selection module having an input for receiving the visual feature vector sequence from the output of the synthesis module, and accessing the image library using the received visual feature vector sequence to generate an output providing a sequence of real sample images of the individual's head and facial features from the image library having visual feature vectors that match the visual feature vectors in the visual feature vector sequence received from the synthesis module by comparing visual feature vectors in the visual feature vector sequence with visual feature vectors associated with the real sample images in the image library; and
an animation module having an input for receiving a three dimensional model of a head and the sequence of real sample images from the image selection module, and an output providing the facial animation synchronized with the speech.
- View Dependent Claims (10, 11, 12, 13, 14, 15, 16)
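Claim 9 decomposes the system into a synthesis module, an image selection module, and an animation module. The sketch below mirrors that decomposition as a pipeline of three classes; the class names, the linear stand-in for the statistical model, the nearest-neighbor comparison, and the use of indices and a string label in place of real images and a head mesh are all assumptions for illustration, not the patent's implementation.

```python
import numpy as np

class SynthesisModule:
    """Maps an acoustic feature sequence to a visual feature sequence
    (a linear map stands in for the trained statistical model)."""
    def __init__(self, weights):
        self.weights = weights
    def run(self, acoustic_seq):
        return acoustic_seq @ self.weights

class ImageSelectionModule:
    """Matches each visual feature vector to the closest library entry."""
    def __init__(self, library_features):
        self.library_features = library_features
    def run(self, visual_seq):
        # Pairwise distances: (frames, library images)
        dists = np.linalg.norm(
            self.library_features[None, :, :] - visual_seq[:, None, :], axis=2)
        return dists.argmin(axis=1)

class AnimationModule:
    """Applies the selected images (here, indices) to a head model (here, a label)."""
    def __init__(self, head_model):
        self.head_model = head_model
    def run(self, image_indices):
        return [(self.head_model, int(i)) for i in image_indices]

synth = SynthesisModule(np.eye(2))
select = ImageSelectionModule(np.array([[0.0, 0.0], [1.0, 1.0]]))
animate = AnimationModule("head-mesh")
frames = animate.run(select.run(synth.run(np.array([[0.1, 0.2], [0.9, 0.8]]))))
print(frames)  # → [('head-mesh', 0), ('head-mesh', 1)]
```

Each module's output feeds the next module's input, matching the claim's chain from synthesis module to image selection module to animation module.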
17. A computer program product comprising:
a computer storage device comprising at least one of a memory device or storage device;
computer program instructions stored on the computer storage device that, when processed by a computing device, instruct the computing device to perform a method for generating photo-realistic facial animation synchronized with speech, comprising:
storing, in a computer storage device, a statistical model of audiovisual data over time, based on acoustic feature vectors from actual audio data and visual feature vectors of lips images extracted from real sample images of a head and facial features of an individual during a set of utterances by the individual;
storing, in an image library, real sample images of the individual's head and facial features during the set of utterances, the image library further storing for each of the stored real sample images the visual feature vectors obtained from the lips image extracted from the real sample image as used to generate the statistical model;
receiving an input set of acoustic feature vectors for the speech with which the facial animation is to be synchronized;
using a computer processor, applying the received input set of acoustic feature vectors to the statistical model, the statistical model thereby generating a visual feature vector sequence;
selecting, using a computer processor, a sequence of real sample images of the individual's head and facial features from the image library, such that the selected sequence matches the visual feature vector sequence generated using the statistical model by comparing visual feature vectors in the visual feature vector sequence with visual feature vectors associated with the real sample images in the image library; and
using a computer processor, applying the selected sequence of real sample images to the three dimensional model of a head to provide the photo-realistic facial animation synchronized with the speech.
- View Dependent Claims (18, 19, 20)
Specification