Audio-visual selection process for the synthesis of photo-realistic talking-head animations
Abstract
A system and method for generating photo-realistic talking-head animation from a text input utilizes an audio-visual unit selection process. Lip synchronization is obtained by optimally selecting and concatenating variable-length video units of the mouth area. The unit selection process uses the acoustic data to determine the target costs for the candidate images and the visual data to determine the concatenation costs. The image database is prepared in a hierarchical fashion, including high-level features (such as a full 3D modeling of the head and the geometric size and position of facial elements) and pixel-based, low-level features (such as a PCA-based metric for labeling the various feature bitmaps).
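For orientation, here is a runnable toy sketch of the selection loop the abstract describes. The feature choices (raw 8x8 pixel patches as visual features, one phoneme label per sample) and the greedy search are illustrative assumptions; the full Viterbi search is sketched under claim 11.

```python
# Toy sketch of the audio-visual unit-selection pipeline from the abstract.
# All data and feature choices here are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

# First database: image samples (toy 8x8 grayscale mouth crops).
images = rng.random((50, 8, 8))
# Second database: a visual feature vector per sample (here: raw pixels).
visual = images.reshape(50, -1)
# Third database: a non-visual characteristic per sample (phoneme label).
phonemes = rng.choice(list("aeiou"), size=50)

def candidates_for(target_phoneme, k=5):
    """Target matching: keep up to k samples whose recorded phoneme equals
    the target (falling back to the first k samples if none match)."""
    idx = np.flatnonzero(phonemes == target_phoneme)
    return idx[:k] if idx.size else np.arange(len(images))[:k]

def synthesize(target_phonemes):
    """Greedy stand-in for the Viterbi search (see claim 11): start with the
    first matching candidate, then always take the visually closest next one."""
    path = [candidates_for(target_phonemes[0])[0]]
    for ph in target_phonemes[1:]:
        cand = candidates_for(ph)
        dists = np.linalg.norm(visual[cand] - visual[path[-1]], axis=1)
        path.append(cand[int(np.argmin(dists))])
    return path  # indices of frames to concatenate into the animation

print(synthesize(list("aioue")))
```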
Claims
1. A method for the synthesis of photo-realistic animation of an object using a unit selection process, comprising the steps of:
a) creating a first database of image samples showing an object in a plurality of appearances;
b) creating a second database of visual features for each image sample of the object;
c) creating a third database of non-visual characteristics of the object in each image sample;
d) obtaining, for each frame in a plurality of N frames of an animation, a target feature vector comprised of the visual features and the non-visual characteristics;
e) for each frame in the plurality of N frames of the animation, selecting candidate image samples from the first database using a comparison of a combination of visual features from the second database and non-visual characteristics from the third database with the target feature vector; and
f) compiling the selected candidates to form a photo-realistic animation.
a) calculating a pose of the object as it appears on an image sample of the first database; and
b) reprojecting the object onto an intermediate image using a normalized pose.
4. The method as defined in claim 3 wherein the pose of the object is calculated using a set of at least four 3D object points and their corresponding image projections and applying standard pose estimation algorithms.
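Claim 4 leaves the choice of pose estimation algorithm open. Below is a minimal sketch using OpenCV's solvePnP as one such standard algorithm, with the six landmarks of claim 15; the 3D model coordinates, 2D projections, and camera intrinsics are invented for illustration.

```python
# Sketch of claim 4's "standard pose estimation" using OpenCV's solvePnP.
# OpenCV and all coordinates below are assumptions, not from the patent text.
import cv2
import numpy as np

# Generic 3D head-model coordinates (cm, head-centered), illustrative only.
model_points = np.array([
    [-4.5,  3.0, 0.0],   # outer corner, left eye
    [-1.5,  3.0, 0.5],   # inner corner, left eye
    [ 1.5,  3.0, 0.5],   # inner corner, right eye
    [ 4.5,  3.0, 0.0],   # outer corner, right eye
    [-1.0, -1.0, 1.5],   # left nostril
    [ 1.0, -1.0, 1.5],   # right nostril
], dtype=np.float64)

# Their 2D projections located in one database image (pixels, hypothetical).
image_points = np.array([
    [210.0, 180.0], [260.0, 182.0], [310.0, 182.0],
    [360.0, 180.0], [270.0, 260.0], [305.0, 260.0],
], dtype=np.float64)

# Simple pinhole intrinsics: fx = fy = 800, principal point at image center.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])

ok, rvec, tvec = cv2.solvePnP(model_points, image_points, K, None)
print(ok, rvec.ravel(), tvec.ravel())  # rotation (Rodrigues) and translation
```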
5. The method as defined in claim 3 wherein the step of reprojection further comprises:
a) projecting 3D quadrilaterals defining the overall shape of the object onto the image using the object's calculated pose, marking 2D quadrilateral boundaries;
b) projecting the same quadrilaterals onto an intermediate image using a standard pose, marking a second set of 2D quadrilaterals; and
c) performing a quadrilateral-to-quadrilateral mapping for each quadrilateral in the object from the image sample to the intermediate, normalized image.
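One way to realize step c) is a projective (homography) warp per quadrilateral. A sketch assuming OpenCV; the coordinates are placeholders:

```python
# Sketch of claim 5's quadrilateral-to-quadrilateral mapping via a per-quad
# projective warp. OpenCV and the example coordinates are assumptions.
import cv2
import numpy as np

def warp_quad(sample_img, quad_in_sample, quad_in_normalized, out_size):
    """Map one quadrilateral of the object from the image sample into the
    intermediate, normalized image (claim 5, step c)."""
    src = np.asarray(quad_in_sample, dtype=np.float32)       # from step a)
    dst = np.asarray(quad_in_normalized, dtype=np.float32)   # from step b)
    H = cv2.getPerspectiveTransform(src, dst)                # 4-point homography
    return cv2.warpPerspective(sample_img, H, out_size)

img = np.zeros((480, 640, 3), dtype=np.uint8)                # stand-in sample
quad_src = [(200, 150), (400, 160), (410, 300), (190, 290)]  # calculated pose
quad_dst = [(0, 0), (256, 0), (256, 256), (0, 256)]          # standard pose
normalized = warp_quad(img, quad_src, quad_dst, (256, 256))
```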
6. The method as defined in claim 2 wherein the visual features comprise the projections of the normalized sub-part image onto a subset of its principal components, the principal components being calculated from a set of available normalized sub-part images using principal component analysis (PCA).
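A minimal sketch of this feature extraction, assuming scikit-learn's PCA and flattened 16x16 sub-part crops (both assumptions beyond the claim text):

```python
# Sketch of claim 6: project each normalized sub-part image onto a subset of
# principal components learned from the whole set of such images.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the normalized sub-part images (e.g. mouth crops), flattened.
samples = rng.random((200, 16 * 16))

pca = PCA(n_components=12)             # keep a small subset of components
features = pca.fit_transform(samples)  # one 12-D visual feature per image
print(features.shape)                  # (200, 12)
```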
7. The method as defined in claim 2 wherein the visual features comprise a wavelet decomposition of the images: each image is transformed with a wavelet transform, and a subset of the wavelet coefficients is selected as feature vectors for the images.
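A sketch using PyWavelets; the Haar wavelet, the decomposition level, and keeping only the coarse approximation sub-band are illustrative choices:

```python
# Sketch of claim 7's wavelet features. Wavelet family, level, and which
# coefficients to keep are assumptions on top of the claim text.
import numpy as np
import pywt

rng = np.random.default_rng(0)
image = rng.random((64, 64))   # stand-in for a normalized image sample

coeffs = pywt.wavedec2(image, wavelet="haar", level=3)
cA = coeffs[0]                 # coarse approximation sub-band (8x8 here)
feature_vector = cA.ravel()    # keep a subset of the wavelet coefficients
print(feature_vector.shape)    # (64,)
```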
8. The method as defined in claim 2 wherein the visual features comprise projections onto a set of selected template images, where a pixel-by-pixel multiplication is calculated to generate coefficients representing feature vectors for the images.
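A sketch of the template projection; "pixel-by-pixel multiplication" is read here as an inner product between the image and each template, and the random templates are placeholders:

```python
# Sketch of claim 8: one coefficient per selected template image, computed
# as the sum over the elementwise (pixel-by-pixel) product.
import numpy as np

rng = np.random.default_rng(0)
templates = rng.random((10, 32, 32))   # selected template images (assumed)
image = rng.random((32, 32))           # a normalized image sample

feature_vector = np.einsum("thw,hw->t", templates, image)
print(feature_vector.shape)            # (10,)
```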
9. The method as defined in claim 6 wherein PCA is performed on subsampled and cropped images of the normalized image samples.
10. The method as defined in claim 6 wherein PCA is performed on luminance images of the normalized image samples.
11. The method as defined in claim 1 wherein selecting candidate image samples from the first database further comprises:
a) selecting, for each frame, a number of candidate image samples from the first database based on the target feature vector;
b) calculating, for each pair of candidates of two consecutive frames, a concatenation cost from a combination of visual features from the second database and object characteristics from the third database; and
c) performing a Viterbi search to find the least expensive path through the candidates, accumulating target costs and concatenation costs.
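A sketch of the search in steps a) through c), using the Euclidean concatenation cost of claim 12; candidate counts, feature dimensions, and target costs are toy placeholders:

```python
# Sketch of claim 11's Viterbi search over per-frame candidate lists, with
# the Euclidean distance in visual feature space (claim 12) as the
# concatenation cost. All numbers are toy placeholders.
import numpy as np

def viterbi(candidate_features, target_costs):
    """candidate_features[t]: (k, d) visual features of frame t's candidates.
    target_costs[t]: (k,) target cost of each candidate. Returns the
    least-expensive candidate index per frame."""
    n = len(candidate_features)
    best = target_costs[0].copy()   # accumulated cost per candidate
    back = []                       # backpointers, one array per transition
    for t in range(1, n):
        # Concatenation cost matrix between consecutive frames' candidates.
        cc = np.linalg.norm(
            candidate_features[t][None, :, :] -
            candidate_features[t - 1][:, None, :], axis=2)
        total = best[:, None] + cc + target_costs[t][None, :]
        back.append(np.argmin(total, axis=0))
        best = np.min(total, axis=0)
    # Trace back the cheapest path through the candidate lattice.
    path = [int(np.argmin(best))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
feats = [rng.random((5, 12)) for _ in range(8)]   # 8 frames, 5 candidates
tcosts = [rng.random(5) for _ in range(8)]
print(viterbi(feats, tcosts))
```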
12. The method as defined in claim 11, wherein the concatenation cost is given by the Euclidean distance in the space of visual features between two candidates.
13. The method as defined in claim 12 wherein an additional concatenation cost g is calculated from the respective recording timestamps of the image samples u1, u2, using a formula in which 0 < w1 < w2 < . . . < wp, seq(u) = recorded_sequence_number, and fr(u) = recorded_frame_number.
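The formula for g itself is not reproduced in the source text. The sketch below is one plausible reading, labeled as an assumption: zero cost when two units were recorded as consecutive frames of the same sequence, and increasing weights w1 < w2 < . . . < wp as the frame skip grows or the sequence changes.

```python
# Assumed shape of claim 13's cost g (the claim's actual formula is not
# reproduced in the source): favor units that were recorded consecutively.
def concat_time_cost(u1, u2, w=(0.2, 0.5, 1.0)):
    """u1, u2: dicts with 'seq' = recorded_sequence_number and
    'fr' = recorded_frame_number."""
    if u1["seq"] == u2["seq"] and u2["fr"] == u1["fr"] + 1:
        return 0.0                        # originally consecutive frames
    skip = u2["fr"] - u1["fr"]
    if u1["seq"] == u2["seq"] and 0 < skip <= len(w):
        return w[skip - 1]                # small skip: weights w1..wp
    return w[-1]                          # different sequence: maximum wp

print(concat_time_cost({"seq": 3, "fr": 10}, {"seq": 3, "fr": 11}))  # 0.0
```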
14. The method as defined in claim 1 wherein the animation is a talking-head animation, the first database stores sample images of a face that speaks, the second database stores associated facial visual features, and the third database stores acoustic information for each frame in the form of phonemes.
15. The method as defined in claim 4 wherein the pose of the object is calculated using the positions of the inner and outer corners of the left and right eyes and the two nostrils.
16. The method as defined in claim 14 wherein visual features are extracted from normalized images of the mouth area, including the lips, chin, and cheeks.
17. The method as defined in claim 16 wherein the extracted visual features comprise projections onto a set of principal components calculated using principal component analysis on a database of normalized mouth samples.
18. The method as defined in claim 16 wherein the extracted visual features comprise the shape and position of the outer and inner lip contours, of the upper and lower teeth, and of the tongue.
19. The method as defined in claim 11, wherein the target cost is calculated by the following steps:
a) defining a phonetic context by including in the cost calculation nl frames left of the current frame and nr frames right of it;
b) obtaining a target phonetic vector for each frame t, the target feature vector described as T(t) = {ph_{t−nl}, ph_{t−nl+1}, . . . , ph_{t−1}, ph_t, ph_{t+1}, . . . , ph_{t+nr−1}, ph_{t+nr}}, where ph_i is the phoneme being articulated at frame i;
c) defining a weight vector W(t) = {w_{t−nl}, w_{t−nl+1}, . . . , w_{t−1}, w_t, w_{t+1}, . . . , w_{t+nr−1}, w_{t+nr}};
d) defining a phoneme distance matrix M[p1, p2] that gives the distance between two phonemes;
e) getting a candidate's phonetic vector from the third database, U(u) = {ph_{t−nl}, ph_{t−nl+1}, . . . , ph_{t−1}, ph_t, ph_{t+1}, . . . , ph_{t+nr−1}, ph_{t+nr}}; and
f) computing the target cost TC using the following:
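The formula of step f) is not reproduced in the source. Given the definitions in steps a) through e), the natural reading (an assumption, not the claim's verbatim formula) is a weighted sum of phoneme distances over the context window, shown here together with claim 20's exponential weights:

```python
# Assumed form of claim 19's target cost: TC = sum over the context window
# of w_i * M[T_i, U_i], with claim 20's weights w_i = exp(-alpha * |t - i|).
import math

def target_cost(T, U, M, t, nl, nr, alpha=0.5):
    """T, U: dicts mapping frame index -> phoneme; M: dict mapping a phoneme
    pair -> distance. Sums w_i * M[(T_i, U_i)] for i in [t - nl, t + nr]."""
    tc = 0.0
    for i in range(t - nl, t + nr + 1):
        w_i = math.exp(-alpha * abs(t - i))   # claim 20's weight
        tc += w_i * M[(T[i], U[i])]
    return tc

# Toy usage: a two-phoneme alphabet with a symmetric distance table.
M = {("a", "a"): 0.0, ("a", "o"): 1.0, ("o", "a"): 1.0, ("o", "o"): 0.0}
T = {0: "a", 1: "a", 2: "o"}   # target phonetic vector around frame t = 1
U = {0: "a", 1: "o", 2: "o"}   # candidate's phonetic vector
print(target_cost(T, U, M, t=1, nl=1, nr=1))  # 1.0 (mismatch at i = t)
```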
20. The method as defined in claim 19, wherein elements of the weight vector are calculated using the following equation: w_i = e^{−α|t−i|}.
21. The method as defined in claim 19, wherein the phoneme distance matrix M is populated using the similarity between the phonemes' visemic representations.
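A sketch of populating M from visemic similarity; the viseme grouping below is a rough illustration, not the patent's actual mapping:

```python
# Sketch of claim 21: phonemes that map to the same viseme look nearly
# identical on the lips, so they get a small mutual distance. The grouping
# here is an invented example.
VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "o": "rounded", "u": "rounded",
    "a": "open", "e": "spread", "i": "spread",
}

def phoneme_distance(p1, p2):
    """0 for identical phonemes, small for phonemes sharing a viseme,
    1 otherwise."""
    if p1 == p2:
        return 0.0
    return 0.2 if VISEME[p1] == VISEME[p2] else 1.0

M = {(p1, p2): phoneme_distance(p1, p2) for p1 in VISEME for p2 in VISEME}
print(M[("p", "b")], M[("p", "a")])  # 0.2 1.0
```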
Specification