Trainable videorealistic speech animation
Abstract
A method and apparatus for videorealistic speech animation is disclosed. A human subject is recorded using a video camera as he/she utters a predetermined speech corpus. After processing the corpus automatically, a visual speech model is learned from the data that is capable of synthesizing the human subject's mouth uttering entirely novel utterances that were not recorded in the original video. The synthesized utterance is re-composited onto a background sequence which contains natural head and eye movement. The final output is videorealistic in the sense that it looks like a video camera recording of the subject. The two key components of this invention are 1) a multidimensional morphable model (MMM) to synthesize new, previously unseen mouth configurations from a small set of mouth image prototypes; and 2) a trajectory synthesis technique based on regularization, which is automatically trained from the recorded video corpus, and which is capable of synthesizing trajectories in MMM space corresponding to any desired utterance.
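The MMM described above combines a small set of mouth prototypes along two axes: shape (flow-based geometry, weighted by α) and appearance (pixel texture, weighted by β). The following is a minimal numpy sketch of that idea, assuming per-prototype optical flow fields defined relative to a reference prototype and using a simplified backward warp in place of the patent's actual warping and hole-filling machinery; all function names are illustrative, not from the patent.

```python
import numpy as np

def bilinear_sample(img, ys, xs):
    """Sample a grayscale image at real-valued (ys, xs) coordinates."""
    h, w = img.shape
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    dy = np.clip(ys - y0, 0.0, 1.0)
    dx = np.clip(xs - x0, 0.0, 1.0)
    return ((1 - dy) * (1 - dx) * img[y0, x0]
            + (1 - dy) * dx * img[y0, x0 + 1]
            + dy * (1 - dx) * img[y0 + 1, x0]
            + dy * dx * img[y0 + 1, x0 + 1])

def synthesize_morph(prototypes, flows, alpha, beta):
    """Morph mouth prototypes: alpha weights the flow fields (shape),
    beta weights the warped pixel values (appearance).
    prototypes: list of HxW images; flows[i]: (2, H, W) flow from a
    reference prototype to prototype i (flows[0] is all zeros)."""
    h, w = prototypes[0].shape
    # Target geometry: convex combination of the prototype flows.
    target_flow = sum(a * f for a, f in zip(alpha, flows))
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    out = np.zeros((h, w))
    for b, img, f in zip(beta, prototypes, flows):
        # Crude backward warp of prototype i toward the target shape.
        ys = yy + f[0] - target_flow[0]
        xs = xx + f[1] - target_flow[1]
        out += b * bilinear_sample(img, ys, xs)
    return out
```

With one-hot weights on a prototype whose flow is zero, the sketch reproduces that prototype exactly; with equal appearance weights over prototypes of identical geometry, it returns the pixel average.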
20 Claims
1. Computer method for generating a videorealistic audio-visual animation of a subject comprising the computer implemented steps of:
(a) receiving video recorded data of a subject including sound data of the subject enunciating certain speech and image data of the subject posing certain facial configurations;
(b) forming a multidimensional model based on the video recorded data, said model synthesizing at least images of new facial configurations, wherein for synthesizing images of new facial configurations, the model employs morphing according to mouth parameters;
(c) for a given target speech stream or a given target image stream, mapping from the target stream to a trajectory of parameters in the model, the trajectory of parameters enabling the model to generate at least one of (1) synthesized images of new facial configurations of the subject for the target stream and (2) synthesized speech sound of new speech of the subject for the target stream, wherein for synthesizing images of new facial configurations, the model uses the trajectory of parameters to smoothly morph between a subset of images of the certain facial configurations of the video recorded data; and
(d) rendering video images and speech sound from the synthesized images of new facial configurations of the subject and the target stream, such that a videorealistic audio-visual animation of the subject is produced.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
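Step (c)'s mapping from a target stream to a smooth parameter trajectory can be illustrated with a regularized least-squares sketch: each frame pulls the trajectory toward its phoneme target (a mean and variance), while a smoothness term penalizes frame-to-frame jumps. This is a simplified quadratic stand-in for the patent's trajectory synthesis, with hypothetical names and a single regularization weight `lam`; it is not the claimed method itself.

```python
import numpy as np

def synthesize_trajectory(mu, var, lam=1.0):
    """Solve min_y sum_t (y_t - mu_t)^2 / var_t + lam * sum_t (y_{t+1} - y_t)^2.
    mu, var: (T, D) per-frame target means and diagonal variances.
    A quadratic objective like this has a closed-form linear-system solution."""
    T, D = mu.shape
    # First-difference smoothness penalty expressed as a path-graph Laplacian.
    L = np.zeros((T, T))
    for t in range(T - 1):
        L[t, t] += 1.0
        L[t + 1, t + 1] += 1.0
        L[t, t + 1] -= 1.0
        L[t + 1, t] -= 1.0
    y = np.empty_like(mu)
    for d in range(D):  # diagonal covariances decouple the dimensions
        W = np.diag(1.0 / var[:, d])
        y[:, d] = np.linalg.solve(W + lam * L, W @ mu[:, d])
    return y
```

With `lam=0` the trajectory passes through every target; as `lam` grows it flattens toward the per-dimension mean, trading target fidelity for smoothness.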
8. Computer apparatus for generating videorealistic audio-visual animation of a subject comprising:
(a) a member for receiving video recorded data of a subject including sound data of the subject enunciating certain speech and image data of the subject posing certain facial configurations;
(b) a model building mechanism for forming a multidimensional model based on the video recorded data, said model for synthesizing at least images of new facial configurations, wherein for synthesizing images of new facial configurations, the model employs morphing according to mouth parameters; and
(c) a synthesis module responsive to a given target speech stream or a given target image stream, the synthesis module mapping from the target stream to a trajectory of parameters in the formed model, the trajectory of parameters enabling the model to generate at least one of (1) synthesized images of new facial configurations of the subject for the target stream and (2) synthesized speech sound of new speech of the subject for the target stream, wherein for synthesizing images of new facial configurations, the model uses the trajectory of parameters to smoothly morph between a subset of images of the certain facial configurations of the video recorded data, the synthesis module rendering audio-video images and speech sound from the synthesized images of new facial configurations of the subject and the target stream, such that a videorealistic audio-visual animation of the subject is produced.
- View Dependent Claims (9, 10, 11, 12, 13, 14)
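The synthesis module's first job, mapping a target speech stream into frame-level model targets, can be sketched as a simple expansion: each phoneme in the stream becomes a run of frames carrying that phoneme's learned Gaussian parameters, which a trajectory synthesizer can then smooth. The `(phoneme, duration)` stream format and all names below are assumptions for illustration, not from the patent.

```python
def phoneme_stream_to_targets(stream, clusters, fps=30):
    """Expand a target speech stream into per-frame Gaussian targets.
    stream: list of (phoneme, duration_seconds) pairs;
    clusters: {phoneme: (mu, sigma)} learned from the recorded corpus.
    Returns parallel lists of per-frame means and covariances."""
    mus, sigmas = [], []
    for phoneme, duration in stream:
        mu, sigma = clusters[phoneme]
        n_frames = max(1, round(duration * fps))
        mus.extend([mu] * n_frames)
        sigmas.extend([sigma] * n_frames)
    return mus, sigmas
```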
15. A computer system for generating audio-visual animation of a subject, comprising:
a source of video recorded data of a subject including sound data of the subject enunciating certain speech and image data of the subject posing certain facial configurations;
multidimensional modeling means based on the video recorded data, said modeling means synthesizing at least images of new facial configurations, wherein for synthesizing images of new facial configurations, the modeling means employs morphing according to mouth parameters;
a synthesis module responsive to a given target stream and, using the modeling means, mapping from the target stream to a trajectory of parameters in the modeling means, the trajectory of parameters enabling the modeling means to generate at least one of (1) synthesized images of new facial configurations of the subject for the target stream and (2) synthesized speech sound of new speech of the subject for the target stream, wherein for synthesizing images of new facial configurations, the modeling means uses the trajectory of parameters to smoothly morph between a subset of images of the certain facial configurations of the video recorded data, and therefrom the synthesis module rendering video images and speech sound from the synthesized images of new facial configurations of the subject and the target stream, such that a videorealistic audio-visual animation of the subject is produced.
- View Dependent Claims (16, 17, 18)
19. Computer method for generating a videorealistic audio-visual animation of a subject comprising the computer implemented steps of:
(a) receiving video recorded data of a subject including sound data of the subject enunciating certain speech and image data of the subject posing certain facial configurations;
(b) forming a multidimensional model based on the video recorded data, said model synthesizing at least images of new facial configurations, wherein for synthesizing images of new facial configurations, the model employs morphing according to mouth parameters;
(c) for a given target speech stream or a given target image stream, mapping from the target stream to a trajectory of parameters in the model, the trajectory of parameters enabling the model to generate at least one of (1) synthesized images of new facial configurations of the subject for the target stream and (2) synthesized speech sound of new speech of the subject for the target stream, wherein for synthesizing images of new facial configurations, the model uses the trajectory of parameters to smoothly morph between a subset of images of the certain facial configurations of the video recorded data;
(d) rendering video images and speech sound from the synthesized images of new facial configurations of the subject and the target stream, such that a videorealistic audio-visual animation of the subject is produced;
the model employing morphing according to mouth parameters includes a mouth shape parameter α and a mouth appearance parameter β; and
the step of forming the multidimensional model includes:
for each image of a subset of images from the image data of the video recorded data, computing optical flow vectors C from the image to every other image in the subset;
computing a mouth shape parameter α value and a mouth appearance parameter β value for each image of the subset; and
based on the computed mouth shape parameter α values and computed mouth appearance parameter β values, forming Gaussian clusters for respective phonemes, each Gaussian cluster having a mean μ and diagonal covariance Σ for mathematically representing the respective phoneme, such that (i) given a target set of α, β values, the model produces a morph image of the subject having a mouth with a mouth shape and mouth appearance configuration corresponding to the target α, β values, and (ii) given a target mouth image, the model computes α, β values that represent the target mouth image with respect to the images in the subset.
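The Gaussian clustering step of claim 19 — a mean μ and diagonal covariance Σ per phoneme over the computed (α, β) values — amounts to per-class moment estimation. A minimal sketch, assuming frames have already been labeled with phonemes; the function name is illustrative:

```python
import numpy as np

def fit_phoneme_gaussians(params, labels):
    """params: (N, D) array of per-frame (alpha, beta) parameter vectors;
    labels: length-N list of phoneme labels, one per frame.
    Returns {phoneme: (mean mu, diagonal covariance Sigma)}."""
    labels = np.asarray(labels)
    clusters = {}
    for ph in np.unique(labels):
        x = params[labels == ph]
        mu = x.mean(axis=0)
        sigma = np.diag(x.var(axis=0))  # diagonal covariance only
        clusters[ph] = (mu, sigma)
    return clusters
```

Keeping Σ diagonal, as the claim specifies, means each parameter dimension is modeled independently within a phoneme, which keeps the per-phoneme representation compact and cheap to invert during trajectory synthesis.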
20. Computer apparatus for generating videorealistic audio-visual animation of a subject comprising:
(a) a member for receiving video recorded data of a subject including sound data of the subject enunciating certain speech and image data of the subject posing certain facial configurations;
(b) a model building mechanism for forming a multidimensional model based on the video recorded data, said model for synthesizing at least images of new facial configurations, wherein for synthesizing images of new facial configurations, the model employs morphing according to mouth parameters;
(c) a synthesis module responsive to a given target speech stream or a given target image stream, the synthesis module mapping from the target stream to a trajectory of parameters in the formed model, the trajectory of parameters enabling the model to generate at least one of (1) synthesized images of new facial configurations of the subject for the target stream and (2) synthesized speech sound of new speech of the subject for the target stream, wherein for synthesizing images of new facial configurations, the model uses the trajectory of parameters to smoothly morph between a subset of images of the certain facial configurations of the video recorded data, the synthesis module rendering audio-video images and speech sound from the synthesized images of new facial configurations of the subject and the target stream, such that a videorealistic audio-visual animation of the subject is produced;
the mouth parameters include a mouth shape parameter α and a mouth appearance parameter β; and
the model building mechanism forms the multidimensional model by:
for each image of a subset of images from the image data of the video recorded data, computing optical flow vectors C from the image to every other image in the subset;
computing a mouth shape parameter value α and a mouth appearance parameter value β for each image of the subset; and
based on the computed mouth shape parameter α values and the computed mouth appearance parameter β values, forming Gaussian clusters for respective phonemes, each Gaussian cluster having a mean μ and diagonal covariance Σ for mathematically representing the respective phoneme, such that (i) given a target set of α, β values, the model produces a morph image of the subject having a mouth with a mouth shape and mouth appearance configuration corresponding to the target α, β values, and (ii) given a target mouth image, the model computes α, β values that represent the target mouth image with respect to the images in the subset.
Specification