Trainable videorealistic speech animation
Abstract
A method and apparatus for videorealistic speech animation is disclosed. A human subject is recorded using a video camera as he/she utters a predetermined speech corpus. After processing the corpus automatically, a visual speech model is learned from the data that is capable of synthesizing the human subject's mouth uttering entirely novel utterances that were not recorded in the original video. The synthesized utterance is re-composited onto a background sequence which contains natural head and eye movement. The final output is videorealistic in the sense that it looks like a video camera recording of the subject. The two key components of this invention are 1) a multidimensional morphable model (MMM) to synthesize new, previously unseen mouth configurations from a small set of mouth image prototypes; and 2) a trajectory synthesis technique based on regularization, which is automatically trained from the recorded video corpus, and which is capable of synthesizing trajectories in MMM space corresponding to any desired utterance.
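The MMM described above combines a small set of mouth prototypes along two axes: shape (flow-based geometry, weighted by α) and appearance (pixel texture, weighted by β). The following is a minimal numpy sketch of that idea, assuming per-prototype optical flow fields defined relative to a reference prototype and using a simplified backward warp in place of the patent's actual warping and hole-filling machinery; all function names are illustrative, not from the patent.

```python
import numpy as np

def bilinear_sample(img, ys, xs):
    """Sample a grayscale image at real-valued (ys, xs) coordinates."""
    h, w = img.shape
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    dy = np.clip(ys - y0, 0.0, 1.0)
    dx = np.clip(xs - x0, 0.0, 1.0)
    return ((1 - dy) * (1 - dx) * img[y0, x0]
            + (1 - dy) * dx * img[y0, x0 + 1]
            + dy * (1 - dx) * img[y0 + 1, x0]
            + dy * dx * img[y0 + 1, x0 + 1])

def synthesize_morph(prototypes, flows, alpha, beta):
    """Morph mouth prototypes: alpha weights the flow fields (shape),
    beta weights the warped pixel values (appearance).
    prototypes: list of HxW images; flows[i]: (2, H, W) flow from a
    reference prototype to prototype i (flows[0] is all zeros)."""
    h, w = prototypes[0].shape
    # Target geometry: convex combination of the prototype flows.
    target_flow = sum(a * f for a, f in zip(alpha, flows))
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    out = np.zeros((h, w))
    for b, img, f in zip(beta, prototypes, flows):
        # Crude backward warp of prototype i toward the target shape.
        ys = yy + f[0] - target_flow[0]
        xs = xx + f[1] - target_flow[1]
        out += b * bilinear_sample(img, ys, xs)
    return out
```

With one-hot weights on a prototype whose flow is zero, the sketch reproduces that prototype exactly; with equal appearance weights over prototypes of identical geometry, it returns the pixel average.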
20 Claims
1. Computer method for generating a videorealistic audio-visual animation of a subject comprising the computer implemented steps of:
(a) receiving video recorded data of a subject including sound data of the subject enunciating certain speech and image data of the subject posing certain facial configurations;
(b) forming a multidimensional model based on the video recorded data, said model synthesizing at least images of new facial configurations, wherein for synthesizing images of new facial configurations, the model employs morphing according to mouth parameters;
(c) for a given target speech stream or a given target image stream, mapping from the target stream to a trajectory of parameters in the model, the trajectory of parameters enabling the model to generate at least one of (1) synthesized images of new facial configurations of the subject for the target stream and (2) synthesized speech sound of new speech of the subject for the target stream, wherein for synthesizing images of new facial configurations, the model uses the trajectory of parameters to smoothly morph between a subset of images of the certain facial configurations of the video recorded data; and
(d) rendering video images and speech sound from the synthesized images of new facial configurations of the subject and the target stream, such that a videorealistic audio-visual animation of the subject is produced.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
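Step (c)'s mapping from a target stream to a smooth parameter trajectory can be illustrated with a regularized least-squares sketch: each frame pulls the trajectory toward its phoneme target (a mean and variance), while a smoothness term penalizes frame-to-frame jumps. This is a simplified quadratic stand-in for the patent's trajectory synthesis, with hypothetical names and a single regularization weight `lam`; it is not the claimed method itself.

```python
import numpy as np

def synthesize_trajectory(mu, var, lam=1.0):
    """Solve min_y sum_t (y_t - mu_t)^2 / var_t + lam * sum_t (y_{t+1} - y_t)^2.
    mu, var: (T, D) per-frame target means and diagonal variances.
    A quadratic objective like this has a closed-form linear-system solution."""
    T, D = mu.shape
    # First-difference smoothness penalty expressed as a path-graph Laplacian.
    L = np.zeros((T, T))
    for t in range(T - 1):
        L[t, t] += 1.0
        L[t + 1, t + 1] += 1.0
        L[t, t + 1] -= 1.0
        L[t + 1, t] -= 1.0
    y = np.empty_like(mu)
    for d in range(D):  # diagonal covariances decouple the dimensions
        W = np.diag(1.0 / var[:, d])
        y[:, d] = np.linalg.solve(W + lam * L, W @ mu[:, d])
    return y
```

With `lam=0` the trajectory passes through every target; as `lam` grows it flattens toward the per-dimension mean, trading target fidelity for smoothness.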
8. Computer apparatus for generating videorealistic audio-visual animation of a subject comprising:
(a) a member for receiving video recorded data of a subject including sound data of the subject enunciating certain speech and image data of the subject posing certain facial configurations;
(b) a model building mechanism for forming a multidimensional model based on the video recorded data, said model for synthesizing at least images of new facial configurations, wherein for synthesizing images of new facial configurations, the model employs morphing according to mouth parameters; and
(c) a synthesis module responsive to a given target speech stream or a given target image stream, the synthesis module mapping from the target stream to a trajectory of parameters in the formed model, the trajectory of parameters enabling the model to generate at least one of (1) synthesized images of new facial configurations of the subject for the target stream and (2) synthesized speech sound of new speech of the subject for the target stream, wherein for synthesizing images of new facial configurations, the model uses the trajectory of parameters to smoothly morph between a subset of images of the certain facial configurations of the video recorded data, the synthesis module rendering audio-video images and speech sound from the synthesized images of new facial configurations of the subject and the target stream, such that a videorealistic audio-visual animation of the subject is produced.
- View Dependent Claims (9, 10, 11, 12, 13, 14)
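The synthesis module's first job, mapping a target speech stream into frame-level model targets, can be sketched as a simple expansion: each phoneme in the stream becomes a run of frames carrying that phoneme's learned Gaussian parameters, which a trajectory synthesizer can then smooth. The `(phoneme, duration)` stream format and all names below are assumptions for illustration, not from the patent.

```python
def phoneme_stream_to_targets(stream, clusters, fps=30):
    """Expand a target speech stream into per-frame Gaussian targets.
    stream: list of (phoneme, duration_seconds) pairs;
    clusters: {phoneme: (mu, sigma)} learned from the recorded corpus.
    Returns parallel lists of per-frame means and covariances."""
    mus, sigmas = [], []
    for phoneme, duration in stream:
        mu, sigma = clusters[phoneme]
        n_frames = max(1, round(duration * fps))
        mus.extend([mu] * n_frames)
        sigmas.extend([sigma] * n_frames)
    return mus, sigmas
```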
15. A computer system for generating audio-visual animation of a subject, comprising:
a source of video recorded data of a subject including sound data of the subject enunciating certain speech and image data of the subject posing certain facial configurations;
multidimensional modeling means based on the video recorded data, said modeling means synthesizing at least images of new facial configurations, wherein for synthesizing images of new facial configurations, the modeling means employs morphing according to mouth parameters;
a synthesis module responsive to a given target stream and, using the modeling means, mapping from the target stream to a trajectory of parameters in the modeling means, the trajectory of parameters enabling the modeling means to generate at least one of (1) synthesized images of new facial configurations of the subject for the target stream and (2) synthesized speech sound of new speech of the subject for the target stream, wherein for synthesizing images of new facial configurations, the modeling means uses the trajectory of parameters to smoothly morph between a subset of images of the certain facial configurations of the video recorded data, and therefrom the synthesis module rendering video images and speech sound from the synthesized images of new facial configurations of the subject and the target stream, such that a videorealistic audio-visual animation of the subject is produced.
- View Dependent Claims (16, 17, 18)
19. Computer method for generating a videorealistic audio-visual animation of a subject comprising the computer implemented steps of:
(a) receiving video recorded data of a subject including sound data of the subject enunciating certain speech and image data of the subject posing certain facial configurations;
(b) forming a multidimensional model based on the video recorded data, said model synthesizing at least images of new facial configurations, wherein for synthesizing images of new facial configurations, the model employs morphing according to mouth parameters;
(c) for a given target speech stream or a given target image stream, mapping from the target stream to a trajectory of parameters in the model, the trajectory of parameters enabling the model to generate at least one of (1) synthesized images of new facial configurations of the subject for the target stream and (2) synthesized speech sound of new speech of the subject for the target stream, wherein for synthesizing images of new facial configurations, the model uses the trajectory of parameters to smoothly morph between a subset of images of the certain facial configurations of the video recorded data;
(d) rendering video images and speech sound from the synthesized images of new facial configurations of the subject and the target stream, such that a videorealistic audio-visual animation of the subject is produced;
the model employing morphing according to mouth parameters includes a mouth shape parameter α and a mouth appearance parameter β; and
the step of forming the multidimensional model includes:
for each image of a subset of images from the image data of the video recorded data, computing optical flow vectors C from the image to every other image in the subset;
computing a mouth shape parameter α value and a mouth appearance parameter β value for each image of the subset; and
based on the computed mouth shape parameter α values and computed mouth appearance parameter β values, forming Gaussian clusters for respective phonemes, each Gaussian cluster having a mean μ and diagonal covariance Σ for mathematically representing the respective phoneme, such that (i) given a target set of α, β values, the model produces a morph image of the subject having a mouth with a mouth shape and mouth appearance configuration corresponding to the target α, β values, and (ii) given a target mouth image, the model computes α, β values that represent the target mouth image with respect to the images in the subset.
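The Gaussian clustering step of claim 19 — a mean μ and diagonal covariance Σ per phoneme over the computed (α, β) values — amounts to per-class moment estimation. A minimal sketch, assuming frames have already been labeled with phonemes; the function name is illustrative:

```python
import numpy as np

def fit_phoneme_gaussians(params, labels):
    """params: (N, D) array of per-frame (alpha, beta) parameter vectors;
    labels: length-N list of phoneme labels, one per frame.
    Returns {phoneme: (mean mu, diagonal covariance Sigma)}."""
    labels = np.asarray(labels)
    clusters = {}
    for ph in np.unique(labels):
        x = params[labels == ph]
        mu = x.mean(axis=0)
        sigma = np.diag(x.var(axis=0))  # diagonal covariance only
        clusters[ph] = (mu, sigma)
    return clusters
```

Keeping Σ diagonal, as the claim specifies, means each parameter dimension is modeled independently within a phoneme, which keeps the per-phoneme representation compact and cheap to invert during trajectory synthesis.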
20. Computer apparatus for generating videorealistic audio-visual animation of a subject comprising:
(a) a member for receiving video recorded data of a subject including sound data of the subject enunciating certain speech and image data of the subject posing certain facial configurations;
(b) a model building mechanism for forming a multidimensional model based on the video recorded data, said model for synthesizing at least images of new facial configurations, wherein for synthesizing images of new facial configurations, the model employs morphing according to mouth parameters;
(c) a synthesis module responsive to a given target speech stream or a given target image stream, the synthesis module mapping from the target stream to a trajectory of parameters in the formed model, the trajectory of parameters enabling the model to generate at least one of (1) synthesized images of new facial configurations of the subject for the target stream and (2) synthesized speech sound of new speech of the subject for the target stream, wherein for synthesizing images of new facial configurations, the model uses the trajectory of parameters to smoothly morph between a subset of images of the certain facial configurations of the video recorded data, the synthesis module rendering audio-video images and speech sound from the synthesized images of new facial configurations of the subject and the target stream, such that a videorealistic audio-visual animation of the subject is produced;
the mouth parameters include a mouth shape parameter α and a mouth appearance parameter β; and
the model building mechanism forms the multidimensional model by:
for each image of a subset of images from the image data of the video recorded data, computing optical flow vectors C from the image to every other image in the subset;
computing a mouth shape parameter value α and a mouth appearance parameter value β for each image of the subset; and
based on the computed mouth shape parameter α values and the computed mouth appearance parameter β values, forming Gaussian clusters for respective phonemes, each Gaussian cluster having a mean μ and diagonal covariance Σ for mathematically representing the respective phoneme, such that (i) given a target set of α, β values, the model produces a morph image of the subject having a mouth with a mouth shape and mouth appearance configuration corresponding to the target α, β values, and (ii) given a target mouth image, the model computes α, β values that represent the target mouth image with respect to the images in the subset.
Specification