System and method for real time lip synchronization
Abstract
A novel method for synchronizing the lips of a sketched face to an input voice. The lip synchronization system and method re-use training video as much as possible when the input voice is similar to the training voice sequences. Face sequences are first clustered from video segments; then, using sub-sequence Hidden Markov Models, a correlation between speech signals and face shape sequences is built. This re-use of video reduces the discontinuity between consecutive output faces and yields accurate, realistic synthesized animations. The lip synchronization system and method can synthesize faces from input audio in real time without noticeable delay. Since acoustic feature data computed from the audio drives the system directly, without considering its phonemic representation, the method can adapt to any kind of voice, language, or sound.
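The abstract's core idea, re-using training-video frames when the input audio resembles the training audio while penalizing jumps between consecutive output frames, can be sketched minimally as follows. All names, the 1-D stand-in acoustic feature, and the 0.1 continuity weight are illustrative assumptions, not details from the patent.

```python
def pick_frame(audio_feat, prev_frame_idx, training):
    """Choose a training-video frame for one acoustic feature.

    training: list of (acoustic_feature, frame_index) pairs taken from the
    training video. The frame whose training audio is closest to audio_feat
    wins, with a small penalty for discontinuity relative to the frame that
    would smoothly follow the previously emitted one.
    """
    def cost(item):
        feat, idx = item
        audio_dist = abs(feat - audio_feat)     # 1-D stand-in for an acoustic distance
        jump = abs(idx - (prev_frame_idx + 1))  # 0 when the training video simply plays on
        return audio_dist + 0.1 * jump          # continuity weight is an assumption
    return min(training, key=cost)[1]
```

When the input voice tracks a training sequence closely, the audio term dominates and whole training sub-sequences get re-used, which is why consecutive output faces stay continuous.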
1. A computer-implemented process for synthesizing mouth motion to an audio signal, comprising the following process actions:
training Hidden Markov Models to create face states or face sequences for a given speech audio signal using substantially continuous images of face sequences correlated with the speech audio signal; and
using said trained Hidden Markov Models to generate mouth motions for a given audio input. (Dependent claims: 2, 3, 4, 5, 7, 8, 18, 20)
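As a toy illustration of the training action in claim 1, one HMM can be fitted per face sequence from the quantized acoustic data that accompanied it in the training video. A one-state discrete HMM is a deliberate simplification (real sub-sequence HMMs would be trained with Baum-Welch over several states); the function name and smoothing value are assumptions.

```python
from collections import Counter

def train_single_state_hmm(audio_symbols, n_symbols, smoothing=1.0):
    """Fit a one-state discrete HMM to the quantized audio that accompanied
    one face sequence. With a single state, training reduces to smoothed
    emission-frequency counts."""
    counts = Counter(audio_symbols)
    total = len(audio_symbols) + smoothing * n_symbols
    emission = [(counts[s] + smoothing) / total for s in range(n_symbols)]
    start = [1.0]    # trivial initial-state distribution
    trans = [[1.0]]  # trivial transition matrix
    return start, trans, [emission]
```

Each trained model can later score how well an audio portion matches the face sequence it was trained on.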
6. (canceled)
9-17. (canceled)
19. (canceled)
21. A system for synthesizing lip motion to coordinate with an audio input, the system comprising:
a general purpose computing device; and
a computer program comprising program modules executable by the computing device, wherein the computing device is directed by the program modules of the computer program to: train Hidden Markov Models to create face states or face sequences for a given speech audio signal using images of face sequences associated with an audio signal; and
use said trained Hidden Markov Models to synthesize images of mouth motions for a given audio input. (Dependent claims: 22, 24, 25, 26, 27, 28)
23. (canceled)
29-30. (canceled)
31. A computer-implemented process for synthesizing a video from an audio signal, comprising the following process actions:
inputting a training video of synchronized audio and video frames;
training a series of Hidden Markov Models, each of which represents one of a sequence of consecutive characterized video frames of a face of a person speaking or a single characterized video frame of a face of a speaking person, with characterized segments of a portion of the audio associated with the particular frame sequence or frame represented by the HMM, such that given an audio input each HMM is capable of providing an indication of the probability that a portion of the audio input matches the portion of the audio of the training video used to train that HMM;
consecutively inputting portions of an audio signal of a person's voice into each trained HMM and identifying from the resulting HMM probability produced for each portion of the input audio a characterized frame or sequence of characterized frames best matching the inputted portion of the audio signal; and
synthesizing a video sequence from the characterized frames identified as best matching the inputted audio portions and generating frames of the synthesized video by synchronizing the synthesized video sequence with associated portions of the input audio. (Dependent claims: 32, 33, 39)
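Claim 31's selection step, scoring an audio portion under every trained HMM and keeping the best-matching frame sequence, can be sketched with the standard scaled forward algorithm for discrete HMMs. The parameter layout and the mouth-state names below are illustrative assumptions, not the patent's actual representation.

```python
import math

def _rescale(alpha):
    """Normalize the forward variables and return the log of the scale factor."""
    scale = sum(alpha)
    return math.log(scale), [a / scale for a in alpha]

def forward_log_likelihood(obs, start_p, trans_p, emit_p):
    """Log-likelihood of a discrete observation sequence under one HMM,
    computed with the scaled forward algorithm (scaling avoids underflow)."""
    n = len(start_p)
    alpha = [start_p[s] * emit_p[s][obs[0]] for s in range(n)]
    log_ll, alpha = _rescale(alpha)
    for t in range(1, len(obs)):
        alpha = [emit_p[s][obs[t]] * sum(alpha[r] * trans_p[r][s] for r in range(n))
                 for s in range(n)]
        step_ll, alpha = _rescale(alpha)
        log_ll += step_ll
    return log_ll

def best_matching_sequence(audio_portion, hmms):
    """Score the audio portion under every trained HMM and return the key of
    the face frame sequence whose HMM assigns the highest likelihood."""
    return max(hmms, key=lambda k: forward_log_likelihood(audio_portion, *hmms[k]))
```

Here `hmms` maps a face-sequence identifier to its (start, transition, emission) parameters; a synthesizer would then emit the frames of the winning sequence, time-aligned with the corresponding audio portion.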
34-38. (canceled)
Specification