System and method for real time lip synchronization
Abstract
A novel method for synchronizing the lips of a sketched face to an input voice. The lip synchronization system and method re-use training video as much as possible when the input voice is similar to the training voice sequences. Face sequences are first clustered from video segments; then, using sub-sequence Hidden Markov Models, a correlation between speech signals and face shape sequences is built. This re-use of video reduces the discontinuity between consecutive output faces and yields accurate, realistic synthesized animations. The lip synchronization system and method can synthesize faces from input audio in real time without noticeable delay. Since acoustic feature data computed from the audio drives the system directly, without considering its phonemic representation, the method can adapt to any kind of voice, language, or sound.
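The abstract's core idea, re-using training-video frames when the input audio resembles the training audio while penalizing jumps between consecutive output frames, can be sketched minimally as follows. All names, the 1-D stand-in acoustic feature, and the 0.1 continuity weight are illustrative assumptions, not details from the patent.

```python
def pick_frame(audio_feat, prev_frame_idx, training):
    """Choose a training-video frame for one acoustic feature.

    training: list of (acoustic_feature, frame_index) pairs taken from the
    training video. The frame whose training audio is closest to audio_feat
    wins, with a small penalty for discontinuity relative to the frame that
    would smoothly follow the previously emitted one.
    """
    def cost(item):
        feat, idx = item
        audio_dist = abs(feat - audio_feat)     # 1-D stand-in for an acoustic distance
        jump = abs(idx - (prev_frame_idx + 1))  # 0 when the training video simply plays on
        return audio_dist + 0.1 * jump          # continuity weight is an assumption
    return min(training, key=cost)[1]
```

When the input voice tracks a training sequence closely, the audio term dominates and whole training sub-sequences get re-used, which is why consecutive output faces stay continuous.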
1. A computer-implemented process for synthesizing mouth motion to an audio signal, comprising the following process actions:
training Hidden Markov Models to create face states or face sequences for a given speech audio signal using substantially continuous images of face sequences correlated with the speech audio signal; and
using said trained Hidden Markov Models to generate mouth motions for a given audio input. (Dependent claims: 2, 3, 4, 5, 7, 8, 18, 20)
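As a toy illustration of the training action in claim 1, one HMM can be fitted per face sequence from the quantized acoustic data that accompanied it in the training video. A one-state discrete HMM is a deliberate simplification (real sub-sequence HMMs would be trained with Baum-Welch over several states); the function name and smoothing value are assumptions.

```python
from collections import Counter

def train_single_state_hmm(audio_symbols, n_symbols, smoothing=1.0):
    """Fit a one-state discrete HMM to the quantized audio that accompanied
    one face sequence. With a single state, training reduces to smoothed
    emission-frequency counts."""
    counts = Counter(audio_symbols)
    total = len(audio_symbols) + smoothing * n_symbols
    emission = [(counts[s] + smoothing) / total for s in range(n_symbols)]
    start = [1.0]    # trivial initial-state distribution
    trans = [[1.0]]  # trivial transition matrix
    return start, trans, [emission]
```

Each trained model can later score how well an audio portion matches the face sequence it was trained on.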
6. (canceled)
9-17. (canceled)
19. (canceled)
21. A system for synthesizing lip motion to coordinate with an audio input, the system comprising:
a general purpose computing device; and
a computer program comprising program modules executable by the computing device, wherein the computing device is directed by the program modules of the computer program to: train Hidden Markov Models to create face states or face sequences for a given speech audio signal using images of face sequences associated with an audio signal; and
use said trained Hidden Markov Models to synthesize images of mouth motions for a given audio input. (Dependent claims: 22, 24, 25, 26, 27, 28)
23. (canceled)
29-30. (canceled)
31. A computer-implemented process for synthesizing a video from an audio signal, comprising the following process actions:
inputting a training video of synchronized audio and video frames;
training a series of Hidden Markov Models, each of which represents one of a sequence of consecutive characterized video frames of a face of a person speaking or a single characterized video frame of a face of a speaking person, with characterized segments of a portion of the audio associated with the particular frame sequence or frame represented by the HMM, such that given an audio input each HMM is capable of providing an indication of the probability that a portion of the audio input matches the portion of the audio of the training video used to train that HMM;
consecutively inputting portions of an audio signal of a person's voice into each trained HMM and identifying from the resulting HMM probability produced for each portion of the input audio a characterized frame or sequence of characterized frames best matching the inputted portion of the audio signal; and
synthesizing a video sequence from the characterized frames identified as best matching the inputted audio portions and generating frames of the synthesized video by synchronizing the synthesized video sequence with associated portions of the input audio. (Dependent claims: 32, 33, 39)
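Claim 31's selection step, scoring an audio portion under every trained HMM and keeping the best-matching frame sequence, can be sketched with the standard scaled forward algorithm for discrete HMMs. The parameter layout and the mouth-state names below are illustrative assumptions, not the patent's actual representation.

```python
import math

def _rescale(alpha):
    """Normalize the forward variables and return the log of the scale factor."""
    scale = sum(alpha)
    return math.log(scale), [a / scale for a in alpha]

def forward_log_likelihood(obs, start_p, trans_p, emit_p):
    """Log-likelihood of a discrete observation sequence under one HMM,
    computed with the scaled forward algorithm (scaling avoids underflow)."""
    n = len(start_p)
    alpha = [start_p[s] * emit_p[s][obs[0]] for s in range(n)]
    log_ll, alpha = _rescale(alpha)
    for t in range(1, len(obs)):
        alpha = [emit_p[s][obs[t]] * sum(alpha[r] * trans_p[r][s] for r in range(n))
                 for s in range(n)]
        step_ll, alpha = _rescale(alpha)
        log_ll += step_ll
    return log_ll

def best_matching_sequence(audio_portion, hmms):
    """Score the audio portion under every trained HMM and return the key of
    the face frame sequence whose HMM assigns the highest likelihood."""
    return max(hmms, key=lambda k: forward_log_likelihood(audio_portion, *hmms[k]))
```

Here `hmms` maps a face-sequence identifier to its (start, transition, emission) parameters; a synthesizer would then emit the frames of the winning sequence, time-aligned with the corresponding audio portion.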
34-38. (canceled)
Specification