Joint audio-video facial animation system
First Claim
1. A method comprising:
accessing audio data and video data at a client device, the audio data comprising a speech signal;
determining locations of a set of facial landmarks based on the video data;
identifying a user profile based on the locations of the set of facial landmarks, the user profile comprising a selection of a user avatar;
generating a weighted finite state transducer (WFST) based on at least the speech signal of the audio data;
performing a breadth-first search upon an output of the WFST;
determining a phone sequence based on the breadth-first search;
generating a first facial model based on the locations of the set of facial landmarks;
generating a second facial model based on the phone sequence;
constructing a composite facial model based on the first facial model, the second facial model, and the selection of the user avatar; and
causing display of the composite facial model at the client device.
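The compositing step above can be sketched as a simple linear blend of a landmark-driven facial model and a phone-driven one. Everything here — `model_from_landmarks`, the phone-to-viseme table, and the blend weight `alpha` — is a hypothetical illustration of one way the claimed step could work, not the patent's actual implementation.

```python
# Viseme/blendshape weights inferred from video landmarks (hypothetical geometry).
def model_from_landmarks(landmarks):
    # Mouth opening estimated from the vertical gap between lip landmarks.
    mouth_open = abs(landmarks["upper_lip"][1] - landmarks["lower_lip"][1]) / 10.0
    return {"jaw_open": mouth_open, "lip_round": 0.1}

# Viseme weights derived from a recognized phone (invented mapping).
PHONE_TO_VISEME = {
    "AA": {"jaw_open": 0.9, "lip_round": 0.2},
    "OW": {"jaw_open": 0.5, "lip_round": 0.8},
}

def model_from_phone(phone):
    return PHONE_TO_VISEME.get(phone, {"jaw_open": 0.0, "lip_round": 0.0})

def composite(video_model, audio_model, alpha=0.5):
    """Blend the two facial models per-blendshape (one possible compositing rule)."""
    keys = video_model.keys() | audio_model.keys()
    return {k: alpha * video_model.get(k, 0.0) + (1 - alpha) * audio_model.get(k, 0.0)
            for k in keys}

v = model_from_landmarks({"upper_lip": (0, 2.0), "lower_lip": (0, 8.0)})
a = model_from_phone("AA")
print(composite(v, a))  # blended weights, e.g. jaw_open = 0.5*0.6 + 0.5*0.9 = 0.75
```

The selected user avatar would then be rendered with the blended blendshape weights; the blend is shown as a fixed linear mix only for concreteness.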
Abstract
The present invention relates to a joint automatic audio-visual driven facial animation system. In some example embodiments, the system includes a full-scale, state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) engine with a strong language model for speech recognition, and obtains phoneme alignment from the word lattice.
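The claimed decoding steps — generating a WFST from the speech signal, performing a breadth-first search over its output, and determining a phone sequence — can be illustrated with a toy acyclic transducer. The states, phone labels, and arc weights below are invented for the sketch; a real LVCSR word lattice would be far larger and would typically be searched with dedicated shortest-path machinery rather than plain BFS.

```python
from collections import deque

# Minimal acyclic WFST output graph: state -> [(next_state, output_phone, weight)].
# Labels and weights are illustrative, not from the patent.
WFST = {
    0: [(1, "HH", 0.4), (1, "AH", 0.9)],
    1: [(2, "EH", 0.3)],
    2: [(3, "L", 0.2), (3, "OW", 0.6)],
    3: [],
}
START, FINAL = 0, 3

def best_phone_sequence(fst, start, final):
    """Breadth-first search over the transducer's output arcs, returning the
    minimum-weight phone sequence that reaches the final state."""
    best_cost, best_phones = float("inf"), []
    queue = deque([(start, 0.0, [])])
    while queue:
        state, cost, phones = queue.popleft()
        if state == final and cost < best_cost:
            best_cost, best_phones = cost, phones
        for nxt, phone, weight in fst.get(state, []):
            queue.append((nxt, cost + weight, phones + [phone]))
    return best_phones

print(best_phone_sequence(WFST, START, FINAL))  # ['HH', 'EH', 'L']
```

Exhaustive BFS is tractable here only because the toy lattice is tiny and acyclic; it is meant to make the claim language concrete, not to model production decoding.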
14 Claims
1. A method comprising:
accessing audio data and video data at a client device, the audio data comprising a speech signal;
determining locations of a set of facial landmarks based on the video data;
identifying a user profile based on the locations of the set of facial landmarks, the user profile comprising a selection of a user avatar;
generating a weighted finite state transducer (WFST) based on at least the speech signal of the audio data;
performing a breadth-first search upon an output of the WFST;
determining a phone sequence based on the breadth-first search;
generating a first facial model based on the locations of the set of facial landmarks;
generating a second facial model based on the phone sequence;
constructing a composite facial model based on the first facial model, the second facial model, and the selection of the user avatar; and
causing display of the composite facial model at the client device.
Dependent claims: 2, 3, 4, 5.
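The "identifying a user profile" step could, under one simple reading, be nearest-neighbor matching of the observed landmark geometry against enrolled landmark signatures. The profile table, signature vectors, and avatar names below are entirely hypothetical; the patent does not specify this matching scheme.

```python
import math

# Hypothetical enrolled profiles: user -> avatar selection and landmark signature.
PROFILES = {
    "alice": {"avatar": "fox",  "signature": [0.32, 0.58, 0.11]},
    "bob":   {"avatar": "bear", "signature": [0.45, 0.40, 0.25]},
}

def identify_profile(landmark_signature, profiles=PROFILES):
    """Match an observed landmark-derived feature vector to the closest
    enrolled profile (one simple realization of the identification step)."""
    name = min(profiles,
               key=lambda u: math.dist(landmark_signature, profiles[u]["signature"]))
    return name, profiles[name]["avatar"]

print(identify_profile([0.30, 0.60, 0.10]))  # ('alice', 'fox')
```

The returned avatar selection is what the later compositing step would consume when constructing the composite facial model.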
6. A system comprising:
a memory; and
at least one hardware processor coupled to the memory and comprising instructions that cause the system to perform operations comprising:
accessing audio data and video data at a client device, the audio data comprising a speech signal;
determining locations of a set of facial landmarks based on the video data;
identifying a user profile based on the locations of the set of facial landmarks, the user profile comprising a selection of a user avatar;
generating a weighted finite state transducer (WFST) based on at least the speech signal of the audio data;
performing a breadth-first search upon an output of the WFST;
determining a phone sequence based on the breadth-first search;
generating a first facial model based on the locations of the set of facial landmarks;
generating a second facial model based on the phone sequence;
constructing a composite facial model based on the first facial model, the second facial model, and the selection of the user avatar; and
causing display of the composite facial model at the client device.
Dependent claims: 7, 8, 9, 10.
11. A non-transitory machine-readable storage medium comprising instructions that, when executed by one or more processors of a machine, cause the machine to perform operations comprising:
accessing audio data and video data at a client device, the audio data comprising a speech signal;
determining locations of a set of facial landmarks based on the video data;
identifying a user profile based on the locations of the set of facial landmarks, the user profile comprising a selection of a user avatar;
generating a weighted finite state transducer (WFST) based on at least the speech signal of the audio data;
performing a breadth-first search upon an output of the WFST;
determining a phone sequence based on the breadth-first search;
generating a first facial model based on the locations of the set of facial landmarks;
generating a second facial model based on the phone sequence;
constructing a composite facial model based on the first facial model, the second facial model, and the selection of the user avatar; and
causing display of the composite facial model at the client device.
Dependent claims: 12, 13, 14.
Specification