System and method for triphone-based unit selection for visual speech synthesis
Abstract
A system and method for generating a video sequence having mouth movements synchronized with speech sounds are disclosed. The system utilizes a database of n-phones as the smallest selectable unit, wherein n is larger than 1 and preferably 3. The system calculates a target cost for each candidate n-phone for a target frame using a phonetic distance, coarticulation parameter, and speech rate. For each n-phone in a target sequence, the system searches for candidate n-phones that are visually similar according to the target cost. The system samples each candidate n-phone to obtain the same number of frames as in the target sequence and builds a video frame lattice of candidate video frames. The system assigns a joint cost to each pair of adjacent frames and searches the video frame lattice to construct the video sequence by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.
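The lattice search described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `resample_indices`, `viterbi_path`, `target_cost`, and `joint_cost` are hypothetical, frames are stood in for by plain numbers, and the target cost is supplied per lattice position for simplicity rather than being computed per candidate n-phone from phonetic distance, coarticulation, and speech rate. The sketch only shows how sampling candidates to a common length and running a Viterbi-style dynamic program yields the path with the minimum summed target and joint cost.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def resample_indices(n_source: int, n_target: int) -> List[int]:
    """Pick n_target frame indices spread evenly over n_source frames.

    Assumes n_source >= 1 and n_target >= 1 (illustrative helper only).
    """
    if n_target == 1:
        return [0]
    return [round(i * (n_source - 1) / (n_target - 1)) for i in range(n_target)]


def viterbi_path(
    lattice: Sequence[Sequence[Frame]],
    target_cost: Callable[[int, Frame], float],
    joint_cost: Callable[[Frame, Frame], float],
) -> List[Frame]:
    """Choose one candidate per position, minimising target + joint cost."""
    # best[t][j]: cheapest total cost of any path ending at candidate j of position t.
    best = [[target_cost(0, frame) for frame in lattice[0]]]
    back: List[List[int]] = []  # back[t-1][j]: best predecessor index for (t, j)

    for t in range(1, len(lattice)):
        row_cost, row_back = [], []
        for frame in lattice[t]:
            # Cost of reaching this candidate from each candidate at position t-1.
            costs = [
                best[-1][i] + joint_cost(prev, frame)
                for i, prev in enumerate(lattice[t - 1])
            ]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            row_cost.append(costs[i_min] + target_cost(t, frame))
            row_back.append(i_min)
        best.append(row_cost)
        back.append(row_back)

    # Start from the cheapest final candidate and follow back-pointers.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for row_back in reversed(back):
        j = row_back[j]
        path.append(j)
    path.reverse()
    return [lattice[t][j] for t, j in enumerate(path)]


# Toy usage: numbers stand in for frames, and the joint cost is their difference.
toy_lattice = [[0, 5], [1, 6], [2, 9]]
smooth = viterbi_path(toy_lattice, lambda t, f: 0.0, lambda a, b: abs(a - b))
```

In this toy run the target cost is zero everywhere, so the search simply picks the smoothest sequence of candidates. With a non-trivial target cost, the same routine trades off similarity to the target against smooth transitions between adjacent frames.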
28 Claims
1. A non-transitory computer-readable storage medium storing instructions which, when executed by a computing device, cause the computing device to generate a video sequence having mouth movements synchronized with speech sounds, the instructions comprising:
calculating a target cost for each candidate n-phone from a database of n-phones for a target sequence;
building a video frame lattice of candidate video frames based on the candidate n-phones;
assigning a joint cost to each pair of adjacent video frames; and
constructing the video sequence according to a Viterbi search on the video frame lattice by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system for generating a bitstream having mouth movements synchronized with speech sounds, the system comprising:
a processor;
a first module configured to control the processor to calculate a target cost for each candidate n-phone from a database of n-phones for a target sequence;
a second module configured to control the processor to build a video frame lattice of candidate video frames according to the candidate n-phones;
a third module configured to control the processor to assign a joint cost to each pair of adjacent video frames; and
a fourth module configured to control the processor to construct the video sequence according to a Viterbi search on the video frame lattice by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
21. A method of generating a video sequence having synchronized movement and sound, the method comprising:
building a video frame lattice of candidate video frames based on a calculated target cost for each candidate n-phone associated with a target sequence;
assigning, via a processor, a joint cost to each pair of adjacent video frames in the video frame lattice; and
generating the video sequence by finding the optimal path through the video frame lattice based at least in part on the target cost and the joint cost over the video sequence.
Dependent claims: 22, 23, 24, 25, 26, 27, 28
Specification