Talking facial display method and apparatus
First Claim
1. A method of converting input text into an audio-visual speech stream comprising a talking face image enunciating the text, wherein said audio-visual speech stream comprises a plurality of phonemes and timing information, wherein the talking face image is built using a plurality of visemes, the method comprising the steps of:
recording a visual corpus of a human-subject;
extracting and defining a plurality of visemes from the recorded visual corpus, said visemes being defined by a set of images spanning a range of mouth shapes derived from the recorded visual corpus;
building a viseme interpolation database, said database comprising a plurality of viseme images and at least one set of interpolation vectors that define a transition from each viseme image to every other viseme image, said viseme images in said interpolation database being a subset of said plurality of visemes extracted from said visual corpus, said set of interpolation vectors being computed automatically (i) in the absence of a definition of a set of high-level features and (ii) through the use of optical flow methods, said viseme interpolation database further comprising a corresponding set of intermediate viseme images automatically generated as a function of respective interpolation vectors; and
synchronizing the talking face image with an input text stream by employing said interpolation vectors and viseme images contained in said interpolation database, said synchronizing resulting in giving the impression of a photo-realistic talking face.
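Claim 1 requires the interpolation vectors to be computed automatically, without a predefined set of high-level features, using optical flow. As a rough illustration of the idea (not the patent's implementation), the toy estimator below recovers a per-pixel displacement field between two small grayscale images by exhaustive block matching; the function name and the patch/search parameters are invented for this sketch:

```python
def block_flow(img_a, img_b, patch=1, search=2):
    """Per-pixel (dy, dx) displacement field from img_a to img_b.

    Images are nested lists of grayscale intensities. For each pixel, the
    displacement minimizing the sum of absolute differences over a small
    patch is chosen -- no facial landmarks or other high-level features.
    """
    h, w = len(img_a), len(img_a[0])
    flow = [[(0, 0)] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            best_cost, best_d = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if not (0 <= yy < h and 0 <= xx < w):
                        continue
                    # Patch samples are clamped at the image border.
                    cost = sum(
                        abs(img_a[min(h - 1, max(0, y + j))][min(w - 1, max(0, x + i))]
                            - img_b[min(h - 1, max(0, yy + j))][min(w - 1, max(0, xx + i))])
                        for j in range(-patch, patch + 1)
                        for i in range(-patch, patch + 1))
                    if best_cost is None or cost < best_cost:
                        best_cost, best_d = cost, (dy, dx)
            flow[y][x] = best_d
    return flow
```

For example, a bright pixel that moves one column to the right between the two images yields a flow of (0, 1) at that pixel. Production systems use dense, coarse-to-fine optical flow rather than this brute-force search.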
Abstract
A method and apparatus for converting input text into an audio-visual speech stream resulting in a talking face image enunciating the text. This method of converting input text into an audio-visual speech stream comprises the steps of: recording a visual corpus of a human-subject, building a viseme interpolation database, and synchronizing the talking face image with the text stream. In a preferred embodiment, viseme transitions are automatically calculated using optical flow methods, and morphing techniques are employed to produce smooth viseme transitions. The viseme transitions are concatenated together and synchronized with the phonemes according to the timing information. The audio-visual speech stream is then displayed in real time, thereby displaying a photo-realistic talking face.
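The concatenation-and-timing step described in the abstract can be sketched as follows; this is a hypothetical outline, with all names (synthesize_frames, the transition table, the frame rate) invented for illustration. Phoneme and timing information from the text-to-speech front end selects precomputed viseme transitions, which are resampled to the phoneme durations and concatenated:

```python
def synthesize_frames(phonemes, timings_ms, transitions, fps=25):
    """Concatenate stored viseme transitions, paced by phoneme timing.

    `transitions` maps an ordered phoneme pair to its precomputed sequence
    of viseme/intermediate images; `timings_ms` gives the duration of each
    transition in milliseconds.
    """
    frames = []
    for (a, b), dur in zip(zip(phonemes, phonemes[1:]), timings_ms):
        seq = transitions[(a, b)]              # stored transition images
        n = max(1, round(dur * fps / 1000))    # frames this transition spans
        # Resample the stored sequence to n frames (nearest-index pick).
        frames += [seq[int(i * len(seq) / n)] for i in range(n)]
    return frames
```

At 25 frames per second, a 200 ms transition occupies five frames; playing the concatenated frames while the synthesizer plays the corresponding audio yields the synchronized, real-time display the abstract describes.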
196 Citations
41 Claims
1. A method of converting input text into an audio-visual speech stream comprising a talking face image enunciating the text, wherein said audio-visual speech stream comprises a plurality of phonemes and timing information, wherein the talking face image is built using a plurality of visemes, the method comprising the steps of:
recording a visual corpus of a human-subject;
extracting and defining a plurality of visemes from the recorded visual corpus, said visemes being defined by a set of images spanning a range of mouth shapes derived from the recorded visual corpus;
building a viseme interpolation database, said database comprising a plurality of viseme images and at least one set of interpolation vectors that define a transition from each viseme image to every other viseme image, said viseme images in said interpolation database being a subset of said plurality of visemes extracted from said visual corpus, said set of interpolation vectors being computed automatically (i) in the absence of a definition of a set of high-level features and (ii) through the use of optical flow methods, said viseme interpolation database further comprising a corresponding set of intermediate viseme images automatically generated as a function of respective interpolation vectors; and
synchronizing the talking face image with an input text stream by employing said interpolation vectors and viseme images contained in said interpolation database, said synchronizing resulting in giving the impression of a photo-realistic talking face. (Dependent claims: 2-26)
identifying each viseme as corresponding to a phoneme; and
extracting a plurality of visemes from said visual corpus.
7. The method of claim 6 wherein identifying each viseme comprises the steps of:
searching through said recording; and
relating each viseme on each recorded frame of said recording to a phoneme.
8. The method of claim 7 wherein the steps of searching and relating are performed manually.
9. The method of claim 7 wherein said relating each viseme comprises the steps of:
subjectively rating each viseme and phoneme combination; and
selecting a final set of visemes from among said rated viseme and phoneme combinations.
10. The method of claim 9 further comprising the step of attaching attributes to each viseme, said attributes defining characteristics of said human-subject.
11. The method of claim 10 wherein said characteristics of said human-subject are selected from the group consisting of eye position, eyelid position, head angle, head tilt, eyebrow position, shoulder position, posture, and overall position within the frame.
12. The method of claim 10 wherein said attributes are used to separate the visemes into a plurality of viseme sets, said plurality of viseme sets containing about the same visemes, said plurality of viseme sets facilitating a reduction of repetitive movements thereby resulting in giving the impression of a more photo-realistic talking face.
13. The method of claim 6 further comprising the step of logging said plurality of visemes to a recording medium.
14. The method of claim 6 wherein extracting a plurality of visemes from said visual corpus results in at least one set of 16 visemes.
15. The method of claim 14 wherein a set of interpolation vectors defines two hundred fifty-six viseme transitions.
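Claim 15's figure of two hundred fifty-six follows from the 16-viseme set of claim 14 if transitions are indexed by ordered viseme pairs, including each viseme's transition to itself (an assumption, but the one consistent with the recited count, since 16 × 16 = 256):

```python
# 16 visemes -> 16 * 16 = 256 ordered pairs, one stored transition per pair.
visemes = [f"v{i}" for i in range(16)]
transitions = {(a, b) for a in visemes for b in visemes}
print(len(transitions))  # 256
```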
16. The method of claim 1 wherein said viseme transitions are non-linear, said non-linear viseme transitions producing smooth dynamics between viseme images for a more photo-realistic talking face.
17. The method of claim 1 wherein said viseme transitions are performed using morphing techniques, said morphing techniques resulting in a smooth transition between viseme images for a more photo-realistic talking face.
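Claim 17's morphing can be pictured as warping each endpoint image part-way along the interpolation vectors and cross-dissolving the results. The toy below does this in one dimension with integer displacements; the function name and the flow convention (flow[x] carries pixel x of the first image toward the second) are invented for this sketch, and real morphing additionally needs 2-D warping, hole filling, and blending:

```python
def morph_1d(a, b, flow, t):
    """Intermediate image at fraction t between 1-D images a and b.

    Each output pixel samples a part-way back along the flow and b part-way
    forward along it, then cross-dissolves the two samples.
    """
    n = len(a)
    out = []
    for x in range(n):
        xa = min(n - 1, max(0, x - round(t * flow[x])))        # warp toward b
        xb = min(n - 1, max(0, x + round((1 - t) * flow[x])))  # warp toward a
        out.append((1 - t) * a[xa] + t * b[xb])                # cross-dissolve
    return out
```

At t = 0 this returns a and at t = 1 it returns b, so a sequence of increasing t values yields the smooth intermediate frames the claim describes.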
18. The method of claim 1 wherein said synchronizing comprises the steps of:
concatenating a plurality of viseme transitions, said concatenating resulting in a complete visual utterance; and
extracting from a text-to-speech synthesizer phoneme and timing information, said phoneme and timing information being used to determine which viseme transitions from said database to use and the rate at which viseme transitions should occur.
19. The method of claim 18 further comprising the step of displaying the photo-realistic talking face in real time.
20. The method according to claim 1, wherein automatically generating the intermediate viseme images employs warping.
21. The method according to claim 20, wherein automatically generating the intermediate viseme images employs hole filling.
22. The method according to claim 21, wherein automatically generating the intermediate viseme images employs blending.
23. The method according to claim 1, wherein automatically generating the intermediate viseme images employs morphing.
24. The method according to claim 1, wherein the intermediate viseme images are located along respective interpolating vectors that define a transition from one viseme image to another viseme image.
25. The method according to claim 1, wherein the intermediate viseme images are located along new interpolation vectors computed as a function of respective computed interpolation vectors.
26. The method according to claim 25, wherein the new interpolation vectors are respective linear combinations of said computed interpolation vectors.
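Claims 25 and 26 allow intermediate images to lie along new interpolation vectors formed as linear combinations of the computed ones. Treating a vector set as a per-pixel field of (dy, dx) displacements, the combination is just a weighted sum; the helper name and field layout are invented for this sketch:

```python
def combine_flows(flow_1, flow_2, w1, w2):
    """Weighted sum of two per-pixel (dy, dx) displacement fields."""
    return [
        [(w1 * dy1 + w2 * dy2, w1 * dx1 + w2 * dx2)
         for (dy1, dx1), (dy2, dx2) in zip(row_1, row_2)]
        for row_1, row_2 in zip(flow_1, flow_2)
    ]
```

For example, averaging two transition fields with weights 0.5 and 0.5 yields a field halfway between them, along which new intermediate viseme images can be generated.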
27. A system for generating and displaying a talking facial display comprising:
a computer;
an image source in electrical communication with the computer to transfer input images of a human-subject to the computer, the input images composing a visual corpus;
a text data source in electrical communication with the computer to transfer input text to the computer, the input text composing a text stream; and
processor routines executed by the computer, the processor routines comprising instructions to:
(i) build a viseme interpolation database, said database comprising a plurality of viseme images and at least one set of interpolation vectors that define a transition from each viseme image to every other viseme image, said viseme images in said interpolation database being a subset of a plurality of visemes extracted from said visual corpus, said set of interpolation vectors being computed automatically (i) in the absence of a definition of a set of high-level features and (ii) through the use of optical flow methods, said viseme interpolation database further comprising a corresponding set of intermediate viseme images automatically generated as a function of respective interpolation vectors; and
(ii) synchronize an image of a talking face with the text stream by employing said interpolation vectors and viseme images contained in said interpolation database, said synchronizing resulting in giving the impression of a photo-realistic talking face. (Dependent claims: 28-41)
Specification