System and method for audio-visual content synthesis
First Claim
1. An audio-visual content synthesis apparatus for (i) receiving audio-visual input signals that represent a speaker who is speaking and (ii) creating an animated version of the speaker's face that represents the speaker's speech, said apparatus comprising:
means for extracting (i) audio features of the speaker's speech and (ii) visual features of the speaker's face from the audio-visual input signals;
means for creating audiovisual input vectors from (i) the extracted audio features and (ii) the extracted visual features, wherein each audiovisual input vector comprises a hybrid logical unit that exhibits properties of both (a) the phonemes and (b) the visemes;
means for creating audiovisual configurations from the audiovisual input vectors, wherein the audiovisual configurations comprise speaking face movement components in an audiovisual space; and
means for performing a semantic association procedure on the audiovisual input vectors to obtain an association between phonemes that represent the speaker's speech and visemes that represent the speaker's face for each audiovisual input vector.
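The "semantic association procedure" of the last limitation pairs each phoneme with the viseme (visible mouth shape) it corresponds to. As a minimal sketch of one way such an association could be built, the following estimates a phoneme-to-viseme mapping by majority co-occurrence over time-aligned labels; all names and the co-occurrence approach are illustrative assumptions, not taken from the patent text.

```python
from collections import Counter, defaultdict

def associate_phonemes_with_visemes(aligned_pairs):
    """Estimate a phoneme -> viseme association from time-aligned
    (phoneme, viseme) label pairs by majority co-occurrence.
    Hypothetical sketch; the patent does not specify this method."""
    counts = defaultdict(Counter)
    for phoneme, viseme in aligned_pairs:
        counts[phoneme][viseme] += 1
    # For each phoneme, keep the viseme it co-occurs with most often.
    return {p: c.most_common(1)[0][0] for p, c in counts.items()}

# Illustrative aligned labels: bilabial /p/ and /m/ share a closed-lips shape.
pairs = [("p", "closed_lips"), ("p", "closed_lips"), ("p", "open"),
         ("a", "open"), ("a", "open"), ("m", "closed_lips")]
mapping = associate_phonemes_with_visemes(pairs)
```

A table like `mapping` is enough to drive mouth shapes from a phoneme stream, which is the association the claim's final "means for" element describes.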
Abstract
A system and method is provided for synthesizing audio-visual content in a video image processor. A content synthesis application processor extracts audio features and video features from audio-visual input signals that represent a speaker who is speaking. The processor uses the extracted visual features to create a computer generated animated version of the face of the speaker. The processor synchronizes facial movements of the animated version of the face of the speaker with a plurality of audio logical units such as phonemes that represent the speaker's speech. In this manner the processor synthesizes an audio-visual representation of the speaker's face that is properly synchronized with the speaker's speech.
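The abstract describes combining per-frame audio features with per-frame visual features into the "audiovisual input vectors" of the claims. A minimal sketch of that step, assuming frame-aligned streams and hypothetical feature layouts (e.g. spectral coefficients for audio, lip-shape measurements for video):

```python
def make_audiovisual_vectors(audio_frames, visual_frames):
    """Concatenate per-frame audio features with per-frame visual
    features into one audiovisual input vector per frame.
    Feature contents are illustrative assumptions."""
    if len(audio_frames) != len(visual_frames):
        raise ValueError("audio and visual streams must be frame-aligned")
    return [list(a) + list(v) for a, v in zip(audio_frames, visual_frames)]

audio = [[0.1, 0.2], [0.3, 0.4]]   # 2 frames, 2 audio features each
visual = [[10.0], [12.0]]          # 2 frames, 1 lip-opening feature each
vectors = make_audiovisual_vectors(audio, visual)
```

Each resulting vector carries both speech and face information for the same instant, which is what lets a single unit exhibit properties of both phonemes and visemes.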
Citations
20 Claims
1. An audio-visual content synthesis apparatus for (i) receiving audio-visual input signals that represent a speaker who is speaking and (ii) creating an animated version of the speaker's face that represents the speaker's speech, said apparatus comprising:
means for extracting (i) audio features of the speaker's speech and (ii) visual features of the speaker's face from the audio-visual input signals;
means for creating audiovisual input vectors from (i) the extracted audio features and (ii) the extracted visual features, wherein each audiovisual input vector comprises a hybrid logical unit that exhibits properties of both (a) the phonemes and (b) the visemes;
means for creating audiovisual configurations from the audiovisual input vectors, wherein the audiovisual configurations comprise speaking face movement components in an audiovisual space; and
means for performing a semantic association procedure on the audiovisual input vectors to obtain an association between phonemes that represent the speaker's speech and visemes that represent the speaker's face for each audiovisual input vector.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10)
11. A method for use in synthesizing audio-visual content in a video image processor, said method comprising the steps of:
receiving audio-visual input signals that represent a speaker who is speaking;
extracting (i) audio features of the speaker's speech and (ii) visual features of the speaker's face from the audio-visual input signals;
creating audiovisual input vectors from (i) the extracted audio features and (ii) the extracted visual features, wherein each audiovisual input vector comprises a hybrid logical unit that exhibits properties of both (a) the phonemes and (b) the visemes;
creating audiovisual configurations from the audiovisual input vectors, wherein the audiovisual configurations comprise speaking face movement components in an audiovisual space; and
performing a semantic association procedure on the audiovisual input vectors to obtain an association between phonemes that represent the speaker's speech and visemes that represent the speaker's face for each audiovisual input vector.
(Dependent claims: 12, 13, 14, 15, 16, 17, 18, 20)
19. The method as claimed in claim 18, further comprising the step of:
creating an animated version of the face of the speaker by using one of:
(1) 3D models with texture mapping and (2) video editing.
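Whichever rendering path claim 19 uses, the animated face must show the right mouth shape at the right time. As a minimal sketch of that timing step (not the patent's own algorithm), the following turns a timed phoneme track into a per-frame viseme schedule; the function, track format, and frame rate are all illustrative assumptions.

```python
def schedule_mouth_shapes(phoneme_track, phoneme_to_viseme, fps=25):
    """Turn a timed phoneme track [(phoneme, start_s, end_s), ...] into a
    per-frame viseme schedule that a renderer (3D model or video editor)
    could consume. Hypothetical sketch, not the patented method."""
    schedule = []
    for phoneme, start, end in phoneme_track:
        viseme = phoneme_to_viseme.get(phoneme, "neutral")
        n_frames = max(1, round((end - start) * fps))
        schedule.extend([viseme] * n_frames)
    return schedule

# "ma" over 0.2 s at 25 fps: 2 closed-lips frames, then 3 open frames.
track = [("m", 0.0, 0.08), ("a", 0.08, 0.2)]
schedule = schedule_mouth_shapes(track, {"m": "closed_lips", "a": "open"})
```

A 3D-model renderer would map each scheduled viseme to a texture-mapped mouth pose, while the video-editing approach would splice in recorded frames of that mouth shape.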
Specification