System and method for audio-visual content synthesis
First Claim
1. An audio-visual content synthesis apparatus for (i) receiving audio-visual input signals that represent a speaker who is speaking and (ii) creating an animated version of the speaker's face that represents the speaker's speech, said apparatus comprising:
means for extracting (i) audio features of the speaker's speech and (ii) visual features of the speaker's face from the audio-visual input signals;
means for creating audiovisual input vectors from (i) the extracted audio features and (ii) the extracted visual features, wherein each audiovisual input vector comprises a hybrid logical unit that exhibits properties of both (a) the phonemes and (b) the visemes;
means for creating audiovisual configurations from the audiovisual input vectors, wherein the audiovisual configurations comprise speaking face movement components in an audiovisual space; and
means for performing a semantic association procedure on the audiovisual input vectors to obtain an association between phonemes that represent the speaker's speech and visemes that represent the speaker's face for each audiovisual input vector.
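The "semantic association procedure" of the last limitation pairs each phoneme with the viseme (visible mouth shape) it corresponds to. As a minimal sketch of one way such an association could be built, the following estimates a phoneme-to-viseme mapping by majority co-occurrence over time-aligned labels; all names and the co-occurrence approach are illustrative assumptions, not taken from the patent text.

```python
from collections import Counter, defaultdict

def associate_phonemes_with_visemes(aligned_pairs):
    """Estimate a phoneme -> viseme association from time-aligned
    (phoneme, viseme) label pairs by majority co-occurrence.
    Hypothetical sketch; the patent does not specify this method."""
    counts = defaultdict(Counter)
    for phoneme, viseme in aligned_pairs:
        counts[phoneme][viseme] += 1
    # For each phoneme, keep the viseme it co-occurs with most often.
    return {p: c.most_common(1)[0][0] for p, c in counts.items()}

# Illustrative aligned labels: bilabial /p/ and /m/ share a closed-lips shape.
pairs = [("p", "closed_lips"), ("p", "closed_lips"), ("p", "open"),
         ("a", "open"), ("a", "open"), ("m", "closed_lips")]
mapping = associate_phonemes_with_visemes(pairs)
```

A table like `mapping` is enough to drive mouth shapes from a phoneme stream, which is the association the claim's final "means for" element describes.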
Abstract
A system and method is provided for synthesizing audio-visual content in a video image processor. A content synthesis application processor extracts audio features and video features from audio-visual input signals that represent a speaker who is speaking. The processor uses the extracted visual features to create a computer generated animated version of the face of the speaker. The processor synchronizes facial movements of the animated version of the face of the speaker with a plurality of audio logical units such as phonemes that represent the speaker's speech. In this manner the processor synthesizes an audio-visual representation of the speaker's face that is properly synchronized with the speaker's speech.
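The abstract describes combining per-frame audio features with per-frame visual features into the "audiovisual input vectors" of the claims. A minimal sketch of that step, assuming frame-aligned streams and hypothetical feature layouts (e.g. spectral coefficients for audio, lip-shape measurements for video):

```python
def make_audiovisual_vectors(audio_frames, visual_frames):
    """Concatenate per-frame audio features with per-frame visual
    features into one audiovisual input vector per frame.
    Feature contents are illustrative assumptions."""
    if len(audio_frames) != len(visual_frames):
        raise ValueError("audio and visual streams must be frame-aligned")
    return [list(a) + list(v) for a, v in zip(audio_frames, visual_frames)]

audio = [[0.1, 0.2], [0.3, 0.4]]   # 2 frames, 2 audio features each
visual = [[10.0], [12.0]]          # 2 frames, 1 lip-opening feature each
vectors = make_audiovisual_vectors(audio, visual)
```

Each resulting vector carries both speech and face information for the same instant, which is what lets a single unit exhibit properties of both phonemes and visemes.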
Citations
20 Claims
1. An audio-visual content synthesis apparatus for (i) receiving audio-visual input signals that represent a speaker who is speaking and (ii) creating an animated version of the speaker's face that represents the speaker's speech, said apparatus comprising:
means for extracting (i) audio features of the speaker's speech and (ii) visual features of the speaker's face from the audio-visual input signals;
means for creating audiovisual input vectors from (i) the extracted audio features and (ii) the extracted visual features, wherein each audiovisual input vector comprises a hybrid logical unit that exhibits properties of both (a) the phonemes and (b) the visemes;
means for creating audiovisual configurations from the audiovisual input vectors, wherein the audiovisual configurations comprise speaking face movement components in an audiovisual space; and
means for performing a semantic association procedure on the audiovisual input vectors to obtain an association between phonemes that represent the speaker's speech and visemes that represent the speaker's face for each audiovisual input vector.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10)
11. A method for use in synthesizing audio-visual content in a video image processor, said method comprising the steps of:
receiving audio-visual input signals that represent a speaker who is speaking;
extracting (i) audio features of the speaker's speech and (ii) visual features of the speaker's face from the audio-visual input signals;
creating audiovisual input vectors from (i) the extracted audio features and (ii) the extracted visual features, wherein each audiovisual input vector comprises a hybrid logical unit that exhibits properties of both (a) the phonemes and (b) the visemes;
creating audiovisual configurations from the audiovisual input vectors, wherein the audiovisual configurations comprise speaking face movement components in an audiovisual space; and
performing a semantic association procedure on the audiovisual input vectors to obtain an association between phonemes that represent the speaker's speech and visemes that represent the speaker's face for each audiovisual input vector.
(Dependent claims: 12, 13, 14, 15, 16, 17, 18, 20)
19. The method as claimed in claim 18, further comprising the step of:
creating an animated version of the face of the speaker by using one of:
(1) 3D models with texture mapping and (2) video editing.
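Whichever rendering path claim 19 uses, the animated face must show the right mouth shape at the right time. As a minimal sketch of that timing step (not the patent's own algorithm), the following turns a timed phoneme track into a per-frame viseme schedule; the function, track format, and frame rate are all illustrative assumptions.

```python
def schedule_mouth_shapes(phoneme_track, phoneme_to_viseme, fps=25):
    """Turn a timed phoneme track [(phoneme, start_s, end_s), ...] into a
    per-frame viseme schedule that a renderer (3D model or video editor)
    could consume. Hypothetical sketch, not the patented method."""
    schedule = []
    for phoneme, start, end in phoneme_track:
        viseme = phoneme_to_viseme.get(phoneme, "neutral")
        n_frames = max(1, round((end - start) * fps))
        schedule.extend([viseme] * n_frames)
    return schedule

# "ma" over 0.2 s at 25 fps: 2 closed-lips frames, then 3 open frames.
track = [("m", 0.0, 0.08), ("a", 0.08, 0.2)]
schedule = schedule_mouth_shapes(track, {"m": "closed_lips", "a": "open"})
```

A 3D-model renderer would map each scheduled viseme to a texture-mapped mouth pose, while the video-editing approach would splice in recorded frames of that mouth shape.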
Specification