PHOTO-REALISTIC SYNTHESIS OF IMAGE SEQUENCES WITH LIP MOVEMENTS SYNCHRONIZED WITH SPEECH

US 20120284029A1
Filed: 05/02/2011
Published: 11/08/2012
Est. Priority Date: 05/02/2011
Status: Active Grant

2 Assignments

0 Petitions

Abstract

Audiovisual data of an individual reading a known script is obtained and stored in an audio library and an image library. The audiovisual data is processed to extract feature vectors used to train a statistical model. An input audio feature vector corresponding to desired speech with which a synthesized image sequence will be synchronized is provided. The statistical model is used to generate a trajectory of visual feature vectors that corresponds to the input audio feature vector. These visual feature vectors are used to identify a matching image sequence from the image library. The resulting sequence of images, concatenated from the image library, provides a photorealistic image sequence with lip movements synchronized with the desired speech.

12 Citations

20 Claims

1. A computer-implemented method for generating photo-realistic facial animation with speech, comprising:
- generating in a computer storage medium a statistical model of audiovisual data over time, based on acoustic feature vectors and visual feature vectors from audiovisual data of an individual's articulators during speech;
  
  generating using a computer processor a visual feature vector sequence using the statistical model corresponding to an input set of acoustic feature vectors for speech with which the facial animation is to be synchronized;
  
  creating using a computer processor an image sample sequence from an image library using the generated visual feature vector sequence; and
  
  processing the image sample sequence to provide the photo-realistic facial animation synchronized with the speech.
  - 2. The computer-implemented method of claim 1, wherein generating the statistical model comprises:
    - obtaining audiovisual data including the individual's articulators for a set of utterances;
      
      extracting the acoustic feature vectors and the visual feature vectors for each sample of the audiovisual data; and
      
      training the statistical model using the acoustic feature vectors and the visual feature vectors.
  - 3. The computer-implemented method of claim 1, wherein generating the visual feature vector sequence comprises maximizing a likelihood function with respect to the input acoustic feature vectors and the statistical model.
  - 4. The computer-implemented method of claim 1, wherein creating the image sample sequence comprises determining a set of image samples that minimizes a cost function.
  - 5. The computer-implemented method of claim 4, wherein the cost function comprises a target cost indicative of a difference between a generated visual feature vector and a visual feature vector related to an image.
  - 6. The computer-implemented method of claim 5, wherein the cost function comprises a concatenation cost indicative of a difference between adjacent images in the image sample sequence.
  - 7. The computer-implemented method of claim 1, wherein creating an image sample sequence from an image library using the generated visual feature vector sequence comprises identifying a matching image sequence from the image library based on both a target cost and a concatenation cost.
  - 9. The computer system of claim 1, further comprising:
    - a training module having an input receiving acoustic feature vectors and visual feature vectors from audiovisual data of an individual's articulators during a set of utterances and providing as an output a statistical model of the audiovisual data over time.
  - 10. The computer system of claim 9, wherein the training module comprises:
    - a feature extraction module having an input for receiving the audiovisual data and providing an output including the acoustic feature vectors and the visual feature vectors corresponding to each sample of the audiovisual data; and
      
      a statistical model training module having an input for receiving the acoustic feature vectors and the visual feature vectors and providing as an output the statistical model.
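
Claims 4 to 7 describe image selection as minimizing a cost function built from a target cost and a concatenation cost. The following is a minimal sketch of one way to do that with dynamic programming (a Viterbi-style search), not the patented implementation. The function names, the use of squared Euclidean distance for both costs, and the `concat_weight` parameter are assumptions made for illustration; the claims do not specify them.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors (assumed cost measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_images(targets: List[Vector], library: List[Vector],
                  concat_weight: float = 1.0) -> List[int]:
    """Return one library index per target frame, minimizing the summed
    target cost plus weighted concatenation cost over the whole sequence."""
    if not targets or not library:
        return []
    n = len(library)
    # best[j]: lowest total cost of any path that ends at library image j
    best = [sq_dist(targets[0], v) for v in library]
    back: List[List[int]] = []  # back[t-1][j]: best predecessor of j at frame t
    for t in range(1, len(targets)):
        new_best: List[float] = []
        pointers: List[int] = []
        for j in range(n):
            # Concatenation cost: difference between adjacent selected images.
            i_min = min(
                range(n),
                key=lambda i: best[i] + concat_weight * sq_dist(library[i], library[j]),
            )
            transition = best[i_min] + concat_weight * sq_dist(library[i_min], library[j])
            # Target cost: difference between generated and stored visual features.
            new_best.append(transition + sq_dist(targets[t], library[j]))
            pointers.append(i_min)
        best = new_best
        back.append(pointers)
    # Trace the cheapest path backwards from the best final image.
    j = min(range(n), key=lambda k: best[k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path
```

With `concat_weight` set to zero the search reduces to picking the nearest library image for each frame independently; raising it trades target accuracy for smoother transitions between adjacent images.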

8. A computer system for generating photo-realistic facial animation with speech, comprising:
- a computer storage medium storing a statistical model of audiovisual data over time, based on acoustic feature vectors and visual feature vectors from audiovisual data of an individual's articulators during a set of utterances;
  
  a synthesis module having an input for receiving an input set of feature vectors for speech with which the facial animation is to be synchronized, and providing as an output a visual feature vector sequence corresponding to the input set of feature vectors according to the statistical model;
  
  an image selection module having an input for receiving the visual feature vector sequence and an output providing an image sample sequence from an image library corresponding to the visual feature vector sequence.
  - 11. The computer system of claim 8, wherein the synthesis module implements a maximum likelihood function with respect to the input acoustic feature vectors and the statistical model.
  - 12. The computer system of claim 8, wherein the image selection module implements a cost function and identifies a set of image samples that minimizes the cost function.
  - 13. The computer system of claim 12, wherein the cost function comprises a target cost indicative of a difference between a generated visual feature vector and a visual feature vector related to an image.
  - 14. The computer system of claim 13, wherein the cost function comprises a concatenation cost indicative of a difference between adjacent images in the image sample sequence.
  - 15. The computer system of claim 8, further comprising an image library, and wherein the image selection module accesses the image library using the generated visual feature vector sequence to identify a matching image sequence from the image library based on both a target cost and a concatenation cost.
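
Claim 11 (like claims 3 and 18) has the synthesis step maximize a likelihood function given the input acoustic feature vectors and the statistical model. As a deliberately simplified illustration, the sketch below swaps in a single joint Gaussian over one scalar acoustic feature and one scalar visual feature per frame; the claims only say "statistical model", so this choice, the function names, and the scalar features are all assumptions. For a joint Gaussian, the visual value that maximizes the conditional likelihood is the conditional mean.

```python
from typing import List, Tuple

# (mean_a, mean_v, var_a, cov_av): parameters of the assumed joint Gaussian
Model = Tuple[float, float, float, float]


def fit_joint_gaussian(acoustic: List[float], visual: List[float]) -> Model:
    """Estimate the joint Gaussian parameters from paired training frames."""
    n = len(acoustic)
    mean_a = sum(acoustic) / n
    mean_v = sum(visual) / n
    var_a = sum((a - mean_a) ** 2 for a in acoustic) / n
    cov_av = sum((a - mean_a) * (v - mean_v) for a, v in zip(acoustic, visual)) / n
    return mean_a, mean_v, var_a, cov_av


def most_likely_visual(model: Model, acoustic_input: List[float]) -> List[float]:
    """Return, per frame, the visual value maximizing p(visual | acoustic).

    For a joint Gaussian this is the conditional mean:
    mean_v + (cov_av / var_a) * (a - mean_a).
    """
    mean_a, mean_v, var_a, cov_av = model
    gain = cov_av / var_a  # assumes the training acoustic feature is not constant
    return [mean_v + gain * (a - mean_a) for a in acoustic_input]
```

A model over time, as the claims require, would also capture dependencies between frames; this per-frame sketch omits that and only shows the likelihood-maximization step in its simplest form.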

16. A computer program product comprising:
- a computer storage medium;
  
  computer program instructions stored on the computer storage medium that, when processed by a computing device, instruct the computing device to perform a method for generating photo-realistic facial animation with speech, comprising:
  
  generating in a computer storage medium a statistical model of audiovisual data over time, based on acoustic feature vectors and visual feature vectors from audiovisual data of an individual's articulators during speech;
  
  generating using a computer processor a visual feature vector sequence using the statistical model corresponding to an input set of acoustic feature vectors for speech with which the facial animation is to be synchronized;
  
  creating using a computer processor an image sample sequence from an image library using the generated visual feature vector sequence; and
  
  processing the image sample sequence to provide the photo-realistic facial animation synchronized with the speech.
  - 17. The computer program product of claim 16, wherein generating the statistical model comprises:
    - obtaining audiovisual data including the individual's articulators for a set of utterances;
      
      extracting the acoustic feature vectors and the visual feature vectors for each sample of the audiovisual data; and
      
      training the statistical model using the acoustic feature vectors and the visual feature vectors.
  - 18. The computer program product of claim 16, wherein generating the visual feature vector sequence comprises maximizing a likelihood function with respect to the input acoustic feature vectors and the statistical model.
  - 19. The computer program product of claim 16, wherein creating the image sample sequence comprises determining a set of image samples that minimizes a cost function.
  - 20. The computer program product of claim 19, wherein the cost function comprises a target cost indicative of a difference between a generated visual feature vector and a visual feature vector related to an image, and a concatenation cost indicative of a difference between adjacent images in the image sample sequence.
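
Claim 17 (like claim 2) calls for extracting an acoustic feature vector for each sample of the audiovisual data, so that acoustic and visual features line up frame by frame. The sketch below shows one hypothetical way to get one acoustic value per video frame: it cuts the audio into windows the length of a video frame and computes log energy for each. The choice of log energy, the function name, and the `sample_rate` and `fps` parameters are assumptions for illustration; the claims do not name a specific feature.

```python
import math
from typing import List


def frame_log_energy(samples: List[float], sample_rate: int, fps: int) -> List[float]:
    """Return one log-energy value per video frame's worth of audio samples."""
    hop = sample_rate // fps  # audio samples spanned by one video frame
    if hop <= 0:
        raise ValueError("sample_rate must be at least fps")
    features: List[float] = []
    # Step through the audio in non-overlapping, frame-length windows;
    # a trailing partial window is dropped.
    for start in range(0, len(samples) - hop + 1, hop):
        window = samples[start:start + hop]
        energy = sum(x * x for x in window) / hop
        # Small constant avoids log(0) on silent windows.
        features.append(math.log(energy + 1e-10))
    return features
```

Each returned value can then be paired with the visual feature vector of the video frame at the same index when training the statistical model.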

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Wang, Lijuan; Soong, Frank
Granted Patent: US 9,728,203 B2
US Class Current: 704/270
CPC Class Codes:
- G10L 2021/105 Synthesis of the lips movem...
- G10L 21/10 Transforming into visible i...
