Joint audio-video facial animation system
First Claim
1. A method comprising:
accessing audio data and video data at a client device, the audio data comprising a speech signal;
determining locations of a set of facial landmarks based on the video data;
identifying a user profile based on the locations of the set of facial landmarks, the user profile comprising a selection of a user avatar;
generating a weighted finite state transducer (WFST) based on at least the speech signal of the audio data;
performing a breadth-first search upon an output of the WFST;
determining a phone sequence based on the breadth-first search;
generating a first facial model based on the locations of the set of facial landmarks;
generating a second facial model based on the phone sequence;
constructing a composite facial model based on the first facial model, the second facial model, and the selection of the user avatar; and
causing display of the composite facial model at the client device.
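The compositing step above can be sketched as a simple linear blend of a landmark-driven facial model and a phone-driven one. Everything here — `model_from_landmarks`, the phone-to-viseme table, and the blend weight `alpha` — is a hypothetical illustration of one way the claimed step could work, not the patent's actual implementation.

```python
# Viseme/blendshape weights inferred from video landmarks (hypothetical geometry).
def model_from_landmarks(landmarks):
    # Mouth opening estimated from the vertical gap between lip landmarks.
    mouth_open = abs(landmarks["upper_lip"][1] - landmarks["lower_lip"][1]) / 10.0
    return {"jaw_open": mouth_open, "lip_round": 0.1}

# Viseme weights derived from a recognized phone (invented mapping).
PHONE_TO_VISEME = {
    "AA": {"jaw_open": 0.9, "lip_round": 0.2},
    "OW": {"jaw_open": 0.5, "lip_round": 0.8},
}

def model_from_phone(phone):
    return PHONE_TO_VISEME.get(phone, {"jaw_open": 0.0, "lip_round": 0.0})

def composite(video_model, audio_model, alpha=0.5):
    """Blend the two facial models per-blendshape (one possible compositing rule)."""
    keys = video_model.keys() | audio_model.keys()
    return {k: alpha * video_model.get(k, 0.0) + (1 - alpha) * audio_model.get(k, 0.0)
            for k in keys}

v = model_from_landmarks({"upper_lip": (0, 2.0), "lower_lip": (0, 8.0)})
a = model_from_phone("AA")
print(composite(v, a))  # blended weights, e.g. jaw_open = 0.5*0.6 + 0.5*0.9 = 0.75
```

The selected user avatar would then be rendered with the blended blendshape weights; the blend is shown as a fixed linear mix only for concreteness.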
Abstract
The present invention relates to a joint automatic audio-visual driven facial animation system. In some example embodiments, the system includes a full-scale, state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) engine with a strong language model for speech recognition, and obtains phoneme alignment from the word lattice.
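The claimed decoding steps — generating a WFST from the speech signal, performing a breadth-first search over its output, and determining a phone sequence — can be illustrated with a toy acyclic transducer. The states, phone labels, and arc weights below are invented for the sketch; a real LVCSR word lattice would be far larger and would typically be searched with dedicated shortest-path machinery rather than plain BFS.

```python
from collections import deque

# Minimal acyclic WFST output graph: state -> [(next_state, output_phone, weight)].
# Labels and weights are illustrative, not from the patent.
WFST = {
    0: [(1, "HH", 0.4), (1, "AH", 0.9)],
    1: [(2, "EH", 0.3)],
    2: [(3, "L", 0.2), (3, "OW", 0.6)],
    3: [],
}
START, FINAL = 0, 3

def best_phone_sequence(fst, start, final):
    """Breadth-first search over the transducer's output arcs, returning the
    minimum-weight phone sequence that reaches the final state."""
    best_cost, best_phones = float("inf"), []
    queue = deque([(start, 0.0, [])])
    while queue:
        state, cost, phones = queue.popleft()
        if state == final and cost < best_cost:
            best_cost, best_phones = cost, phones
        for nxt, phone, weight in fst.get(state, []):
            queue.append((nxt, cost + weight, phones + [phone]))
    return best_phones

print(best_phone_sequence(WFST, START, FINAL))  # ['HH', 'EH', 'L']
```

Exhaustive BFS is tractable here only because the toy lattice is tiny and acyclic; it is meant to make the claim language concrete, not to model production decoding.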
14 Claims
1. A method comprising:
accessing audio data and video data at a client device, the audio data comprising a speech signal;
determining locations of a set of facial landmarks based on the video data;
identifying a user profile based on the locations of the set of facial landmarks, the user profile comprising a selection of a user avatar;
generating a weighted finite state transducer (WFST) based on at least the speech signal of the audio data;
performing a breadth-first search upon an output of the WFST;
determining a phone sequence based on the breadth-first search;
generating a first facial model based on the locations of the set of facial landmarks;
generating a second facial model based on the phone sequence;
constructing a composite facial model based on the first facial model, the second facial model, and the selection of the user avatar; and
causing display of the composite facial model at the client device.
Dependent claims: 2, 3, 4, 5.
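The "identifying a user profile" step could, under one simple reading, be nearest-neighbor matching of the observed landmark geometry against enrolled landmark signatures. The profile table, signature vectors, and avatar names below are entirely hypothetical; the patent does not specify this matching scheme.

```python
import math

# Hypothetical enrolled profiles: user -> avatar selection and landmark signature.
PROFILES = {
    "alice": {"avatar": "fox",  "signature": [0.32, 0.58, 0.11]},
    "bob":   {"avatar": "bear", "signature": [0.45, 0.40, 0.25]},
}

def identify_profile(landmark_signature, profiles=PROFILES):
    """Match an observed landmark-derived feature vector to the closest
    enrolled profile (one simple realization of the identification step)."""
    name = min(profiles,
               key=lambda u: math.dist(landmark_signature, profiles[u]["signature"]))
    return name, profiles[name]["avatar"]

print(identify_profile([0.30, 0.60, 0.10]))  # ('alice', 'fox')
```

The returned avatar selection is what the later compositing step would consume when constructing the composite facial model.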
6. A system comprising:
a memory; and
at least one hardware processor coupled to the memory and comprising instructions that cause the system to perform operations comprising:
accessing audio data and video data at a client device, the audio data comprising a speech signal;
determining locations of a set of facial landmarks based on the video data;
identifying a user profile based on the locations of the set of facial landmarks, the user profile comprising a selection of a user avatar;
generating a weighted finite state transducer (WFST) based on at least the speech signal of the audio data;
performing a breadth-first search upon an output of the WFST;
determining a phone sequence based on the breadth-first search;
generating a first facial model based on the locations of the set of facial landmarks;
generating a second facial model based on the phone sequence;
constructing a composite facial model based on the first facial model, the second facial model, and the selection of the user avatar; and
causing display of the composite facial model at the client device.
Dependent claims: 7, 8, 9, 10.
11. A non-transitory machine-readable storage medium comprising instructions that, when executed by one or more processors of a machine, cause the machine to perform operations comprising:
accessing audio data and video data at a client device, the audio data comprising a speech signal;
determining locations of a set of facial landmarks based on the video data;
identifying a user profile based on the locations of the set of facial landmarks, the user profile comprising a selection of a user avatar;
generating a weighted finite state transducer (WFST) based on at least the speech signal of the audio data;
performing a breadth-first search upon an output of the WFST;
determining a phone sequence based on the breadth-first search;
generating a first facial model based on the locations of the set of facial landmarks;
generating a second facial model based on the phone sequence;
constructing a composite facial model based on the first facial model, the second facial model, and the selection of the user avatar; and
causing display of the composite facial model at the client device.
Dependent claims: 12, 13, 14.
Specification