Techniques for providing audio and video effects

US 10,861,210 B2
Filed: 07/11/2018
Issued: 12/08/2020
Est. Priority Date: 05/16/2017
Status: Active Grant

1 Assignment

0 Petitions

Abstract

Embodiments of the present disclosure can provide systems, methods, and computer-readable medium for providing audio and/or video effects based at least in part on facial features and/or voice feature characteristics of the user. For example, video and/or an audio signal of the user may be recorded by a device. Voice audio features and facial feature characteristics may be extracted from the voice audio signal and the video, respectively. The facial features of the user may be used to modify features of a virtual avatar to emulate the facial feature characteristics of the user. The extracted voice audio features may be modified to generate an adjusted audio signal, or an audio signal may be composed from the voice audio features. The adjusted/composed audio signal may simulate the voice of the virtual avatar. A preview of the modified video/audio may be provided at the user's device.

65 Citations

20 Claims

1. A method, comprising:
- at an electronic device having at least a camera and a microphone:
  
  displaying a virtual avatar generation interface;
  
  displaying first preview content of a virtual avatar in the virtual avatar generation interface, the first preview content of the virtual avatar corresponding to real-time preview video frames of a user headshot in a field of view of the camera and associated headshot changes in an appearance;
  
  while displaying the first preview content of the virtual avatar, detecting an input in the virtual avatar generation interface;
  
  in response to detecting the input in the virtual avatar generation interface:
  
  capturing, via the camera, a video signal associated with the user headshot during a recording session;
  
  capturing, via the microphone, a voice audio signal during the recording session; and
  
  in response to detecting expiration of the recording session:
  
  transforming the voice audio signal into a first set of voice audio features, the first set of voice audio features including at least one speech formant of the voice audio signal;
  
  identifying a feature set of a predetermined voice audio signal associated with the virtual avatar;
  
  generating a second set of voice audio features based at least in part on the first set of voice audio features and the feature set of the predetermined voice audio signal associated with the virtual avatar, the second set of voice audio features including a modified version of the at least one speech formant of the voice audio signal; and
  
  composing a modified voice audio signal based at least in part on the second set of voice audio features;
  
  generating second preview content of the virtual avatar in the virtual avatar generation interface according to the video signal and the modified voice audio signal; and
  
  presenting the second preview content in the virtual avatar generation interface.
- View Dependent Claims (2, 3, 4)
- - 2. The method of claim 1, wherein the first set of voice audio features includes an envelope and fine structure of the voice audio signal, the envelope representing a magnitude of the voice audio signal over time, the fine structure include at least one of a frequency or a phase of the voice audio signal.
  - 3. The method of claim 1, wherein transforming the voice audio signal into the first set of voice audio features includes utilizing a short-term Fourier transform.
  - 4. The method of claim 3, wherein composing the modified voice audio signal includes utilizing an inverse short-term Fourier transform.
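Claims 2 through 4 can be illustrated with a minimal sketch. It assumes a Hann-windowed short-term Fourier transform with 50% overlap and a naive DFT, none of which the claims specify. The function names (`stft`, `istft`, `hann`) and the frame sizes are hypothetical. Each frame is split into a magnitude part and a phase part, which is one possible reading of the envelope and fine structure language, and the inverse transform recombines them by overlap-add.

```python
import cmath
import math

def hann(n):
    # Periodic Hann window; at 50% overlap the shifted windows sum to 1.
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def dft(frame):
    # Naive O(n^2) discrete Fourier transform, adequate for short frames.
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    # Inverse DFT, returning the real part of each output sample.
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def stft(signal, frame_len=16, hop=8):
    # Returns one (magnitudes, phases) pair per windowed frame.
    win = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * win[i] for i in range(frame_len)]
        spec = dft(frame)
        frames.append(([abs(c) for c in spec], [cmath.phase(c) for c in spec]))
    return frames

def istft(frames, frame_len=16, hop=8):
    # Rebuild complex spectra from magnitude and phase, then overlap-add.
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for idx, (mags, phases) in enumerate(frames):
        spec = [cmath.rect(m, p) for m, p in zip(mags, phases)]
        for i, v in enumerate(idft(spec)):
            out[idx * hop + i] += v
    return out

signal = [math.sin(2 * math.pi * 3 * t / 64) for t in range(64)]
frames = stft(signal)
rebuilt = istft(frames)
```

With unmodified features, `rebuilt` matches `signal` on every sample covered by two overlapping frames; the first and last half-frames are attenuated by the window. A voice effect would edit the magnitudes or phases between the two calls.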

5. An electronic device, comprising:
- a speaker;
  
  a camera;
  
  a microphone; and
  
  one or more processors in communication with the speaker, the camera, and the microphone, the one or more processors configured to:
  
  display a virtual avatar generation interface;
  
  display first preview content of a virtual avatar in the virtual avatar generation interface, the first preview content of the virtual avatar corresponding to real-time preview video frames of a user headshot in a field of view of the camera and associated headshot changes in an appearance;
  
  while displaying the first preview content of the virtual avatar, detect an input in the virtual avatar generation interface;
  
  in response to detecting the input in the virtual avatar generation interface:
  
  capture, via the camera, a video signal associated with the user headshot during a recording session;
  
  capture, utilizing the microphone, a voice audio signal during the recording; and
  
  in response to detecting expiration of the recording session:
  
  transform the voice audio signal into a first set of voice audio features, the first set of voice audio features including a formant of the voice audio signal;
  
  identify a feature set of a predetermined voice audio signal associated with a virtual avatar;
  
  generate a second set of voice audio features based at least in part on the first set of voice audio features and the feature set of the predetermined voice audio signal associated with the virtual avatar; and
  
  compose a modified voice audio signal according to the second set of voice audio features;
  
  generate second preview content of the virtual avatar in the virtual avatar generation interface according to the video signal and the modified voice audio signal; and
  
  present the second preview content in the virtual avatar generation interface.
- View Dependent Claims (6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
- - 6. The electronic device of claim 5, wherein the feature set of the predetermined voice audio signal is based at least in part on a type of the virtual avatar.
  - 7. The electronic device of claim 6, wherein the type of the virtual avatar is received based at least in part a user selection of an avatar type selection option presented on a user interface of the electronic device.
  - 8. The electronic device of claim 5, wherein the first set of voice audio features includes a formant of the voice audio signal, and wherein the second set of voice audio features is generated based at least in part on shifting the formant of the first set of voice audio features.
  - 9. The electronic device of claim 5, wherein the second set of voice audio features generated modify the voice audio signal to simulate the predetermined voice audio signal associated with the virtual avatar.
  - 10. The electronic device of claim 5, wherein the first set of voice audio features including an envelope and a fine structure of the voice audio signal, the envelope representing a magnitude of the voice audio signal over time, the fine structure representing at least one of a frequency or a phase of the voice audio signal.
  - 11. The electronic device of claim 10, wherein the second set of voice audio features are generated based at least in part on modifying the phase of the voice audio signal, and wherein modifying the phase of the voice audio signal causes the modified voice audio signal composed from the second set of voice audio features to simulate the predetermined voice audio signal associated with the virtual avatar.
  - 12. The electronic device of claim 10, wherein the second set of voice audio features are generated based at least in part on modifying the magnitude and the phase of the voice audio signal according to the feature set of the predetermined voice audio signal associated with the virtual avatar.
  - 13. The electronic device of claim 5, wherein the one or more processors are further configured to:
    - generate a machine-learning model from past signal modifications associated with individually modifying a plurality of voice audio signals associated with a plurality of users to substantially match the predetermined voice audio signal associated with the virtual avatar, the machine-learning model being configured to receive a voice audio signal feature set as input and produce a resultant voice audio signal feature set as output;
      
      provide, to the machine-learning model, the first set of voice audio features associated with the voice audio signal corresponding to a user; and
      
      obtain, from the machine-learning model, the second set of voice audio features, wherein the modified voice audio signal composed from the second set of voice audio features causes the voice audio signal of the user to be substantially match a vocal signal associated with the virtual avatar.
  - 14. The electronic device of claim 13, wherein the one or more processors are further configured to:
    - extract facial feature characteristics associated with the face from the video signal; and
      
      generate adjusted facial metadata based at least in part on the facial feature characteristics and the modified voice audio signal.
  - 15. The electronic device of claim 14, wherein the modified voice audio signal is presented with a visual representation of the virtual avatar in the second preview content of the virtual avatar generation interface, the visual representation of the virtual avatar being presented based at least in part on the adjusted facial metadata.
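The formant shifting recited in claim 8 can be sketched for a single magnitude spectrum. The claims do not say how the envelope is estimated or how the shift is applied, so everything here is an assumption: a moving average stands in for the envelope estimator, the fine structure is taken as the ratio of magnitude to envelope, and the shift is a linear-interpolation warp of the envelope along the frequency axis. The names `smooth_envelope`, `warp`, and `shift_formants` are hypothetical.

```python
import math

def smooth_envelope(mags, radius=2):
    # Moving average over neighboring bins; a stand-in for any envelope estimator.
    n = len(mags)
    env = []
    for k in range(n):
        lo, hi = max(0, k - radius), min(n, k + radius + 1)
        env.append(sum(mags[lo:hi]) / (hi - lo))
    return env

def warp(env, factor):
    # Resample the envelope so a peak at bin k lands near bin k * factor.
    n = len(env)
    out = []
    for k in range(n):
        src = k / factor
        i = int(src)
        if i >= n - 1:
            out.append(env[-1])  # past the last bin: hold the final value
        else:
            frac = src - i
            out.append((1 - frac) * env[i] + frac * env[i + 1])
    return out

def shift_formants(mags, factor, radius=2, eps=1e-12):
    # Separate envelope from fine structure, move the envelope, recombine.
    env = smooth_envelope(mags, radius)
    fine = [m / (e + eps) for m, e in zip(mags, env)]
    return [f * e for f, e in zip(fine, warp(env, factor))]

# Synthetic half-spectrum with a single formant-like peak at bin 10.
mags = [math.exp(-((k - 10) / 3) ** 2) + 0.01 for k in range(33)]
env = smooth_envelope(mags)
warped = warp(env, 1.5)
peak_before = max(range(len(env)), key=env.__getitem__)
peak_after = max(range(len(warped)), key=warped.__getitem__)
unchanged = shift_formants(mags, 1.0)
```

A factor above 1 moves envelope peaks toward higher bins and a factor below 1 moves them lower, while the fine structure that carries the harmonics stays in place. A factor of 1 leaves the spectrum essentially unchanged.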

16. A non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by one or more processors, configure the one or more processors to perform operations comprising:
- displaying a virtual avatar generation interface;
  
  receiving, at the virtual avatar generation interface, a selection associated with a virtual avatar, the virtual avatar being associated with particular vocal characteristics;
  
  displaying first preview content of the virtual avatar in the virtual avatar generation interface, the first preview content of the virtual avatar corresponding to real-time preview video frames of a user headshot in a field of view of the camera and associated headshot changes in an appearance;
  
  while displaying the first preview content of the virtual avatar, detecting an input in the virtual avatar generation interface;
  
  in response to detecting the input in the virtual avatar generation interface:
  
  capturing, utilizing a camera, a video signal associated with the user headshot during a recording session;
  
  capturing, utilizing a microphone and the virtual avatar generation interface, a voice audio signal during the recording session; and
  
  in response to detecting expiration of the recording session:
  
  transforming the voice audio signal of the user into a first set of voice audio features, the first set of voice audio features including at least one of speech formant of the voice audio signal;
  
  generating a second set of voice audio features based at least in part on the first set of voice audio features and the particular vocal characteristics associated with the virtual avatar; and
  
  composing a modified voice audio signal according to the second set of voice audio features;
  
  generating second preview content of the virtual avatar in the virtual avatar generation interface according to the video signal and the modified voice audio signal; and
  
  presenting the second preview content in the virtual avatar generation interface.
- View Dependent Claims (17, 18, 19, 20)
- - 17. The non-transitory computer-readable storage medium of claim 16, wherein the second set of voice audio features are generating based at least in part on replacing the phase with a predetermined phase associated with the virtual avatar.
  - 18. The non-transitory computer-readable storage medium of claim 16, wherein transforming the voice audio signal of the user into a first set of signal features utilizes a short-term Fourier transform of the first set of signal features, and wherein composing the modified voice audio signal according to the second set of voice audio features utilizes an inverse short-term Fourier transform of the second set of voice audio features.
  - 19. The non-transitory computer-readable storage medium of claim 18, wherein the one or more processors are further configured to perform operations comprising:
    - identifying a formant of the voice audio signal based at least in part on the envelope; and
      
      modifying the formant according to a window function, wherein modifying the formant according to the window function causes the formant to widen or contract.
  - 20. The non-transitory computer-readable storage medium of claim 16, wherein the one or more processors are further configured to perform operations comprising:
    - extracting facial feature characteristics associated with the face from the video signal;
      
      generating adjusted facial metadata based at least in part on the facial feature characteristics and the modified voice audio signal; and
      
      presenting, with the modified voice audio signal, a visual representation of the virtual avatar in the second preview content of the virtual avatar generation interface according to the adjusted facial metadata.
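Claim 19 recites modifying a formant according to a window function so that the formant widens or contracts. The sketch below shows only the widening case, under the assumption that the window is applied by convolving the spectral envelope with a normalized Hann-shaped kernel. The claims do not state this, and the names `hann_kernel`, `widen`, and `half_max_width` are hypothetical.

```python
import math

def hann_kernel(width):
    # Symmetric Hann-shaped kernel with `width` nonzero taps, normalized to sum to 1.
    taps = [0.5 - 0.5 * math.cos(2 * math.pi * (i + 1) / (width + 1))
            for i in range(width)]
    total = sum(taps)
    return [t / total for t in taps]

def widen(env, width=5):
    # Convolve the envelope with the kernel, clamping indices at the edges.
    kernel = hann_kernel(width)
    half = width // 2
    n = len(env)
    out = []
    for k in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(k + j - half, 0), n - 1)
            acc += w * env[idx]
        out.append(acc)
    return out

def half_max_width(env):
    # Number of bins at or above half of the peak value.
    peak = max(env)
    return sum(1 for v in env if v >= peak / 2)

# Synthetic envelope with one narrow formant-like peak at bin 16.
env = [math.exp(-((k - 16) / 2) ** 2) for k in range(33)]
wide = widen(env, 9)
width_before = half_max_width(env)
width_after = half_max_width(wide)
```

A longer kernel spreads the peak over more bins, so `width_after` exceeds `width_before`. Contracting a formant would need a different operation, which this sketch does not attempt.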

Current Assignee: Apple Inc.
Original Assignee: Apple Inc.
Inventors: Avendano, Carlos M.; Ramprashad, Sean A.
Primary Examiner(s): Sharma, Neeraj
Application Number: US16/033,111
Publication Number: US 20180336713A1
Time in Patent Office: 881 Days
Field of Search: None
US Class Current:
CPC Class Codes:
  G06T 13/205   driven by audio data
  G06T 13/40   of characters, e.g. humans,...
  G10L 2021/0135   Voice conversion or morphing
  G10L 21/003   Changing voice quality, e.g...
  G10L 21/013   Adapting to target pitch
