Methods and devices for producing and using synthetic visual speech based on natural coarticulation
Abstract
A method of producing synthetic visual speech according to this invention includes receiving an input containing speech information. One or more visemes that correspond to the speech input are then identified. Next, the weights of those visemes are calculated using a coarticulation engine including viseme deformability information. Finally, a synthetic visual speech output is produced based on the visemes' weights over time (or tracks). The synthetic visual speech output is combined with a synchronized audio output corresponding to the input to produce a multimedia output containing a 3D lipsyncing animation.
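As a rough orientation, the pipeline the abstract describes (input, phonemes, visemes, weights over time, morphed output) can be sketched as below. Every mapping, value, and function name here is an invented placeholder for illustration, not the patent's actual implementation; real systems map dozens of phonemes onto viseme target models with full 3D geometry.

```python
# Minimal, self-contained sketch of the abstract's pipeline: phonemes ->
# visemes -> per-frame weights -> blended output. All tables, names, and
# formulas below are illustrative assumptions, not the patent's disclosure.

# Toy phoneme-to-viseme map (real systems map ~40 phonemes onto ~10-20 visemes).
PHONEME_TO_VISEME = {"HH": "ah", "EH": "eh", "L": "l", "OW": "oh"}

# Toy "target models": a single mouth-openness value per viseme instead of
# full 3D geometry.
VISEME_TARGETS = {"ah": 0.9, "eh": 0.6, "l": 0.3, "oh": 0.8}

def viseme_weights(visemes, overlap=0.5):
    """One full-influence frame per viseme, plus a blended transition frame
    between neighbours to mimic coarticulation."""
    frames = []
    for i, v in enumerate(visemes):
        frames.append({v: 1.0})
        if i + 1 < len(visemes):
            frames.append({v: overlap, visemes[i + 1]: 1.0 - overlap})
    return frames

def blend(frame_weights):
    """Morph between target models as a normalised weighted average."""
    total = sum(frame_weights.values())
    return sum(VISEME_TARGETS[v] * w for v, w in frame_weights.items()) / total

phonemes = ["HH", "EH", "L", "OW"]            # stand-in phoneme sequence
visemes = [PHONEME_TO_VISEME[p] for p in phonemes]
animation = [blend(f) for f in viseme_weights(visemes)]  # one value per frame
```

The synchronized audio track and the 3D rendering step of the abstract are omitted here; the sketch only shows how per-frame viseme weights turn into a blended output sequence.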
324 Citations
54 Claims
1. A method of producing synthetic visual speech, comprising:
receiving an input including speech information;
identifying visemes corresponding to the input;
calculating a weight of each of the visemes corresponding to the input using a coarticulation engine, wherein the coarticulation engine comprises viseme deformability information, and wherein each of the viseme weights corresponds to an amount of influence that the viseme has over other visemes active at a specified time; and
producing a synthetic visual speech output based on the weights of the visemes corresponding to the input. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
2. A method according to claim 1, wherein producing a synthetic visual speech output comprises:
retrieving a target model for each of the visemes identified; and
morphing between the target models for the visemes using the weights of the visemes.
3. A method according to claim 2, wherein morphing between target models comprises base point mixing multiple target models using a morphing engine.
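The "base point mixing" of claim 3 can be read as blending multiple target models by taking a weighted average of their vertex positions. The sketch below is a hedged illustration under that reading; the toy two-vertex "mouth" geometry and the weights are invented, and a real morphing engine would operate on full 3D meshes.

```python
# Hedged sketch of base point mixing: blend target models (each an
# (n_vertices, 3) vertex array) by a weighted average of vertex positions.
import numpy as np

def base_point_mix(targets, weights):
    """Blend target models by their weights.

    Weights are normalised so the blended model stays inside the convex
    hull of the targets."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * t for wi, t in zip(w, targets))

# Two toy 2-vertex "mouth" models: one closed, one open (invented geometry).
closed_mouth = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
open_mouth   = np.array([[0.0, -0.5, 0.0], [1.0, 0.5, 0.0]])

# A 75% weight on the open target pulls the vertices 75% of the way open.
blended = base_point_mix([closed_mouth, open_mouth], weights=[0.25, 0.75])
```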
4. A method according to claim 1, wherein the coarticulation engine determines the weight of each viseme based on a variety of factors including viseme deformability, phoneme duration, and speech context.
5. A method according to claim 1, wherein the viseme deformability information comprises a strength and a deformability percentage value.
6. A method according to claim 1, wherein the input comprises a text input, and wherein identifying visemes corresponding to the input comprises:
identifying phonemes corresponding to the text input using a phoneme neural network; and
identifying visemes that correspond to the phonemes.
7. A method according to claim 6, wherein the input further comprises a voice input, and wherein using a coarticulation engine to calculate a weight of each of the visemes comprises:
forcing an alignment between the text input and the voice input to determine phoneme duration; and
inputting phoneme duration and context information into a coarticulation algorithm to determine viseme weights.
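Claims 4 through 7 name the inputs to the coarticulation algorithm (viseme deformability, phoneme duration from forced alignment, and speech context) but not the arithmetic that combines them. The following is one plausible weight rule, offered purely as a hedged sketch: a rigid viseme (e.g. the bilabial closure for /p/ or /b/) keeps its full influence, a highly deformable viseme cedes influence to rigid neighbours, and very short phonemes never reach full weight. The table values and the formula are assumptions.

```python
# Hedged sketch of a coarticulation weight rule. The deformability table
# and the formula are invented for illustration; the patent only names the
# inputs (deformability, duration, context), not this arithmetic.

# Toy deformability table: 0.0 = rigid, 1.0 = fully deformable by neighbours.
DEFORMABILITY = {"p": 0.1, "ah": 0.8, "s": 0.4}

def peak_weight(viseme, duration_ms, neighbour_deformability,
                full_duration_ms=120.0):
    """Peak weight of a viseme.

    Short phonemes are scaled down linearly; a deformable viseme gives up
    weight in proportion to how rigid its neighbourhood is."""
    duration_factor = min(1.0, duration_ms / full_duration_ms)
    ceded = DEFORMABILITY[viseme] * (1.0 - neighbour_deformability) * 0.5
    return duration_factor * (1.0 - ceded)

# A long, rigid /p/ next to deformable context keeps nearly full weight;
# a short, deformable /ah/ next to rigid context is heavily reduced.
w_p = peak_weight("p", 150, neighbour_deformability=0.8)
w_ah = peak_weight("ah", 60, neighbour_deformability=0.1)
```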
8. A method according to claim 1, wherein receiving an input comprises receiving a voice-only input, and wherein identifying visemes corresponding to the input comprises:
running the voice-only input through a speech recognition routine to determine probable phonemes of the input; and
identifying the visemes that correspond to the probable phonemes.
9. A method according to claim 8, wherein the coarticulation engine comprises a neural network.
10. A method according to claim 8, wherein the synthetic visual speech output is produced substantially simultaneously with the input.
11. A method of generating synthetic visual speech, comprising:
receiving a voice input including speech information;
classifying the voice input into phonemes using a phoneme neural network;
identifying a viseme corresponding to each of the phonemes from the phoneme neural network;
calculating a viseme track for each of the visemes using a viseme neural network, wherein the viseme track comprises a sequence of viseme weights over time, and wherein each viseme weight represents an amount of influence of the viseme over other visemes active at a given time;
morphing between target models of the visemes according to their tracks by producing a series of blended models; and
rendering the series of blended models sequentially to produce a visual speech animation. - View Dependent Claims (12, 13)
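Claim 11 defines a "viseme track" as a sequence of viseme weights over time. One simple way to picture such a track is a rise-hold-fall envelope around the viseme's active interval; in the claim the tracks come from a viseme neural network, so the linear ramp below is only a hedged stand-in with invented parameters.

```python
# Hedged sketch of a viseme track: weights over n_frames that ramp up at
# the start of the viseme's interval and down at its end. The envelope
# shape and ramp length are assumptions, not the patent's network output.

def viseme_track(start, end, n_frames, ramp=2):
    """Weights over n_frames: 0.0 outside [start, end), linear ramps of
    `ramp` frames at the boundaries, 1.0 in between."""
    track = []
    for f in range(n_frames):
        if f < start or f >= end:
            track.append(0.0)
        else:
            rise = min(1.0, (f - start + 1) / ramp)
            fall = min(1.0, (end - f) / ramp)
            track.append(min(rise, fall))
    return track

# A viseme active over frames 2-5 of an 8-frame clip ramps in and out.
track = viseme_track(2, 6, 8)
```

Morphing then evaluates every active track at each frame and blends the corresponding target models with those weights.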
14. A computer readable medium storing computer code comprising:
instructions for receiving an input including speech information;
instructions for identifying visemes corresponding to the input;
instructions for calculating a weight of each of the visemes corresponding to the input using deformability information, wherein each of the viseme weights corresponds to an amount of influence that the viseme has over other visemes active at a given time; and
instructions for producing a synthetic visual speech output based on the weights of the visemes corresponding to the input. - View Dependent Claims (15, 16, 17)
18. A system for producing synthetic visual speech, comprising:
a receiver to receive an input representing a speech segment;
a first neural network to classify the speech segment according to its phonetic components;
a coarticulation engine comprising deformability information to determine viseme tracks corresponding to the phonetic components of the speech input, wherein the viseme tracks each comprise a sequence of viseme weights over time, and wherein each viseme weight represents an amount of influence of the viseme over other visemes active at a given time; and
a morphing engine for morphing between viseme models based on the viseme tracks to enable a realistic synthetic visual speech output corresponding to the speech segment. - View Dependent Claims (19, 20)
21. A coarticulation engine for calculating viseme tracks comprising:
a coarticulation algorithm configured to receive data inputs corresponding to a plurality of visemes;
said data inputs representing a context, and a duration of each of the visemes; and
said coarticulation engine further configured to produce data outputs comprising a weight for each of the visemes using deformability information, wherein each viseme weight represents an amount of influence of the viseme over other visemes active at a given time. - View Dependent Claims (22)
23. A method for generating a user-customizable three-dimensional lipsyncing greeting card, comprising:
receiving a user-defined input containing speech information;
converting the input into a customized electronic greeting card comprising a three-dimensional visual speech animation comprising a lipsyncing character synchronized with an audio output corresponding to the input; and
delivering the customized electronic greeting card to a recipient identified by the user. - View Dependent Claims (24, 25, 26)
24. A method according to claim 23, further comprising one or more of:
selecting one of a plurality of three-dimensional characters to be the lipsyncing character;
texture mapping an image onto a three-dimensional character model to produce a personalized character to be the lipsyncing character;
supplying a background image for the animation;
enabling one or more auto-expressions for the lipsyncing character to provide realistic non-speech movements;
selecting one or more emotions for the lipsyncing character to convey emotional content through visual expressions;
selecting a singing voice for the audio output; and
selecting voice characteristics for the audio output.
25. A method according to claim 23, wherein delivering the customized electronic greeting card to a recipient identified by the user comprises:
sending an electronic mail or on-line delivery notification to the recipient; and
making the customized electronic greeting card available for download by the recipient at an internet site.
26. A method according to claim 25, wherein the customized electronic greeting card is made available to the recipient in either a movie format, a streaming media format, or a real-time rendering format.
27. A method of providing an electronic greeting card featuring a three-dimensional lipsyncing character, comprising:
providing an Internet site;
allowing a user to supply an input containing speech information to the internet site;
converting the user-supplied input into an electronic greeting card comprising a three-dimensional lipsyncing character animated in synchronism with an audio output corresponding to the input; and
delivering the electronic greeting card to a recipient specified by the user. - View Dependent Claims (28, 29)
28. A method according to claim 27, wherein delivering the electronic greeting card to a recipient specified by the user comprises:
sending an online or email notification to the recipient; and
allowing the recipient to download the electronic greeting card from the internet site.
29. A method according to claim 27, wherein delivering the electronic greeting card to a recipient comprises:
sending an email containing the electronic greeting card as an attachment to the recipient at an address specified by the user.
30. A method for producing a computer animated lipsyncing, comprising:
providing a voice input into a first neural network to produce a phoneme output;
providing the phoneme output from the first neural network to a second neural network to produce a viseme track output, wherein the viseme track output comprises a sequence of viseme weights over time, and wherein each viseme weight represents an amount of influence of the viseme over other visemes active at a given time; and
using the viseme track output to generate an animated three-dimensional lipsyncing image in real-time in substantial synchronism with an audio speech output corresponding to the voice input. - View Dependent Claims (31, 32, 33, 34, 35)
31. A method according to claim 30, further comprising:
filtering the viseme track output to produce a filtered and smoothed viseme track output.
35. A method according to claim 30, wherein the three-dimensional lipsyncing image and audio speech output are produced substantially simultaneously with the voice input.
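The filtering step recited in claim 31 could be as simple as a centred moving average over each raw viseme track, which keeps the mouth from jittering between frames. The sketch below uses that filter purely as an assumption; a real system might instead apply a low-pass or spline filter, and the window size here is invented.

```python
# Hedged sketch of smoothing a raw viseme track (claim 31) with a centred
# moving average. Filter choice and window size are assumptions.

def smooth_track(track, window=3):
    """Centred moving average; edge frames average over whatever
    neighbours exist within the window."""
    half = window // 2
    out = []
    for i in range(len(track)):
        lo, hi = max(0, i - half), min(len(track), i + half + 1)
        out.append(sum(track[lo:hi]) / (hi - lo))
    return out

# A jittery raw track is pulled toward its local mean.
smoothed = smooth_track([0.0, 1.0, 0.0, 1.0, 0.0])
```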
36. An apparatus for producing a lipsyncing animation, comprising:
a frame processor to identify frames of a voice input;
a first neural network to receive the frames of the voice input and to identify a probable phoneme corresponding to each of the frames;
a second neural network to receive the probable phonemes and identify viseme weights for one or more visemes active during each of the frames, wherein a viseme weight represents an amount of influence of a corresponding viseme over other visemes active during that frame; and
a rendering engine to render a three-dimensional lipsyncing animation based on the viseme weights in substantial synchronization with an audio output corresponding to the voice input. - View Dependent Claims (37, 38, 39, 40)
41. A method for producing a synthesized visual communication over a network comprising:
receiving an input containing speech information into a first networked device;
converting the input into phonetic speech components using a phoneme neural network;
converting the phonetic speech components into weighted visual speech information, wherein the weighted visual speech information comprises information representing an amount of influence of a visual speech component over other visual speech components active at a given time;
producing a lipsyncing animation based on the weighted visual speech information; and
displaying the lipsyncing animation in substantial synchronism with an audibilization of a voice output corresponding to the input through a second networked device. - View Dependent Claims (42, 43, 44, 45)
42. A method according to claim 41, further comprising:
receiving a second input containing speech information into the second networked device to be converted into synthetic visual speech to be displayed using the first networked device.
46. A method for providing a real-time synthetic communication comprising:
providing inputs containing speech information into a first one or more of a plurality of devices;
converting the inputs into viseme tracks, wherein the viseme tracks each comprise a sequence of viseme weights over time, and wherein each viseme weight represents an amount of influence of the viseme over other visemes active at a given time;
producing a communication comprising a synthesized visual speech animation for each of the inputs based on the viseme tracks, said communication further comprising an audio output corresponding to the input; and
outputting the communication through a second one or more of the devices. - View Dependent Claims (47, 48, 49, 50)
51. An email reader comprising:
a phoneme neural classifier for converting an email text or an audio attachment into its constituent plurality of phonemes;
a coarticulation engine to determine a weight of each of a plurality of visemes associated with each of the phonemes, wherein each viseme weight represents an amount of influence of the corresponding viseme over other visemes active at a given time;
a morphing engine for morphing between target viseme models based on viseme weights;
a text-to-audio speech synthesizer for synthesizing an audio voice output based on the phonemes from the email text; and
a rendering engine for rendering an email lipsyncing animation based on data from the morphing engine. - View Dependent Claims (52, 53, 54)
52. An email reader according to claim 51, further comprising:
an output formatter for combining the animation and the synthesized audio voice output into a multimedia output.
53. An email reader according to claim 51, further comprising:
user-customization options to allow a user to select a lipsyncing character for the animation and a voice-type for the voice output.
54. An email reader according to claim 53, wherein the user-customization options are configured to allow independent selection of the character and voice type for each of a plurality of email senders.
Specification