Methods and devices for producing and using synthetic visual speech based on natural coarticulation

US 6,539,354 B1
Filed: 03/24/2000
Issued: 03/25/2003
Est. Priority Date: 03/24/2000
Status: Expired due to Term

First Claim

Patent Images

1. A method of producing synthetic visual speech, comprising:

receiving an input including speech information;

identifying visemes corresponding to the input;

calculating a weight of each of the visemes corresponding to the input using a coarticulation engine, wherein the coarticulation engine comprises viseme deformability information, and wherein each of the viseme weights corresponds to an amount of influence that the viseme has over other visemes active at a specified time; and

producing a synthetic visual speech output based on the weights of the visemes corresponding to the input.

View all claims

3 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A method of producing synthetic visual speech according to this invention includes receiving an input containing speech information. One or more visemes that correspond to the speech input are then identified. Next, the weights of those visemes are calculated using a coarticulation engine including viseme deformability information. Finally, a synthetic visual speech output is produced based on the visemes'"'"' weights over time (or tracks). The synthetic visual speech output is combined with a synchronized audio output corresponding to the input to produce a multimedia output containing a 3D lipsyncing animation.

324 Citations

54 Claims

1. A method of producing synthetic visual speech, comprising:
- receiving an input including speech information;
  
  identifying visemes corresponding to the input;
  
  calculating a weight of each of the visemes corresponding to the input using a coarticulation engine, wherein the coarticulation engine comprises viseme deformability information, and wherein each of the viseme weights corresponds to an amount of influence that the viseme has over other visemes active at a specified time; and
  
  producing a synthetic visual speech output based on the weights of the visemes corresponding to the input.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
- - 2. A method according to claim 1, wherein producing a synthetic visual speech output based on the weights of the visemes comprises:
3. A method according to claim 2, wherein morphing between target models comprises base point mixing multiple target models using a morphing engine.
4. A method according to claim 1, wherein the coarticulation engine determines the weight of each viseme based on a variety of factors including viseme deformability, phoneme duration, and speech context.
5. A method according to claim 1, wherein the viseme deformability information comprises a strength and a deformability percentage value.
6. A method according to claim 1, wherein the input comprises a text input, and wherein identifying visemes corresponding to the input comprises:
- identifying phonemes corresponding to the text input using a phoneme neural network; and
  
  identifying visemes that correspond to the phonemes.
7. A method according to claim 6, wherein the input further comprises a voice input, and wherein using a coarticulation engine to calculate a weight of each of the visemes comprises:
- forcing an alignment between the text input and the voice input to determine phoneme duration; and
  
  inputting phoneme duration and context information into a coarticulation algorithm to determine viseme weights.
8. A method according to claim 1, wherein receiving an input comprises receiving a voice-only input, and wherein identifying visemes corresponding to the input comprises:
- running the voice-only input through a speech recognition routine to determine probable phonemes of the input; and
  
  identifying the visemes that correspond to the probable phonemes.
9. A method according to claim 8, wherein the coarticulation engine comprises a neural network.
10. A method according to claim 8, wherein the synthetic visual speech output is produced substantially simultaneously with the input.

11. A method of generating synthetic visual speech, comprising:
- receiving a voice input including speech information;
  
  classifying the voice input into phonemes using a phoneme neural network;
  
  identifying a viseme corresponding to each of the phonemes from the phoneme neural network;
  
  calculating a viseme track for each of the visemes using a viseme neural network, wherein the viseme track comprises a sequence of viseme weights over time, and wherein each viseme weight represents an amount of influence of the viseme over other visemes active at a given time;
  
  morphing between target models of the visemes according to their tracks by producing a series of blended models; and
  
  rendering the series of blended models sequentially to produce a visual speech animation.
- View Dependent Claims (12, 13)
- - 12. A method of claim 11, wherein morphing between target models of the visemes according to their tracks comprises blending viseme target models together based on the weights of each viseme at given points in time determined by an output frame rate of the system.
  - 13. A method according to claim 11, wherein the steps of receiving an input, separating the input into phonemes, and calculating viseme tracks are performed on a server;
    - and wherein the steps of morphing between target models and rendering the series of blended models is performed on a client.

14. A computer readable medium storing computer code comprising:
- instructions for receiving an input including speech information;
  
  instructions for identifying visemes corresponding to the input;
  
  instructions for calculating a weight of each of the visemes corresponding to the input using deformability information, wherein each of the viseme weights corresponds to an amount of influence that the viseme has over other visemes active at a given time; and
  
  instructions for producing a synthetic visual speech output based on the weights of the visemes corresponding to the input.
- View Dependent Claims (15, 16, 17)
- - 15. A computer readable medium according to claim 14, wherein the instructions for receiving an input, identifying visemes, and calculating viseme weights are configured for use on a server;
    - and wherein the instructions for producing a synthetic visual speech output are configured for use on a client.
  - 16. A computer readable medium according to claim 14, wherein the instructions for receiving an input are configured for use on a server;
    - and wherein the instructions for identifying visemes, calculating weights, and producing a synthetic visual speech output are configured for use on a client.
  - 17. A computer readable medium according to claim 14, wherein the computer readable medium is configured for use on a server and further comprises instructions for transmitting a multimedia output to a client, the multimedia output comprising the synthetic visual speech output that is substantially synchronized with a voice output corresponding to the input.

18. A system for producing synthetic visual speech, comprising:
- a receiver to receive an input representing a speech segment;
  
  a first neural network to classify the speech segment according to its phonetic components;
  
  a coarticulation engine comprising deformability information to determine viseme tracks corresponding to the phonetic components of the speech input, wherein the viseme tracks each comprise a sequence of viseme weights over time, and wherein each viseme weight represents an amount of influence of the viseme over other visemes active at a given time; and
  
  a morphing engine for morphing between viseme models based on the viseme tracks to enable a realistic synthetic visual speech output corresponding to the speech segment.
- View Dependent Claims (19, 20)
- - 19. A system according to claim 18, wherein the input representing a speech segment comprises a text input and a voice input, and the system further comprises a forced alignment generator configured to force an alignment between the text input and the voice input to determine a duration of each of the phonetic components.
  - 20. A system according to claim 19, wherein the durations of the phonetic components are used in the coarticulation engine to determine the viseme tracks.

21. A coarticulation engine for calculating viseme tracks comprising:
- a coarticulation algorithm configured to receive data inputs corresponding to a plurality of visemes;
  
  said data inputs representing a context, and a duration of each of the visemes; and
  
  said coarticulation engine further configured to produce data outputs comprising a weight for each of the visemes using deformability information, wherein each viseme weight represents an amount of influence of the viseme over other visemes active at a given time.
- View Dependent Claims (22)
- - 22. A coarticulation engine according to claim 21, wherein the coarticulation engine comprises a neural network classifier trained on viseme deformability data;
    - wherein the data inputs to the neural network classifier comprise mel-frequency cepstral coefficients (MFCC) and MFCC delta features for a window of the voice input; and
      
      wherein the data outputs from the neural network classifier are viseme weights for the window.

23. A method for generating a user-customizable three-dimensional lipsyncing greeting card, comprising:
- receiving a user-defined input containing speech information;
  
  converting the input into a customized electronic greeting card comprising a three-dimensional visual speech animation comprising a lipsyncing character synchronized with an audio output corresponding to the input; and
  
  delivering the customized electronic greeting card to a recipient identified by the user.
- View Dependent Claims (24, 25, 26)
- - 24. A method according to claim 23, further comprising customizing the three-dimensional visual speech animation based on user-selected configurability options, wherein the user-selected configurability options include one or more configurability options selected from a group comprising:
25. A method according to claim 23, wherein delivering the customized electronic greeting card to a recipient identified by the user comprises:
- sending an electronic mail or on-line delivery notification to the recipient; and
  
  making the customized electronic greeting card available for download by the recipient at an internet site.
26. A method according to claim 25, wherein the customized electronic greeting card is made available to the recipient in either a movie format, a streaming media format, or a real-time rendering format.

27. A method of providing an electronic greeting card featuring a three-dimensional lipsyncing character, comprising:
- providing an Internet site;
  
  allowing a user to supply an input containing speech information to the internet site;
  
  converting the user-supplied input into an electronic greeting card comprising a three-dimensional lipsyncing character animated in synchronism with an audio output corresponding to the input; and
  
  delivering the electronic greeting card to a recipient specified by the user.
- View Dependent Claims (28, 29)
- - 28. A method according to claim 27, wherein delivering the electronic greeting card to a recipient comprises:
29. A method according to claim 27, wherein delivering the electronic greeting card to a recipient comprises:
- sending an email containing the electronic greeting card as an attachment to the recipient at an address specified by the user.

30. A method for producing a computer animated lipsyncing, comprising:
- providing a voice input into a first neural network to produce a phoneme output;
  
  providing the phoneme output from the first neural network to a second neural network to produce a viseme track output, wherein the viseme track output comprises a sequence of viseme weights over time, and wherein each viseme weight represents an amount of influence of the viseme over other visemes active at a given time; and
  
  using the viseme track output to generate an animated three-dimensional lipsyncing image in real-time in substantial synchronism with an audio speech output corresponding to the voice input.
- View Dependent Claims (31, 32, 33, 34, 35)
- - 31. A method according to claim 30 wherein using the viseme track output to generate an animated three-dimensional lipsyncing image comprises transmitting a data file containing the viseme output to an animation generator comprising a morphing engine and a rendering engine in real-time rendering format.
  - 32. A method according to claim 31, wherein the animation generator is located on a client computer and the voice input and viseme output are supplied from a server to the client.
  - 33. A method according to claim 31, wherein the animation generator is located on a server and wherein a file containing the animated three-dimensional lipsyncing image and the audio speech output is transmitted from the server to a client computer in a streaming media format.
  - 34. A method according to claim 30, further comprising:
35. A method according to claim 30, wherein the three-dimensional lipsyncing image and audio speech output are produced substantially simultaneously with the voice input.

36. An apparatus for producing a lipsyncing animation, comprising:
- a frame processor to identify frames of a voice input;
  
  a first neural network to receive the frames of the voice input and to identify a probable phoneme corresponding to each of the frames;
  
  a second neural network to receive the probable phonemes and identify viseme weights for one or more visemes active during each of the frames, wherein a viseme weight represents an amount of influence of a corresponding viseme over other visemes active during that frame; and
  
  a rendering engine to render a three-dimensional lipsyncing animation based on the viseme weights in substantial synchronization with an audio output corresponding to the voice input.
- View Dependent Claims (37, 38, 39, 40)
- - 37. An apparatus according to claim 36, further comprising a filter to filter the viseme weights to produce a filtered and smoothed viseme track for each of the active visemes.
  - 38. An apparatus according to claim 36, wherein the second neural network is trained based on viseme data including viseme deformabilities.
  - 39. An apparatus according to claim 36, further comprising a server comprising the first and second neural networks.
  - 40. An apparatus according to claim 36, further comprising a client computer comprising the rendering engine.

41. A method for producing a synthesized visual communication over a network comprising:
- receiving an input containing speech information into a first networked device;
  
  converting the input into phonetic speech components using a phoneme neural network;
  
  converting the phonetic speech components into weighted visual speech information, wherein the weighted visual speech information comprises information representing an amount of influence of a visual speech component over other visual speech components active at a given time;
  
  producing a lipsyncing animation based on the weighted visual speech information; and
  
  displaying the lipsyncing animation in substantial synchronism with an audibilization of a voice output corresponding to the input through a second networked device.
- View Dependent Claims (42, 43, 44, 45)
- - 42. A method according to claim 41, wherein producing the lipsyncing animation occurs on the first networked device, and wherein the lipsyncing animation and input are transmitted to the second networked device in a streaming media format.
  - 43. A method according to claim 41, wherein producing the lipsyncing animation occurs on the second networked device, and wherein the weighted visual speech information and the input are transmitted to the second networked device in a realtime rendering format.
  - 44. A method according to claim 41, wherein the input is a voice input and wherein displaying the lipsyncing animation and audibilizing the voice output occur substantially simultaneously with providing the voice input.
  - 45. A method according to claim 41, further comprising:

46. A method for providing a real-time synthetic communication comprising:
- providing inputs containing speech information into a first one or more of a plurality of devices;
  
  converting the inputs into viseme tracks, wherein the viseme tracks each comprise a sequence of viseme weights over time, and wherein each viseme weight represents an amount of influence of the viseme over other visemes active at a given time;
  
  producing a communication comprising a synthesized visual speech animation for each of the inputs based on the viseme tracks, said communication further comprising an audio output corresponding to the input; and
  
  outputting the communication through a second one or more of the devices.
- View Dependent Claims (47, 48, 49, 50)
- - 47. A method according to claim 46, wherein the communication is a broadcast communication.
  - 48. A method according to claim 46, wherein converting the inputs into viseme tracks and producing a communication comprising a synthesized visual speech animation and an audio output takes place on the first one or more of the devices.
  - 49. A method according to claim 48, wherein the communication is transmitted from the first one or more of the devices to the second one or more of the devices in streaming media format.
  - 50. A method according to claim 46, wherein converting the inputs into viseme tracks takes place on the first one or more of the devices, wherein the viseme tracks are transmitted from the first one or more of the devices to the second one or more of the devices, and wherein producing the communications takes place on the second one or more of the devices.

51. An email reader comprising:
- a phoneme neural classifier for converting an email text or an audio attachment into its constituent plurality of phonemes;
  
  a coarticulation engine to determine a weight of each of a plurality of visemes associated with each of the phonemes, wherein each viseme weight represents an amount of influence of the corresponding viseme over other visemes active at a given time;
  
  a morphing engine for morphing between target viseme models based on viseme weights;
  
  a text-to-audio speech synthesizer for synthesizing an audio voice output based on the phonemes from the email text; and
  
  a rendering engine for rendering an email lipsyncing animation based on data from the morphing engine.
- View Dependent Claims (52, 53, 54)
- - 52. An email reader according to claim 51, further comprising:
53. An email reader according to claim 51, further comprising:
- user-customization options to allow a user to select a lipsyncing character for the animation and a voice-type for the voice output.
54. An email reader according to claim 53, wherein the user-customization options are configured to allow independent selection of the character and voice type for each of a plurality of email senders.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Sensory Fluent Speech Technologies, Inc. (Sensory Incorporated)
Original Assignee
Sensory Fluent Speech Technologies, Inc. (Sensory Incorporated)
Inventors
Sutton, Stephen, Vermeulen, Pieter
Primary Examiner(s)
Banks-Harold, Marsha D.
Assistant Examiner(s)
ABEBE, DANIEL DEMELASH

Application Number

US09/535,100
Time in Patent Office

1,096 Days
Field of Search

704/235, 704/258, 704/260, 704/270, 704/275, 704/276, 345/473, 345/676, 353/15, 353/16
US Class Current

704/260
CPC Class Codes

G06T 13/205   driven by audio data

G06T 13/40   of characters, e.g. humans,...

G10L 13/00   Speech synthesis; Text to s...

G10L 13/033   Voice editing, e.g. manipul...

G10L 13/08   Text analysis or generation...

G10L 2021/0135   Voice conversion or morphing

G10L 2021/105   Synthesis of the lips movem...

G10L 21/06   Transformation of speech in...

Methods and devices for producing and using synthetic visual speech based on natural coarticulation

First Claim

3 Assignments

0 Petitions

Accused Products

Abstract

324 Citations

54 Claims

Specification

Use Cases

Quick Links

Others

Methods and devices for producing and using synthetic visual speech based on natural coarticulation

First Claim

3 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

324 Citations

54 Claims

Specification

Subscription Required

Use Cases

Quick Links

Others