SPEECH AND TEXT DRIVEN HMM-BASED BODY ANIMATION SYNTHESIS
First Claim
1. A method for synthesizing animation motions, comprising steps for:
providing a set of one or more animation models, each animation model providing a set of probabilistic motions of one or more body parts learned from acoustic features and speech prosody information extracted from a set of one or more audio/video training signals comprising synchronized speech and body motions;
receiving an arbitrary speech input;
evaluating the speech input to extract a set of acoustic features and speech prosody information from the speech input;
predicting a sequence of the probabilistic motions which best explains the arbitrary speech input, by applying one or more of the set of animation models to the set of acoustic features and speech prosody information extracted from the arbitrary speech input; and
generating an animation sequence from the predicted sequence of the probabilistic motions.
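The "predicting a sequence of the probabilistic motions which best explains the arbitrary speech input" step is, for an HMM, classic Viterbi decoding. A minimal numpy sketch, purely illustrative and not the patented implementation (the `log_emit`/`log_trans`/`log_start` inputs are hypothetical placeholders for a trained model's scores over per-frame acoustic/prosody features):

```python
import numpy as np

def viterbi(log_emit, log_trans, log_start):
    """Most probable hidden-state path (Viterbi decoding).

    log_emit: (T, N) per-frame log emission scores for N motion states;
    log_trans: (N, N) log transition matrix (row = previous state);
    log_start: (N,) log initial-state probabilities.
    Returns the length-T state sequence maximizing the joint log score."""
    T, N = log_emit.shape
    score = np.empty((T, N))
    back = np.empty((T, N), dtype=int)
    score[0] = log_start + log_emit[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans      # cand[prev, cur]
        back[t] = np.argmax(cand, axis=0)             # best predecessor
        score[t] = cand.max(axis=0) + log_emit[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(score[-1]))
    for t in range(T - 2, -1, -1):                    # backtrace
        path[t] = back[t + 1, path[t + 1]]
    return path
```

Each decoded state would index one learned probabilistic motion, from which the final animation sequence is generated.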
Abstract
An “Animation Synthesizer” uses trainable probabilistic models, such as Hidden Markov Models (HMMs), Artificial Neural Networks (ANNs), etc., to provide speech- and text-driven body animation synthesis. The probabilistic models are trained using synchronized motion and speech inputs (e.g., live or recorded audio/video feeds) at various speech levels, such as sentences, phrases, words, phonemes, sub-phonemes, etc., depending upon the available data and the motion type or body part being modeled. The Animation Synthesizer then uses the trained probabilistic models to select animation trajectories for one or more different body parts (e.g., face, head, hands, arms, etc.) based on an arbitrary text and/or speech input. These animation trajectories are then used to synthesize a sequence of animations for digital avatars, cartoon characters, computer-generated anthropomorphic persons or creatures, actual motions for physical robots, etc., synchronized with a speech output corresponding to the text and/or speech input.
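The abstract's training step — learning motion models from synchronized speech and motion at some speech level — could be sketched in its simplest form as per-unit motion statistics. The sketch below is a stand-in, not the patented method: it assumes the speech-unit alignment (e.g., phoneme labels per frame) is already available, and models each unit's motion as a single Gaussian:

```python
import numpy as np

def learn_animation_units(motion_frames, unit_labels):
    """Per-speech-unit Gaussian motion statistics from synchronized data.

    motion_frames: (T, D) motion parameters (e.g., head pose per frame);
    unit_labels: length-T speech-unit label per frame (e.g., phoneme ids).
    Returns {unit: (mean, variance)} — one crude 'animation unit' each."""
    motion_frames = np.asarray(motion_frames, dtype=float)
    units = {}
    for u in set(unit_labels):
        rows = motion_frames[[i for i, l in enumerate(unit_labels) if l == u]]
        units[u] = (rows.mean(axis=0), rows.var(axis=0))
    return units
```

A real HMM-based system would model dynamics within and across units rather than a single static Gaussian per unit, but the grouping of synchronized motion by speech level is the core idea.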
20 Claims
1. A method for synthesizing animation motions (recited in full as the First Claim above). Dependent claims: 2–10.
11. A system for generating audio/video animations from an arbitrary speech input, comprising:
a device for training one or more probabilistic motion models from one or more audio/video training signals comprising synchronized speech and body part motions; wherein training each probabilistic motion model further comprises learning a set of animation units corresponding to actual motions of specific body parts relative to acoustic features and speech prosody information extracted from one or more of the audio/video training signals;
a device for extracting a set of acoustic features and speech prosody information from an arbitrary speech input;
a device for predicting a sequence of the animation units which probabilistically explains the arbitrary speech input, by applying one or more of the set of motion models to the set of acoustic features and speech prosody information extracted from the arbitrary speech input;
a device for generating an animation sequence from the predicted sequence of the animation units; and
a device for constructing an audio/video animation of an avatar, said animation including the arbitrary speech input synchronized to the animation sequence.
Dependent claims: 12–15.
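Claim 11's "device for extracting a set of acoustic features and speech prosody information" could be approximated, at its simplest, by per-frame log energy (an acoustic feature) and a crude autocorrelation F0 estimate (a prosody feature). This is a hypothetical sketch with assumed frame/hop sizes, not the claimed device:

```python
import numpy as np

def extract_features(signal, sr=16000, frame=400, hop=160):
    """Per-frame (log energy, F0) pairs — crude stand-ins for the
    claimed acoustic features and speech prosody information.

    frame=400 / hop=160 at 16 kHz gives 25 ms windows every 10 ms."""
    signal = np.asarray(signal, dtype=float)
    feats = []
    for start in range(0, len(signal) - frame + 1, hop):
        x = signal[start:start + frame] * np.hanning(frame)
        energy = np.log(np.sum(x * x) + 1e-10)
        # Autocorrelation pitch: strongest lag in a plausible 60-400 Hz band.
        ac = np.correlate(x, x, mode="full")[frame - 1:]
        lo, hi = sr // 400, sr // 60
        lag = lo + int(np.argmax(ac[lo:hi]))
        feats.append((energy, sr / lag))
    return np.array(feats)
```

Practical systems would use richer features (MFCCs, smoothed F0 contours, energy deltas), but the frame-by-frame structure is what feeds the motion-model prediction device.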
16. A computer-readable medium having computer executable instructions stored therein for constructing an audio/video animation of an avatar as a function of an arbitrary text input, said instructions comprising:
learning one or more probabilistic motion models from one or more audio/video training signals, wherein each audio/video training signal includes synchronized speech and body part motions of a human speaker; wherein learning each probabilistic motion model further comprises learning a set of animation units corresponding to actual motions of specific body parts of the human speaker as a predictive function of acoustic features and speech prosody information extracted from one or more of the audio/video training signals;
receiving an arbitrary text input;
synthesizing a speech signal from the arbitrary text input;
extracting a set of acoustic features and speech prosody information from the synthesized speech signal;
predicting a sequence of the animation units which probabilistically explains the synthesized speech signal, by applying one or more of the set of motion models to the set of acoustic features and speech prosody information extracted from the synthesized speech signal; and
constructing an audio/video animation of an avatar from the predicted sequence of the animation units, said animation including the synthesized speech signal synchronized to the animation.
Dependent claims: 17–20.
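Claim 16's final step requires the animation to be synchronized to the synthesized speech. One illustrative way to express that synchronization (a sketch only — TTS and unit prediction are assumed to happen upstream, and the 10 ms hop is an assumed frame rate) is to collapse the predicted per-frame animation-unit path into timed segments on the speech timeline:

```python
def units_to_segments(unit_path, hop_s=0.01):
    """Collapse a per-frame animation-unit path into timed segments.

    unit_path: predicted unit id per speech analysis frame (one per
    hop_s seconds). Returns [(unit, start_time_s, end_time_s), ...],
    giving each animation unit its span on the speech timeline."""
    segments = []
    start = 0
    for i in range(1, len(unit_path) + 1):
        # Close the current run at a unit change or at end of input.
        if i == len(unit_path) or unit_path[i] != unit_path[start]:
            segments.append((unit_path[start], start * hop_s, i * hop_s))
            start = i
    return segments
```

The renderer can then play each unit's motion trajectory over its segment while the synthesized speech plays on the same clock, which is the synchronization the claim recites.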
Specification