SPEECH AND TEXT DRIVEN HMM-BASED BODY ANIMATION SYNTHESIS
First Claim
1. A method for synthesizing animation motions, comprising steps for:
providing a set of one or more animation models, each animation model providing a set of probabilistic motions of one or more body parts learned from acoustic features and speech prosody information extracted from a set of one or more audio/video training signals comprising synchronized speech and body motions;
receiving an arbitrary speech input;
evaluating the speech input to extract a set of acoustic features and speech prosody information from the speech input;
predicting a sequence of the probabilistic motions which best explains the arbitrary speech input, by applying one or more of the set of animation models to the set of acoustic features and speech prosody information extracted from the arbitrary speech input; and
generating an animation sequence from the predicted sequence of the probabilistic motions.
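The "predicting a sequence of the probabilistic motions which best explains the arbitrary speech input" step is, for an HMM, classic Viterbi decoding. A minimal numpy sketch, purely illustrative and not the patented implementation (the `log_emit`/`log_trans`/`log_start` inputs are hypothetical placeholders for a trained model's scores over per-frame acoustic/prosody features):

```python
import numpy as np

def viterbi(log_emit, log_trans, log_start):
    """Most probable hidden-state path (Viterbi decoding).

    log_emit: (T, N) per-frame log emission scores for N motion states;
    log_trans: (N, N) log transition matrix (row = previous state);
    log_start: (N,) log initial-state probabilities.
    Returns the length-T state sequence maximizing the joint log score."""
    T, N = log_emit.shape
    score = np.empty((T, N))
    back = np.empty((T, N), dtype=int)
    score[0] = log_start + log_emit[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans      # cand[prev, cur]
        back[t] = np.argmax(cand, axis=0)             # best predecessor
        score[t] = cand.max(axis=0) + log_emit[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(score[-1]))
    for t in range(T - 2, -1, -1):                    # backtrace
        path[t] = back[t + 1, path[t + 1]]
    return path
```

Each decoded state would index one learned probabilistic motion, from which the final animation sequence is generated.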
Abstract
An “Animation Synthesizer” uses trainable probabilistic models, such as Hidden Markov Models (HMMs), Artificial Neural Networks (ANNs), etc., to provide speech- and text-driven body animation synthesis. The probabilistic models are trained using synchronized motion and speech inputs (e.g., live or recorded audio/video feeds) at various speech levels, such as sentences, phrases, words, phonemes, sub-phonemes, etc., depending upon the available data and the motion type or body part being modeled. The Animation Synthesizer then uses the trained probabilistic models to select animation trajectories for one or more different body parts (e.g., face, head, hands, arms, etc.) based on an arbitrary text and/or speech input. These animation trajectories are then used to synthesize a sequence of animations for digital avatars, cartoon characters, computer-generated anthropomorphic persons or creatures, actual motions for physical robots, etc., synchronized with a speech output corresponding to the text and/or speech input.
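The abstract's training step — learning motion models from synchronized speech and motion at some speech level — could be sketched in its simplest form as per-unit motion statistics. The sketch below is a stand-in, not the patented method: it assumes the speech-unit alignment (e.g., phoneme labels per frame) is already available, and models each unit's motion as a single Gaussian:

```python
import numpy as np

def learn_animation_units(motion_frames, unit_labels):
    """Per-speech-unit Gaussian motion statistics from synchronized data.

    motion_frames: (T, D) motion parameters (e.g., head pose per frame);
    unit_labels: length-T speech-unit label per frame (e.g., phoneme ids).
    Returns {unit: (mean, variance)} — one crude 'animation unit' each."""
    motion_frames = np.asarray(motion_frames, dtype=float)
    units = {}
    for u in set(unit_labels):
        rows = motion_frames[[i for i, l in enumerate(unit_labels) if l == u]]
        units[u] = (rows.mean(axis=0), rows.var(axis=0))
    return units
```

A real HMM-based system would model dynamics within and across units rather than a single static Gaussian per unit, but the grouping of synchronized motion by speech level is the core idea.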
20 Claims
1. A method for synthesizing animation motions (recited in full as the First Claim above). Dependent claims: 2–10.
11. A system for generating audio/video animations from an arbitrary speech input, comprising:
a device for training one or more probabilistic motion models from one or more audio/video training signals comprising synchronized speech and body part motions; wherein training each probabilistic motion model further comprises learning a set of animation units corresponding to actual motions of specific body parts relative to acoustic features and speech prosody information extracted from one or more of the audio/video training signals;
a device for extracting a set of acoustic features and speech prosody information from an arbitrary speech input;
a device for predicting a sequence of the animation units which probabilistically explains the arbitrary speech input, by applying one or more of the set of motion models to the set of acoustic features and speech prosody information extracted from the arbitrary speech input;
a device for generating an animation sequence from the predicted sequence of the animation units; and
a device for constructing an audio/video animation of an avatar, said animation including the arbitrary speech input synchronized to the animation sequence.
Dependent claims: 12–15.
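Claim 11's "device for extracting a set of acoustic features and speech prosody information" could be approximated, at its simplest, by per-frame log energy (an acoustic feature) and a crude autocorrelation F0 estimate (a prosody feature). This is a hypothetical sketch with assumed frame/hop sizes, not the claimed device:

```python
import numpy as np

def extract_features(signal, sr=16000, frame=400, hop=160):
    """Per-frame (log energy, F0) pairs — crude stand-ins for the
    claimed acoustic features and speech prosody information.

    frame=400 / hop=160 at 16 kHz gives 25 ms windows every 10 ms."""
    signal = np.asarray(signal, dtype=float)
    feats = []
    for start in range(0, len(signal) - frame + 1, hop):
        x = signal[start:start + frame] * np.hanning(frame)
        energy = np.log(np.sum(x * x) + 1e-10)
        # Autocorrelation pitch: strongest lag in a plausible 60-400 Hz band.
        ac = np.correlate(x, x, mode="full")[frame - 1:]
        lo, hi = sr // 400, sr // 60
        lag = lo + int(np.argmax(ac[lo:hi]))
        feats.append((energy, sr / lag))
    return np.array(feats)
```

Practical systems would use richer features (MFCCs, smoothed F0 contours, energy deltas), but the frame-by-frame structure is what feeds the motion-model prediction device.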
16. A computer-readable medium having computer executable instructions stored therein for constructing an audio/video animation of an avatar as a function of an arbitrary text input, said instructions comprising:
learning one or more probabilistic motion models from one or more audio/video training signals, wherein each audio/video training signal includes synchronized speech and body part motions of a human speaker; wherein learning each probabilistic motion model further comprises learning a set of animation units corresponding to actual motions of specific body parts of the human speaker as a predictive function of acoustic features and speech prosody information extracted from one or more of the audio/video training signals;
receiving an arbitrary text input;
synthesizing a speech signal from the arbitrary text input;
extracting a set of acoustic features and speech prosody information from the synthesized speech signal;
predicting a sequence of the animation units which probabilistically explains the synthesized speech signal, by applying one or more of the set of motion models to the set of acoustic features and speech prosody information extracted from the synthesized speech signal; and
constructing an audio/video animation of an avatar from the predicted sequence of the animation units, said animation including the synthesized speech signal synchronized to the animation.
Dependent claims: 17–20.
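Claim 16's final step requires the animation to be synchronized to the synthesized speech. One illustrative way to express that synchronization (a sketch only — TTS and unit prediction are assumed to happen upstream, and the 10 ms hop is an assumed frame rate) is to collapse the predicted per-frame animation-unit path into timed segments on the speech timeline:

```python
def units_to_segments(unit_path, hop_s=0.01):
    """Collapse a per-frame animation-unit path into timed segments.

    unit_path: predicted unit id per speech analysis frame (one per
    hop_s seconds). Returns [(unit, start_time_s, end_time_s), ...],
    giving each animation unit its span on the speech timeline."""
    segments = []
    start = 0
    for i in range(1, len(unit_path) + 1):
        # Close the current run at a unit change or at end of input.
        if i == len(unit_path) or unit_path[i] != unit_path[start]:
            segments.append((unit_path[start], start * hop_s, i * hop_s))
            start = i
    return segments
```

The renderer can then play each unit's motion trajectory over its segment while the synthesized speech plays on the same clock, which is the synchronization the claim recites.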
Specification