Prosody based audio/visual co-analysis for co-verbal gesture recognition
Abstract
The present method incorporates audio and visual cues from human gesticulation for automatic recognition. The methodology articulates a framework for co-analyzing gestures and prosodic elements of a person's speech. It can be applied to a wide range of algorithms involving analysis of gesticulating individuals. Examples of interactive technology applications range from information kiosks to personal computers. The video analysis of human activity provides a basis for the development of automated surveillance technologies in public places such as airports, shopping malls, and sporting events.
22 Claims
1. A method of gestural behavior analysis, comprising the steps of:
- performing a training process using a combined audio/visual signal as a training data set, whereby prosodic audio features of said training data set are correlated with visual features of said training data set;
- producing a statistical model based on results of said training process; and
- applying said model to an actual data set to classify properties of gestural acts contained therein.
(Dependent claims 2-6 not shown.)
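The training-then-classification loop of claim 1 can be sketched as a minimal statistical model: per-class diagonal Gaussians fit over concatenated prosodic and visual feature vectors, then used as a naive-Bayes classifier. All data below is synthetic and the feature choices (pitch/energy, velocity/acceleration) are illustrative assumptions, not the patent's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: each frame pairs prosodic audio features
# (e.g. pitch, energy) with visual features (e.g. hand velocity,
# acceleration), labeled with a gesture class. Values are synthetic.
n = 200
labels = rng.integers(0, 2, n)                      # 0 = one class, 1 = another
audio = rng.normal(labels[:, None], 0.5, (n, 2))    # prosodic features
visual = rng.normal(labels[:, None], 0.5, (n, 2))   # kinematic features
X = np.hstack([audio, visual])                      # combined audio/visual signal

# "Training process": estimate a per-class diagonal Gaussian over the
# combined features, so audio and visual evidence are modeled jointly.
def fit(X, y):
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-6)
    return model

# "Applying said model": pick the class with the highest log-likelihood.
def classify(model, x):
    def loglik(mu, var):
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(model, key=lambda c: loglik(*model[c]))

model = fit(X, labels)
pred = classify(model, X[0])
```

Any generative classifier would serve here; the Gaussian choice simply makes the "statistical model" of the claim concrete in a few lines.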
7. A system of gestural behavior analysis, comprising:
- means for performing a training process using a combined audio/visual signal as a training data set, whereby prosodic audio features of said training data set are correlated with visual features of said training data set;
- means for producing a statistical model based on results of said training process; and
- means for applying said model to an actual data set to classify properties of gestural acts contained therein.
(Dependent claims 8-12 not shown.)
13. A computer program product for performing gestural behavior analysis, the computer program product comprising a computer-readable storage medium having computer-readable program code embodied in the medium, the computer-readable program code comprising:
- computer-readable program code that performs a training process using a combined audio/visual signal as a training data set, whereby prosodic audio features of said training data set are correlated with visual features of said training data set;
- computer-readable program code that produces a statistical model based on results of said training process; and
- computer-readable program code that applies said model to an actual data set to classify properties of gestural acts contained therein.
(Dependent claims 14-18 not shown.)
19. A system for real-time continuous gesture recognition, comprising:
a) means for storing a database of reference gesture models, kinematical phases of gesture models, intonational representations of speech models, and combined gesture-speech models;
b) input means for receiving a sequence of images in real time, said images containing a gesticulating subject;
c) input means for receiving an audio signal from the gesticulating subject;
d) means for extracting a sequence of positional data of extremities of the subject;
e) means for extracting a pitch sequence of voice data points from the audio signal;
f) processing means for co-analyzing visual and acoustic signals from a recording; and
g) processing means for utilizing a probabilistic framework to fuse gesture-speech co-occurrence information and visual gesture information to determine gesture occurrence.
(Dependent claims 20 and 21 not shown.)
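Element (e) of claim 19, extracting a pitch sequence from the audio signal, can be sketched with a basic autocorrelation pitch estimator over short frames. This is a minimal illustration on a synthetic tone; the frame length, lag bounds, and sample rate are assumptions, and real voiced/unvoiced handling is omitted.

```python
import numpy as np

def pitch_autocorr(frame, sr, fmin=75.0, fmax=400.0):
    """Estimate F0 of one frame by picking the autocorrelation peak
    within a plausible voice-pitch lag range (a minimal sketch)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

# Synthetic stand-in for "an audio signal from the gesticulating
# subject": one second of a 120 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 120 * t)

# The "pitch sequence of voice data points": one estimate per 512-sample frame.
frame_len = 512
pitches = [pitch_autocorr(signal[i:i + frame_len], sr)
           for i in range(0, len(signal) - frame_len, frame_len)]
```

On real speech, frames with a weak autocorrelation peak would be marked unvoiced rather than assigned a pitch.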
22. A method for real-time continuous gesture recognition, comprising:
a) employing a database of reference gesture models, kinematical phases of gesture models, intonational representations of speech models, and combined gesture-speech models;
b) receiving a sequence of images in real time, said images containing a gesticulating subject;
c) receiving an audio signal from the gesticulating subject;
d) extracting a sequence of positional data of extremities of the subject;
e) extracting a pitch sequence of voice data points from the audio signal;
f) co-analyzing visual and acoustic signals from a recording by:
- transforming the sequence of positional data extracted from each image to a sequence of velocity and acceleration features;
- delimiting the sequence of velocity and acceleration features through comparison to the kinematical phases of gestures;
- extracting a set of acoustically prominent segments from the pitch sequence;
- extracting a set of feature points from the acoustically prominent segments of the pitch sequence;
- extracting a set of feature points from the delimited kinematical phases of gestures represented by velocity and acceleration features of the extremities movement;
- extracting a set of alignment measures of the feature points of the pitch sequence and the feature points of the extremities movement;
- comparing the alignment measures of the feature points to co-occurrence gesture-speech models; and
- comparing the velocity and acceleration features of the extremities movement to the reference gesture models; and
g) utilizing a probabilistic framework to fuse gesture-speech co-occurrence information and visual gesture information to determine gesture occurrence.
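Two pieces of claim 22 lend themselves to a compact sketch: the position-to-velocity/acceleration transform in step (f), and the probabilistic fusion of visual and co-occurrence evidence in step (g). The fusion below is a simple odds-product under an independence assumption with illustrative probabilities; it stands in for, and is not, the patent's actual models.

```python
import numpy as np

# Hypothetical positional data of an extremity: (x, y) per video frame.
pos = np.array([[0.0, 0.0], [0.1, 0.0], [0.3, 0.1], [0.6, 0.3], [1.0, 0.6]])

# Step (f), first transform: positions -> velocity and acceleration
# features via successive frame-to-frame differences.
vel = np.diff(pos, axis=0)
acc = np.diff(vel, axis=0)

def fuse(p_visual, p_cooccur, prior=0.5):
    """Step (g) sketch: fuse visual gesture evidence with gesture-speech
    co-occurrence evidence by multiplying their odds with the prior odds
    (treats the two evidence sources as independent)."""
    odds = (prior / (1 - prior)) * (p_visual / (1 - p_visual)) \
           * (p_cooccur / (1 - p_cooccur))
    return odds / (1 + odds)

# Illustrative numbers: moderately confident visual detector, strong
# speech-gesture alignment -> high fused probability of gesture occurrence.
p = fuse(p_visual=0.7, p_cooccur=0.8)
```

Either evidence source alone would give at most 0.8 here; agreement between the two pushes the fused estimate above 0.9, which is the point of combining the modalities.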
Specification