Apparatus and Methods for the Detection of Emotions in Audio Interactions
Abstract
An apparatus and method for detecting an emotional state of a speaker participating in an audio signal. The apparatus and method are based on the distance in voice features between a person being in an emotional state and the same person being in a neutral state. The apparatus and method comprise a training phase in which a training feature vector is determined, and an ongoing stage in which the training feature vector is used to determine emotional states in a working environment. Multiple types of emotions can be detected, and the method and apparatus are speaker-independent, i.e., no prior voice sample or information about the speaker is required.
163 Citations
18 Claims
1. A method for detecting an at least one emotional state of an at least one speaker speaking in an at least one tested audio signal having a quality, the method comprising an emotion detection phase, the emotion detection phase comprising:
- a feature extraction step for extracting at least two feature vectors, each feature vector extracted from an at least one frame within the at least one tested audio signal;
- a first model construction step for constructing a reference voice model from at least two first feature vectors, said model representing the speaker's voice in neutral emotional state of the at least one speaker;
- a second model construction step for constructing an at least one section voice model from at least two second feature vectors;
- a distance determination step for determining an at least one distance between the reference voice model and the at least one section voice model; and
- a section emotion score determination step for determining, by using the at least one distance, an at least one emotion score.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
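The claim above can be illustrated with a minimal sketch. Everything here is a hypothetical reading of the claim, not the patent's actual implementation: the per-frame features, the mean/variance voice model, the variance-normalized Euclidean distance, and the bounded score mapping are all assumptions chosen for concreteness, since the claim does not specify them.

```python
import numpy as np

def extract_features(frames):
    # Hypothetical per-frame feature vector: sample mean, mean energy,
    # and a zero-crossing-rate proxy. The patent does not name its features.
    return np.stack([
        [f.mean(), (f ** 2).mean(), np.abs(np.diff(np.sign(f))).mean()]
        for f in frames
    ])

def build_model(feature_vectors):
    # Model a voice as the (mean, variance) of its feature vectors,
    # standing in for both the reference and the section voice models.
    return feature_vectors.mean(axis=0), feature_vectors.var(axis=0) + 1e-8

def model_distance(reference, section):
    # Distance between models: Euclidean distance between model means,
    # normalized by the reference model's per-feature variance.
    (m1, v1), (m2, _) = reference, section
    return float(np.sqrt((((m1 - m2) ** 2) / v1).sum()))

def emotion_score(distance, scale=1.0):
    # Map the distance to a bounded score in [0, 1):
    # a larger distance from the neutral reference yields a higher score.
    return distance / (distance + scale)
```

A section whose features diverge from the speaker's neutral reference model (for example, higher frame energy) yields a larger distance and hence a higher emotion score; identical models yield a score of zero.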
10. An apparatus for detecting an emotional state of an at least one speaker speaking in an at least one audio signal, the apparatus comprising:
- a feature extraction component for extracting at least two feature vectors, each feature vector extracted from an at least one frame within the at least one audio signal;
- a model construction component for constructing a model from at least two feature vectors;
- a distance determination component for determining a distance between the two models; and
- an emotion score determination component for determining, using said distance, an at least one emotion score for the at least one speaker within the at least one audio signal to be in an emotional state.
- View Dependent Claims (11, 12, 13, 14, 15, 16, 17)
18. A computer readable storage medium containing a set of instructions for a general purpose computer, the set of instructions comprising:
- a feature extraction component for extracting at least two feature vectors, each feature vector extracted from an at least one frame within an at least one audio signal in which an at least one speaker is speaking;
- a model construction component for constructing a model from at least two feature vectors;
- a distance determination component for determining a distance between the two models; and
- an emotion score determination component for determining, using said distance, an at least one emotion score for the at least one speaker within the at least one audio signal to be in an emotional state.
Specification