Speech detection fusing multi-class acoustic-phonetic, and energy features
Abstract
A speech detection system extracts a plurality of features from multiple input streams. In the acoustic model space, the tree of Gaussians in the model is pruned to include the active states. The Gaussians are mapped to Hidden Markov Model states for Viterbi phoneme alignment. Another feature space, such as the energy feature space, is combined with the acoustic feature space. In the feature space, the features are combined and principal component analysis decorrelates the features to fewer dimensions, thus reducing the number of features. The Gaussians are also mapped to silence, disfluent phoneme, or voiced phoneme classes. The silence class is true silence and the voiced phoneme class is speech. The disfluent class may be speech or non-speech. If a frame is classified as disfluent, then that frame is re-classified as the silence class or the voiced phoneme class based on adjacent frames.
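The feature-fusion step in the abstract (combine two feature spaces, then decorrelate and reduce with principal component analysis) can be illustrated with a minimal sketch. This is not the patent's implementation: the function name `fuse_and_reduce`, the feature dimensions, and the random stand-in data are all assumptions made for illustration, and the sketch relies on NumPy.

```python
import numpy as np

def fuse_and_reduce(acoustic, energy, n_components):
    """Concatenate per-frame feature matrices and project onto the top
    principal components (hypothetical helper, not from the patent)."""
    # Stack the two feature spaces side by side: (frames, d_acoustic + d_energy).
    fused = np.hstack([acoustic, energy])
    # PCA operates on mean-centered data.
    centered = fused - fused.mean(axis=0)
    # Eigen-decompose the covariance matrix; eigh returns ascending eigenvalues.
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the directions with the largest variance.
    order = np.argsort(eigvals)[::-1][:n_components]
    basis = eigvecs[:, order]
    # The projected features are decorrelated and lower-dimensional.
    return centered @ basis

# Stand-in data: 200 frames, 5 acoustic and 3 energy features (assumed sizes).
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(200, 5))
energy = rng.normal(size=(200, 3))
reduced = fuse_and_reduce(acoustic, energy, n_components=4)
```

The projected matrix has one row per frame and `n_components` columns, and its columns are mutually uncorrelated, which is the property the abstract attributes to the PCA step.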
20 Claims
1. A computer implemented method for speech detection, the computer implemented method comprising:
receiving media input;
generating a current frame of features from the media input;
classifying the current frame of features as a class selected from a set including at least a silence class, a disfluent class, and a speech class;
re-classifying the current frame as the speech class if the current frame is classified as the disfluent class and lies between a previous frame classified as the silence class and a next frame classified as the speech class; and
re-classifying the current frame as the silence class if the current frame is classified as the disfluent class and does not lie between a previous frame classified as the silence class and a next frame classified as the speech class.
Dependent claims: 2, 3, 4, 5, 6, 7
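The re-classification rule recited in claim 1 can be sketched as follows. This is an illustrative reading, not the patented implementation: it assumes that "previous" and "next" mean the immediately adjacent frames, that the neighbors' original (pre-resolution) labels are used, and that a disfluent frame at either end of the sequence fails the condition and becomes silence. The names `FrameClass` and `resolve_disfluent` are hypothetical.

```python
from enum import Enum

class FrameClass(Enum):
    SILENCE = "silence"
    DISFLUENT = "disfluent"
    SPEECH = "speech"

def resolve_disfluent(labels):
    """Return a copy of `labels` with every DISFLUENT frame resolved to
    SPEECH or SILENCE, based on its immediate neighbors (assumed reading)."""
    resolved = []
    for i, label in enumerate(labels):
        if label is not FrameClass.DISFLUENT:
            # Silence and speech frames keep their original class.
            resolved.append(label)
            continue
        # Look at the original labels of the adjacent frames, if they exist.
        prev_label = labels[i - 1] if i > 0 else None
        next_label = labels[i + 1] if i + 1 < len(labels) else None
        if prev_label is FrameClass.SILENCE and next_label is FrameClass.SPEECH:
            # Disfluent frame at a silence-to-speech transition: treat as speech.
            resolved.append(FrameClass.SPEECH)
        else:
            # Any other context: treat as silence.
            resolved.append(FrameClass.SILENCE)
    return resolved

SIL, DIS, SPE = FrameClass.SILENCE, FrameClass.DISFLUENT, FrameClass.SPEECH
example = resolve_disfluent([SIL, DIS, SPE, DIS, SIL])
```

In the example, the first disfluent frame sits between silence and speech and is promoted to speech, while the second sits between speech and silence and is demoted to silence.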
8. A computer program product comprising:
a computer usable medium having computer usable program code for speech detection, the computer usable program code comprising:
computer usable program code for receiving media input;
computer usable program code for generating a current frame of features from the media input;
computer usable program code for classifying the current frame of features as a class selected from a set including at least a silence class, a disfluent class, and a speech class;
computer usable program code for re-classifying the current frame as the speech class if the current frame is classified as the disfluent class and lies between a previous frame classified as the silence class and a next frame classified as the speech class; and
computer usable program code for re-classifying the current frame as the silence class if the current frame is classified as the disfluent class and does not lie between a previous frame classified as the silence class and a next frame classified as the speech class.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A data processing system for speech detection, the data processing system comprising:
a memory having stored therein computer program code; and
a processor coupled to the memory, wherein the processor operates under control of the computer program code to receive media input;
generate a current frame of features from the media input;
classify the current frame of features as a class selected from a set including at least a silence class, a disfluent class, and a speech class;
re-classify the current frame as the speech class if the current frame is classified as the disfluent class and lies between a previous frame classified as the silence class and a next frame classified as the speech class; and
re-classify the current frame as the silence class if the current frame is classified as the disfluent class and does not lie between a previous frame classified as the silence class and a next frame classified as the speech class.
Dependent claims: 16, 17, 18, 19, 20
Specification