ROBUST VOICE ACTIVITY DETECTION IN ADVERSE ENVIRONMENTS
First Claim
1. A method for Voice Activity Detection (VAD) in adverse environmental conditions, the method comprising:
receiving an input signal from at least one source;
classifying said input signal into at least one of a silent signal block and a non-silent signal block by comparing temporal feature information;
sending said at least one of said silent signal block or said non-silent signal block to at least one of a Voice Endpoint Storing (VES) module or a total variation filtering module by comparing said temporal feature information to a plurality of thresholds;
determining endpoint information of at least one of a voice signal or a non-voice signal;
employing total variation filtering for enhancing speech features and suppressing noise levels in non-speech portions;
determining a noise floor in said total variation filtered signal;
determining feature information in autocorrelation of said total variation filtered signal sequence;
determining Binary-flag Storing, Merging and Deletion (BSMD) based on a duration threshold on said determined feature information by a BSMD module;
determining voice endpoint correction based on said temporal feature information after said determined BSMD; and
outputting said input signal with said voice endpoint information.
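The block classification and routing steps recited above can be illustrated with a minimal sketch. It is not the patented implementation: the choice of mean absolute amplitude as the temporal feature, the single `silence_threshold` (the claim recites a plurality of thresholds), and the names `route_blocks`, `"VES"` and `"TV_FILTER"` are all assumptions made for illustration.

```python
def mean_abs_amplitude(block):
    """Temporal feature (assumed here): mean absolute amplitude of one block."""
    return sum(abs(s) for s in block) / len(block) if block else 0.0


def split_blocks(signal, block_len):
    """Cut the input signal into consecutive blocks of block_len samples."""
    return [signal[i:i + block_len] for i in range(0, len(signal), block_len)]


def route_blocks(signal, block_len, silence_threshold):
    """Label each block and name the (hypothetical) module that receives it."""
    routed = []
    for index, block in enumerate(split_blocks(signal, block_len)):
        feature = mean_abs_amplitude(block)
        if feature < silence_threshold:
            # Silent block: no filtering needed, goes to endpoint storing.
            routed.append((index, "silent", "VES"))
        else:
            # Non-silent block: passed on for total variation filtering.
            routed.append((index, "non-silent", "TV_FILTER"))
    return routed


# Eight samples of digital silence followed by eight samples of activity.
signal = [0.0] * 8 + [0.5, -0.4, 0.6, -0.5, 0.4, -0.6, 0.5, -0.4]
decisions = route_blocks(signal, block_len=8, silence_threshold=0.05)
```

A practical detector would likely derive its thresholds from the signal rather than fix them, which this sketch does not attempt.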
Abstract
A method and a system for robust voice activity detection under adverse environments are provided. The apparatus includes a controller for controlling a signal receiving module, a signal blocking module, a silent/non-silent classification module for discriminating silent blocks by comparing a temporal feature to a threshold, a total variation filtering module for enhancing voiced portions and reducing the effect of background noise, a frame division module for dividing the filtered signal into small frames, a residual processing module for estimating a noise floor, a silent/non-silent frame classification module, a voice/non-voice signal frame classification module based on autocorrelation features of the total variation filtered signal, a binary-flag merging and deletion module, a voice endpoint detection and correction module, and a voice endpoint storing/sending module. A decision tree is arranged based on the time and memory complexity of the feature extraction methods. The system is able to determine voice region endpoints under different adverse environments.
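As an illustration of what a total variation filtering module might compute, the sketch below applies one well-known 1-D total variation denoising scheme (iterative clipping, a projected gradient method on the dual problem). The abstract does not state which solver the apparatus uses, so the algorithm choice, the function name `tv_denoise`, and the parameters `lam` and `n_iter` are assumptions.

```python
def total_variation(x):
    """Sum of absolute first differences of a sequence."""
    return sum(abs(x[i + 1] - x[i]) for i in range(len(x) - 1))


def tv_denoise(y, lam, n_iter=500):
    """Approximately minimise 0.5*sum((y-x)^2) + lam*total_variation(x).

    Iterative clipping: gradient steps on the dual variable z, followed by
    clipping z to [-lam, lam]; the estimate is recovered as x = y - D^T z,
    where D is the first-difference operator.
    """
    n = len(y)
    if n < 2:
        return list(y)
    alpha = 4.0          # upper bound on the largest eigenvalue of D D^T
    z = [0.0] * (n - 1)  # one dual entry per first difference

    def primal(z):
        x = [0.0] * n
        x[0] = y[0] + z[0]
        for i in range(1, n - 1):
            x[i] = y[i] - (z[i - 1] - z[i])
        x[n - 1] = y[n - 1] - z[n - 2]
        return x

    for _ in range(n_iter):
        x = primal(z)
        for i in range(n - 1):
            step = z[i] + (x[i + 1] - x[i]) / alpha
            z[i] = max(-lam, min(lam, step))
    return primal(z)


# A rapidly alternating sequence; a large lam flattens it to its mean.
noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
smoothed = tv_denoise(noisy, lam=10.0)
```

Smaller values of `lam` keep more of the original signal shape, including abrupt onsets, while larger values suppress more of the fluctuation.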
23 Citations
25 Claims
1. A method for Voice Activity Detection (VAD) in adverse environmental conditions, the method comprising:
receiving an input signal from at least one source;
classifying said input signal into at least one of a silent signal block and a non-silent signal block by comparing temporal feature information;
sending said at least one of said silent signal block or said non-silent signal block to at least one of a Voice Endpoint Storing (VES) module or a total variation filtering module by comparing said temporal feature information to a plurality of thresholds;
determining endpoint information of at least one of a voice signal or a non-voice signal;
employing total variation filtering for enhancing speech features and suppressing noise levels in non-speech portions;
determining a noise floor in said total variation filtered signal;
determining feature information in autocorrelation of said total variation filtered signal sequence;
determining Binary-flag Storing, Merging and Deletion (BSMD) based on a duration threshold on said determined feature information by a BSMD module;
determining voice endpoint correction based on said temporal feature information after said determined BSMD; and
outputting said input signal with said voice endpoint information.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A system for Voice Activity Detection (VAD) in adverse environmental conditions, wherein said system is configured for:
receiving an input signal from at least one source;
classifying said input signal into at least one of a silent signal block or a non-silent signal block by comparing temporal feature information;
sending said at least one of said silent signal block or said non-silent signal block to at least one of a Voice Endpoint Storing (VES) module or a total variation filtering module by comparing said temporal feature information to the thresholds;
determining endpoint information of at least one of a voice signal or non-voice signal;
employing total variation filtering for enhancing speech features and suppressing noise levels in non-speech portions;
determining a noise floor in said total variation filtered signal;
determining feature information in autocorrelation of said total variation filtered signal sequence;
determining Binary-Flag Storing, Merging and Deletion (BSMD) based on a duration threshold on said determined feature information;
determining voice endpoint correction based on the temporal feature information after said determined BSMD; and
outputting said input signal with said voice endpoint information.
Dependent claims: 13, 14, 15, 16
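The step of determining feature information in the autocorrelation of a frame can be sketched as follows. The claims do not say which autocorrelation features are used, so the feature chosen here (the peak of the normalised autocorrelation inside an assumed pitch lag range), the lag limits, the 0.5 decision threshold, and the names `autocorr_peak` and `is_voiced` are hypothetical.

```python
import math
import random


def autocorr_peak(frame, min_lag, max_lag):
    """Return (peak, lag): largest normalised autocorrelation in the lag range."""
    energy = sum(s * s for s in frame)
    last_lag = min(max_lag, len(frame) - 1)
    if energy == 0.0 or last_lag < min_lag:
        return 0.0, min_lag
    best_value, best_lag = float("-inf"), min_lag
    for lag in range(min_lag, last_lag + 1):
        # Autocorrelation at this lag, normalised by the zero-lag energy.
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        value = r / energy
        if value > best_value:
            best_value, best_lag = value, lag
    return best_value, best_lag


def is_voiced(frame, min_lag=20, max_lag=160, threshold=0.5):
    """Flag a frame as voice when its autocorrelation peak is strong enough."""
    peak, _ = autocorr_peak(frame, min_lag, max_lag)
    return peak >= threshold


fs = 8000  # assumed sampling rate in Hz
# A 200 Hz tone (period of 40 samples) stands in for a periodic voiced frame.
tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(240)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(240)]
tone_peak, tone_lag = autocorr_peak(tone, 20, 160)
```

Periodic frames produce a strong peak at the pitch period, whereas noise-like frames do not, which is why a feature of this kind can separate voice frames from non-voice frames.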
17. An apparatus for voice activity detection in adverse environmental conditions, wherein said apparatus comprises:
an integrated circuit further comprising at least one processor;
at least one memory having a computer program code within said integrated circuit;
said at least one memory and said computer program code configured to, with said at least one processor, cause said apparatus to:
receive an input signal from at least one source;
classify said input signal into at least one of a silent signal block or a non-silent signal block by comparing temporal feature information;
send said at least one of said silent signal block or said non-silent signal block to at least one of a Voice Endpoint Storing (VES) module or a total variation filtering module by comparing said temporal feature information to thresholds;
determine endpoint information of at least one of a voice signal or a non-voice signal by at least one of said VES module or said total variation filtering module;
employ total variation filtering by said total variation filtering module for enhancing speech features and suppressing noise levels in non-speech portions;
determine a noise floor in said total variation filtered signal domain;
determine feature information in autocorrelation of said total variation filtered signal sequence;
determine Binary-flag Storing, Merging and Deletion (BSMD) based on the duration threshold on said determined feature information by a BSMD module;
determine voice endpoint correction based on the short-term temporal feature information after said determined binary-flag merging and deletion; and
output said input signal with said voice endpoint information.
Dependent claims: 18, 19, 20, 21, 22, 23, 24, 25
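Binary-flag merging and deletion against duration thresholds can be illustrated with a short sketch operating on per-frame voice flags. The claims recite a duration threshold without fixing values or an order of operations, so the two thresholds `min_gap` and `min_voice`, the merge-then-delete order, and the function names are assumptions.

```python
def runs(flags):
    """Run-length encode a 0/1 flag list as [value, start, length] entries."""
    out = []
    for i, flag in enumerate(flags):
        if out and out[-1][0] == flag:
            out[-1][2] += 1
        else:
            out.append([flag, i, 1])
    return out


def merge_and_delete(flags, min_gap, min_voice):
    """Merge voice runs split by short gaps, then delete short voice runs."""
    flags = list(flags)
    # Merge: fill non-voice gaps shorter than min_gap lying between voice runs.
    encoded = runs(flags)
    for k, (value, start, length) in enumerate(encoded):
        interior = 0 < k < len(encoded) - 1
        if value == 0 and interior and length < min_gap:
            flags[start:start + length] = [1] * length
    # Delete: clear voice runs shorter than min_voice frames.
    for value, start, length in runs(flags):
        if value == 1 and length < min_voice:
            flags[start:start + length] = [0] * length
    return flags


def endpoints(flags, frame_len):
    """Convert voice runs to (start_sample, end_sample) pairs, end exclusive."""
    return [(start * frame_len, (start + length) * frame_len)
            for value, start, length in runs(flags) if value == 1]


raw = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
clean = merge_and_delete(raw, min_gap=2, min_voice=3)
segments = endpoints(clean, frame_len=80)
```

Merging before deleting keeps a voice region that was briefly interrupted from being discarded piecewise; the reverse order would remove more isolated frames and is an equally plausible design choice.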
Specification