ROBUST VOICE ACTIVITY DETECTION IN ADVERSE ENVIRONMENTS

US 20140067388A1
Filed: 09/04/2013
Published: 03/06/2014
Est. Priority Date: 09/05/2012
Status: Abandoned Application

1 Assignment

0 Petitions

Abstract

A method and a system for robust voice activity detection under adverse environments are provided. The apparatus includes a controller for controlling a signal receiving module, a signal blocking module, a silent/non-silent classification module for discriminating silent blocks by comparing a temporal feature to a threshold, a total variation filtering module for enhancing voiced portions and reducing an effect of background noises, a frame division module for dividing a filtered signal into small frames, a residual processing module for estimating a noise floor, a silent/non-silent frame classification module, a voice/non-voice signal frame classification module based on autocorrelation features of a total variation filtered signal, a binary-flag merging and deletion module, a voice endpoint detection and correction module, and a voice endpoint storing/sending module. A decision-tree is arranged based on time and memory complexity of feature extraction methods. The system is able to determine voice region endpoints under different adverse environments.
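The silent/non-silent classification step in the abstract compares a temporal feature to a threshold. A minimal sketch of that idea follows, assuming mean-squared energy as the feature. The function names and the default threshold value are hypothetical and are not taken from the application.

```python
from typing import Sequence


def short_time_energy(block: Sequence[float]) -> float:
    """Mean squared amplitude of one signal block (0.0 for an empty block)."""
    return sum(x * x for x in block) / len(block) if block else 0.0


def zero_crossing_rate(block: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(block) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(block, block[1:]) if (a >= 0.0) != (b >= 0.0)
    )
    return crossings / (len(block) - 1)


def is_silent_block(block: Sequence[float],
                    energy_threshold: float = 1e-4) -> bool:
    """Flag a block as silent when its energy is below the threshold.

    The threshold is an assumed illustrative value; a real system would
    tune it to the input level and noise conditions.
    """
    return short_time_energy(block) < energy_threshold
```

A block such as `[0.001, -0.001] * 80` has energy 1e-6 and would be flagged silent under the assumed threshold, while `[0.5, -0.5] * 80` would not. The zero-crossing rate is shown because claim 2 lists it among the temporal features; how the features are combined is not specified here.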

23 Citations

25 Claims

1. A method for Voice Activity Detection (VAD) in adverse environmental conditions, the method comprising:
- receiving an input signal from at least one source;
  
  classifying said input signal into at least one of a silent signal block and a non-silent signal block by comparing temporal feature information;
  
  sending said at least one of said silent signal block or said non-silent signal block to at least one of a Voice Endpoint Storing (VES) module or a total variation filtering module by comparing said temporal feature information to a plurality of thresholds;
  
  determining endpoint information of at least one of a voice signal or a non-voice signal;
  
  employing total variation filtering for enhancing speech features and suppressing noise levels in non-speech portions;
  
  determining a noise floor in said total variation filtered signal;
  
  determining feature information in autocorrelation of said total variation filtered signal sequence;
  
  determining Binary-flag Storing, Merging and Deletion (BSMD) based on a duration threshold on said determined feature information by a BSMD module;
  
  determining voice endpoint correction based on said temporal feature information after said determined BSMD; and
  
  outputting said input signal with said voice endpoint information.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
- - 2. The method as in claim 1, wherein said temporal feature information comprises at least one of an energy, a zerocrossing rate, and an energy envelope that represents a various nature of an audio signal.
  - 3. The method as in claim 1, wherein said method further comprises using lag of first zerocrossing point of autocorrelation sequence for detecting at least one of noise transients, and white Gaussian noise frames.
  - 4. The method as in claim 1, further comprising determining feature information, wherein said feature information comprises at least one of decaying energy ratios, an amplitude, a lag of a minimum peak amplitude, a lag of a maximum peak, and a zerocrossing rate from said autocorrelation of said signal for discriminating said voice signals from said non-voice signal.
  - 5. The method as in claim 4, further comprising determining said decaying energy ratios from said autocorrelation sequence to provide accurate characterizing of at least one of said voice signal and other background sounds.
  - 6. The method as in claim 1, wherein said sending further comprises receiving a signal block from a signal block division module and computing temporal features for said signal block.
  - 7. The method as in claim 1, further comprising estimating a noise floor from at least one of a total variation residual and said total variation filtered signal envelope which provides discrimination of said voice signal from said non-voice signal in said input signal.
  - 8. The method as in claim 1, further comprising performing a sampling rate conversion depending on the voice processing applications on said received input signal.
  - 9. The method as in claim 1, further comprising:
    - receiving a total variation filtered signal frame from a signal frame division module;
      
      computing said temporal feature information for said signal frame;
      
      comparing said feature information to thresholds;
      
      sending said non-silent signal frame to a Voice/Non-voice Frame Classification (VNFC) module;
      
      generating binary flag 0 information; and
      
      sending said binary flag 0 information to said BSMD module.
  - 10. The method as in claim 1, wherein said sending further comprises extracting feature information from said input signal by a Hierarchical Decision-Tree (HDT).
  - 11. The method as in claim 10, wherein said HDT sends at least one of a silent signal or a non-silent signal to at least one of the VES module or the total variation filtering module by comparing said temporal features to threshold.
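Claims 3 and 4 rely on features read from the autocorrelation sequence, including the lag of its first zero-crossing point. The sketch below shows one way such a feature could be computed, assuming a biased autocorrelation estimate normalized by the zero-lag value. The function names are hypothetical, and the application may define the feature differently.

```python
from typing import List, Optional, Sequence


def normalized_autocorrelation(frame: Sequence[float],
                               max_lag: int) -> List[float]:
    """Single-sided autocorrelation for lags 0..max_lag, divided by lag 0.

    Returns all zeros for an all-zero frame to avoid dividing by zero.
    """
    n = len(frame)
    r0 = sum(x * x for x in frame)
    if r0 == 0.0:
        return [0.0] * (max_lag + 1)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag)) / r0
        for lag in range(max_lag + 1)
    ]


def first_zero_crossing_lag(acf: Sequence[float]) -> Optional[int]:
    """Smallest lag >= 1 at which the sequence is no longer positive.

    Returns None if the sequence stays positive over the given lag range.
    """
    for lag in range(1, len(acf)):
        if acf[lag] <= 0.0:
            return lag
    return None
```

For a rapidly alternating frame such as `[1.0, -1.0] * 4` the first zero crossing is found at lag 1, while a constant frame has none within the lag range. Claim 3 uses this lag to detect noise transients and white Gaussian noise frames; the decision thresholds are not given in this record.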

12. A system for Voice Activity Detection (VAD) in adverse environmental conditions, wherein said system is configured for:
- receiving an input signal from at least one source;
  
  classifying said input signal into at least one of a silent signal block or a non-silent signal block by comparing temporal feature information;
  
  sending said at least one of said silent signal block or said non-silent signal block to at least one of a Voice Endpoint Storing (VES) module or a total variation filtering module by comparing said temporal feature information to the thresholds;
  
  determining endpoint information of at least one of a voice signal or non-voice signal;
  
  employing total variation filtering for enhancing speech features and suppressing noise levels in non-speech portions;
  
  determining a noise floor in said total variation filtered signal;
  
  determining feature information in autocorrelation of said total variation filtered signal sequence;
  
  determining Binary-Flag Storing, Merging and Deletion (BSMD) based on a duration threshold on said determined feature information;
  
  determining voice endpoint correction based on the temporal feature information after said determined BSMD; and
  
  outputting said input signal with said voice endpoint information.
- Dependent Claims (13, 14, 15, 16)
- - 13. The system as in claim 12, wherein said system comprises a Voice/Non-voice Frame Classification (VNFC) module that is configured for:
    - receiving said non-silent signal frame from a Silent/Non-silent Frame Classification (SNFC) module;
      
      computing a normalized single-sided autocorrelation sequence of said non-silent signal frame;
      
      computing feature parameters for a predefined lag range of said autocorrelation sequence;
      
      comparing said features to said threshold;
      
      generating binary flag 0 information which is sent to a BSMD module;
      
      computing feature parameters for a predefined lag range of said autocorrelation sequence based on said comparison;
      
      comparing said features to said thresholds;
      
      generating at least one of a binary flag 1 or a binary flag 0; and
      
      sending said generated binary flag sequence information to said BSMD module.
  - 14. The system as in claim 12, wherein said parameters comprise at least one of a lag index of a first zerocrossing point, a zerocrossing rate, a lag index of a minimum point, an amplitude of a minimum point, a lag index of a maximum point, an amplitude of a maximum point, and decaying energy ratios.
  - 15. The system as in claim 12, wherein said BSMD module is configured for:
    - receiving said binary flag sequence information;
      
      finding locations of positive and negative transitions in said received binary flag sequence;
      
      calculating a difference in said locations; and
      
      comparing said difference with said threshold.
  - 16. The system as in claim 12, wherein said BSMD module is configured to perform at least one of replacing a binary block of 0 with a binary block of 1, and replacing a binary block of 1 with a binary block of 0 after said comparing.
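Claims 15 and 16 describe the BSMD step as locating transitions in the binary flag sequence, measuring the distance between them, comparing it with a threshold, and then replacing blocks of 0 with 1 or blocks of 1 with 0. The following is a minimal sketch of that kind of post-processing. The function names, the two separate thresholds (`min_gap`, `min_voice`), and the order of merging before deletion are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def run_lengths(flags: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Split a flag sequence into (value, start_index, length) runs.

    Run boundaries are the positive (0 -> 1) and negative (1 -> 0)
    transitions; the length is the difference between their locations.
    """
    runs = []
    start = 0
    for i in range(1, len(flags) + 1):
        if i == len(flags) or flags[i] != flags[start]:
            runs.append((flags[start], start, i - start))
            start = i
    return runs


def merge_and_delete(flags: Sequence[int],
                     min_gap: int,
                     min_voice: int) -> List[int]:
    """Fill short interior 0-runs, then remove short 1-runs.

    min_gap and min_voice are hypothetical duration thresholds in frames.
    """
    out = list(flags)
    # Merging: a short gap with voice on both sides becomes voice.
    for value, start, length in run_lengths(out):
        interior = start > 0 and start + length < len(out)
        if value == 0 and interior and length < min_gap:
            out[start:start + length] = [1] * length
    # Deletion: a voice segment that is still too short becomes non-voice.
    for value, start, length in run_lengths(out):
        if value == 1 and length < min_voice:
            out[start:start + length] = [0] * length
    return out
```

With `min_gap=2` and `min_voice=2`, the sequence `[0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0]` becomes `[0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]`: the one-frame gap is merged into the surrounding voice region and the isolated one-frame voice flag is deleted.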

17. An apparatus for voice activity detection in adverse environmental conditions, wherein said apparatus comprises:
- an integrated circuit further comprising at least one processor;
  
  at least one memory having a computer program code within said integrated circuit;
  
  said at least one memory and said computer program code configured to, with said at least one processor, cause said apparatus to:
  
  receive an input signal from at least one source;
  
  classify said input signal into at least one of a silent signal block or a non-silent signal block by comparing temporal feature information;
  
  send said at least one of said silent signal block or said non-silent signal block to at least one of a Voice Endpoint Storing (VES) module or a total variation filtering module by comparing said temporal feature information to thresholds;
  
  determine endpoint information of at least one of a voice signal or a non-voice signal by at least one of said VES module or said total variation filtering module;
  
  employ total variation filtering by said total variation filtering module for enhancing speech features and suppressing noise levels in non-speech portions;
  
  determine a noise floor in said total variation filtered signal domain;
  
  determine feature information in autocorrelation of said total variation filtered signal sequence;
  
  determine Binary-flag Storing, Merging and Deletion (BSMD) based on the duration threshold on said determined feature information by a BSMD module;
  
  determine voice endpoint correction based on the short-term temporal feature information after said determined binary-flag merging and deletion; and
  
  output said input signal with said voice endpoint information.
- Dependent Claims (18, 19, 20, 21, 22, 23, 24, 25)
- - 18. The apparatus as in claim 17, wherein said apparatus is configured to extract said temporal features from said input signal by a Signal Block Division (SBD) module.
  - 19. The apparatus as in claim 17, wherein said apparatus is configured to send silent signal or non-silent signal extracting feature information from said input signal using a Hierarchical Decision-Tree (HDT) in a Silent/Non-silent Block Classification (SNBC) module.
  - 20. The apparatus as in claim 17, wherein said apparatus is configured to send at least one of a silent signal or a non-silent signal to at least one of said VES module or said filtering module by comparing said temporal features to thresholds.
  - 21. The apparatus as in claim 17, wherein said apparatus is configured to output said input signal with said voice endpoint information after correcting said endpoint information by a Voice Endpoint Determination and Correction (VEDC) module.
  - 22. The apparatus as in claim 17, wherein said apparatus is configured for:
    - receiving audio data from at least one of a data acquisition module, audio communication, a storage device, and compressive sensing devices.
  - 23. The apparatus as in claim 17, wherein said apparatus is configured for:
    - using said total variation filtering to enhance said voice features and suppress noise levels in said non-voice signal.
  - 24. The apparatus as in claim 17, wherein said apparatus is configured for preventing pitch doubling and pitch halving errors.
  - 25. The apparatus as in claim 17, wherein said apparatus is configured for triggering said VAD in at least one of a manual mode or an automatic mode selected by a user.
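The claims name total variation filtering as the stage that enhances voice features and suppresses noise, but this record does not say how the filter is computed. As an illustration only, the sketch below uses a generic one-dimensional total variation denoiser that minimizes 0.5 * ||y - x||^2 + lam * ||Dx||_1 by projected gradient steps on the dual variable. The solver choice, the function name, the regularization weight `lam`, and the iteration count are all assumptions, not the applicants' method.

```python
from typing import List, Sequence


def tv_denoise(y: Sequence[float], lam: float,
               iterations: int = 200) -> List[float]:
    """Generic 1-D total variation denoising (illustrative solver).

    D is the first-difference operator. The dual variable z is kept
    inside [-lam, lam] by clipping, and the estimate is x = y - D^T z.
    """
    n = len(y)
    if n < 2:
        return list(y)
    z = [0.0] * (n - 1)
    x = list(y)
    for _ in range(iterations):
        # Recover the signal estimate from the dual variable: x = y - D^T z.
        x = [y[0] + z[0]]
        x += [y[j] - (z[j - 1] - z[j]) for j in range(1, n - 1)]
        x.append(y[n - 1] - z[n - 2])
        # Gradient step on z with step 1/4, then clip to [-lam, lam].
        z = [
            max(-lam, min(lam, z[i] + 0.25 * (x[i + 1] - x[i])))
            for i in range(n - 1)
        ]
    return x
```

A larger `lam` flattens small fluctuations more aggressively while keeping the signal mean unchanged, which is the general behavior that makes this family of filters attractive for suppressing low-level noise between speech segments.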

Current Assignee: Samsung Electronics Co. Ltd.
Original Assignee: Samsung Electronics Co. Ltd.
Inventors: Manikandan, M. Sabarimalai; Tyagi, Saurabh
Application Number: US14/017,983
Publication Number: US 20140067388A1
US Class Current: 704/233
CPC Class Codes:
G10L 15/20   Speech recognition techniqu...
G10L 21/0208   Noise filtering
G10L 25/78   Detection of presence or ab...
