NEURAL NETWORK VOICE ACTIVITY DETECTION EMPLOYING RUNNING RANGE NORMALIZATION

US 20160093313A1
Filed: 09/25/2015
Published: 03/31/2016
Est. Priority Date: 09/26/2014
Status: Active Grant

Abstract

A “running range normalization” method includes computing running estimates of the range of values of features useful for voice activity detection (VAD) and normalizing the features by mapping them to a desired range. Running range normalization includes computation of running estimates of the minimum and maximum values of VAD features and normalizing the feature values by mapping the original range to a desired range. Smoothing coefficients are optionally selected to directionally bias a rate of change of at least one of the running estimates of the minimum and maximum values. The normalized VAD feature parameters are used to train a machine learning algorithm to detect voice activity and to use the trained machine learning algorithm to isolate or enhance the speech component of the audio data.
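The idea in the abstract can be illustrated with a minimal Python sketch, which is not taken from the specification. It tracks a running floor and ceiling for a single feature using two smoothing coefficients, then maps new values to a target range. The class name `RunningRange`, the coefficient names `fast` and `slow`, their default values, the clamping step, and the handling of a zero-width range are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunningRange:
    """Running floor/ceiling tracker for one VAD feature (illustrative only)."""
    fast: float = 0.5    # assumed coefficient: quick move toward a new extreme
    slow: float = 0.001  # assumed coefficient: slow drift back toward the signal
    floor: Optional[float] = None
    ceiling: Optional[float] = None

    def update(self, x: float) -> None:
        """Update both estimates with one new feature value."""
        if self.floor is None or self.ceiling is None:
            # First frame: both estimates start at the observed value.
            self.floor = self.ceiling = x
            return
        # Ceiling rises quickly for values above it and decays slowly otherwise.
        a_max = self.fast if x > self.ceiling else self.slow
        self.ceiling += a_max * (x - self.ceiling)
        # Floor drops quickly for values below it and rises slowly otherwise.
        a_min = self.fast if x < self.floor else self.slow
        self.floor += a_min * (x - self.floor)

    def normalize(self, x: float, lo: float = -1.0, hi: float = 1.0) -> float:
        """Map x from the running input range to the target range [lo, hi]."""
        if self.floor is None or self.ceiling is None:
            raise ValueError("call update() at least once before normalize()")
        span = self.ceiling - self.floor  # input range = maximum - minimum
        if span <= 0.0:
            # Assumption: with no usable range yet, return the target midpoint.
            return (lo + hi) / 2.0
        unit = (x - self.floor) / span
        unit = min(1.0, max(0.0, unit))  # assumption: clamp out-of-range values
        return lo + unit * (hi - lo)
```

In use, `update` would be called once per time frame with the newest feature value, followed by `normalize` on that same value. Choosing `fast` much larger than `slow` gives the directional bias the abstract describes.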

22 Claims

1. A method of obtaining normalized voice activity detection features from an audio signal comprising the steps of:
- at a computing system, dividing an audio signal into a sequence of time frames;
  
  computing one or more voice activity detection feature of the audio signal for each of the time frames;
  
  computing running estimates of minimum and maximum values of the one or more voice activity detection feature of the audio signal for each of the time frames;
  
  computing input ranges of the one or more voice activity detection feature by comparing the running estimates of the minimum and maximum values of the one or more voice activity detection feature of the audio signal for each of the time frames; and
  
  mapping the one or more voice activity detection feature of the audio signal for each of the time frames from the input ranges to one or more desired target range to obtain one or more normalized voice activity detection feature.
  - 2. The method of claim 1, wherein the one or more features of the audio signal indicative of spoken voice data includes one or more of full-band energy, low-band energy, ratios of energies measured in primary and reference microphones, variance values, spectral centroid ratios, spectral variance, variance of spectral differences, spectral flatness, and zero crossing rate.
  - 3. The method of claim 1, wherein the one or more normalized voice activity detection feature is used to produce an estimate of the likelihood of spoken voice data.
  - 4. The method of claim 1, further comprising applying the one or more normalized voice activity detection feature to a machine learning algorithm to produce a voice activity detection estimate indicating at least one of a binary speech/non-speech designation and a likelihood of speech activity.
  - 5. The method of claim 4, further comprising using the voice activity detection estimate to control an adaptation rate of one or more adaptive filters.
  - 6. The method of claim 1, wherein the time frames are overlapping within the sequence of time frames.
  - 7. The method of claim 1, further comprising post-processing the one or more normalized voice activity detection feature, including at least one of smoothing, quantizing, and thresholding.
  - 8. The method of claim 1, wherein the one or more normalized voice activity detection feature is used to enhance the audio signal by one or more of noise reduction, adaptive filtering, power level difference computation, and attenuation of non-speech frames.
  - 9. The method of claim 1, further comprising producing a clarified audio signal comprising the spoken voice data substantially free of non-voice data.
  - 10. The method of claim 1, wherein the one or more normalized voice activity detection feature is used to train a machine learning algorithm to detect speech.
  - 11. The method of claim 1, wherein computing running estimates of minimum and maximum values of the one or more voice activity detection feature comprises applying asymmetrical exponential averaging to the one or more voice activity detection feature.
  - 12. The method of claim 11 further comprising setting smoothing coefficients to correspond to time constants selected to produce one of a gradual change and a rapid change in one of smoothed minimum value estimates and smoothed maximum value estimates.
  - 13. The method of claim 12, wherein the smoothing coefficients are selected such that continuous updating of a maximum value estimate responds rapidly to higher voice activity detection feature values and decays more slowly in response to lower voice activity detection feature values.
  - 14. The method of claim 12, wherein the smoothing coefficients are selected such that continuous updating of a minimum value estimate responds rapidly to lower voice activity detection feature values and increases slowly in response to higher voice activity detection feature values.
  - 15. The method of claim 1, wherein the mapping is performed according to the following formula:
    - normalizedFeatureValue = 2 × (newFeatureValue − featureFloor) / (featureCeiling − featureFloor) − 1.
  - 16. The method of claim 1, wherein the mapping is performed according to the following formula:
    - normalizedFeatureValue = (newFeatureValue − featureFloor) / (featureCeiling − featureFloor).
  - 17. The method of claim 1, wherein the computing input ranges of the one or more voice activity detection feature is performed by subtracting the running estimates of the minimum values from the running estimates of the maximum values.
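Several of the steps recited in claims 1, 2, 6, and 15 to 17 can be sketched in a few lines of Python. The sketch below is illustrative only: the function names, the frame length and hop parameters, and the mean-square definition of full-band energy are assumptions, since the claims name the features without defining them. The two mapping functions follow the formulas of claims 15 and 16 directly.

```python
def split_frames(samples, frame_len, hop):
    """Divide a signal into frames; hop < frame_len yields overlapping frames."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def full_band_energy(frame):
    """Mean-square energy of one frame (assumed definition)."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def map_bipolar(new_value, feature_floor, feature_ceiling):
    """Claim 15 formula: maps the input range onto [-1, 1]."""
    input_range = feature_ceiling - feature_floor  # claim 17: max minus min
    # Note: a zero input_range would divide by zero; callers must guard it.
    return 2 * (new_value - feature_floor) / input_range - 1


def map_unit(new_value, feature_floor, feature_ceiling):
    """Claim 16 formula: maps the input range onto [0, 1]."""
    input_range = feature_ceiling - feature_floor
    return (new_value - feature_floor) / input_range
```

As a worked example, with featureFloor = 1 and featureCeiling = 5, a new feature value of 3 sits at the middle of the input range, so the claim 15 formula gives 2 × (3 − 1) / (5 − 1) − 1 = 0 and the claim 16 formula gives (3 − 1) / (5 − 1) = 0.5.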

18. A method of normalizing voice activity detection features comprising the steps of:
- segmenting an audio signal into a sequence of time frames;
  
  computing running minimum and maximum value estimates for voice activity detection features;
  
  computing input ranges by comparing the running minimum and maximum value estimates; and
  
  normalizing the voice activity detection features by mapping the voice activity detection features from the input ranges to one or more desired target ranges.
  - 19. The method of claim 18, wherein computing running minimum and maximum value estimates comprises selecting smoothing coefficients to establish a directionally-biased rate of change for at least one of the running minimum and maximum value estimates.
  - 20. The method of claim 19, wherein the smoothing coefficients are selected such that the running maximum value estimate responds more quickly to higher maximum values and more slowly to lower maximum values.
  - 21. The method of claim 19, wherein the smoothing coefficients are selected such that the running minimum value estimate responds more quickly to lower minimum values and more slowly to higher minimum values.

22. A computer-readable medium storing a computer program for performing a method for identifying voice data within an audio signal, the computer-readable medium comprising:
- computer storage media; and
  
  computer-executable instructions stored on the computer storage media, which computer-executable instructions, when executed by a computing system, are configured to cause the computing system to:
  
  compute a plurality of voice activity detection features;
  
  compute running estimates of minimum and maximum values of the voice activity detection features;
  
  compute input ranges of the voice activity detection features by comparing the running estimates of the minimum and maximum values; and
  
  map the voice activity detection features from the input ranges to one or more desired target ranges to obtain normalized voice activity detection features.

Current Assignee: Cirrus Logic Incorporated
Original Assignee: Cypher, LLC
Inventors: Vickers, Earl
Granted Patent: US 9,953,661 B2
US Class Current: 1/1
CPC Class Codes

G10L 2015/0636   Threshold criteria for the ...

G10L 21/0224   Processing in the time domain

G10L 21/0264   characterised by the type o...

G10L 25/30   using neural networks

G10L 25/60   for measuring the quality o...

G10L 25/78   Detection of presence or ab...

G10L 25/84   for discriminating voice fr...
