Low-complexity voice activity detection

US 10,360,926 B2
Filed: 07/07/2015
Issued: 07/23/2019
Est. Priority Date: 07/10/2014
Status: Active Grant

Abstract

Many processes for audio signal processing can benefit from voice activity detection, which aims to detect the presence of speech as opposed to silence or noise. The present disclosure describes, among other things, leveraging energy-based features of voice and insights on first and second formant frequencies of vowels to provide a low-complexity and low-power voice activity detector. A pair of channels is provided, each configured to detect voice activity in a respective frequency band of interest. Simultaneous activity detected in both channels can be a sufficient condition for determining that voice is present. More channels or pairs of channels can be used to detect different types of voices, to improve detection, and/or to detect voices present in different audio streams.
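
As an illustration of the energy-based features mentioned above, the following minimal sketch estimates the energy of one audio frame in a single frequency band. It is not taken from the specification. It assumes a second-order band-pass section with the widely used Audio EQ Cookbook coefficients, and the function names, the Q value, and the 16 kHz sample rate are hypothetical. The 400 Hz and 2050 Hz centers are borrowed from claim 2 as examples only.

```python
import math


def bandpass_coefficients(center_hz, sample_rate_hz, q=2.0):
    """Normalized biquad band-pass coefficients (unity gain at center_hz).

    Assumption: Audio EQ Cookbook band-pass form; q=2.0 is an arbitrary choice.
    """
    w0 = 2.0 * math.pi * center_hz / sample_rate_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def band_rms(samples, center_hz, sample_rate_hz, q=2.0):
    """Root-mean-square energy of `samples` after band-pass filtering."""
    if not samples:
        return 0.0
    (b0, b1, b2), (a1, a2) = bandpass_coefficients(center_hz, sample_rate_hz, q)
    x1 = x2 = y1 = y2 = 0.0
    total = 0.0
    for x in samples:
        # Direct form I difference equation of the biquad.
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        total += y * y
    return math.sqrt(total / len(samples))


# Usage: a 400 Hz tone carries far more energy in the 400 Hz band
# than in the 2050 Hz band.
sample_rate = 16000  # assumed sample rate
tone = [math.sin(2.0 * math.pi * 400.0 * n / sample_rate) for n in range(1600)]
first_band_energy = band_rms(tone, 400.0, sample_rate)
second_band_energy = band_rms(tone, 2050.0, sample_rate)
```

A per-band energy value of this kind would only be the input to each channel. Deciding whether a channel shows "activity" is a separate step that this sketch does not cover.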

20 Claims

1. A low-complexity and low-power voice activity detector comprising:
- a first channel for processing a first audio stream and detecting activity in a first frequency band, wherein the first frequency band includes a first group of formant frequencies characteristic of vowels;
  
  a second channel for processing the first audio stream and detecting activity in a second frequency band, wherein the second frequency band includes a second group of formant frequencies characteristic of vowels;
  
  a third channel for processing the first audio stream, detecting activity in a third frequency band, and reducing false positives, wherein the third frequency band is substantially out-of-band with the first frequency band; and
  
  a first decision module to detect that voice activity is present in the first audio stream if (1) the first channel and the second channel both detect activity, and (2) the third channel does not detect activity, and to detect that voice activity is not present if the third channel detects activity;
  
  wherein the detection that voice activity is present triggers one or more processes to be executed by a system.
- Dependent Claims (2, 3, 4, 5, 6, 7)
- - 2. The low-complexity and low-power voice activity detector of claim 1, wherein:
    - the first frequency band includes a frequency of 400 Hertz; and
      
      the second frequency band includes a frequency of 2050 Hertz.
  - 3. The low-complexity and low-power voice activity detector of claim 1, wherein the first channel comprises:
    - a top tracker for tracking peaks of estimated energy of the first audio stream in the first frequency band to produce an output of the top tracker;
      
      a bottom tracker for tracking quiet periods of the estimated energy of the first audio stream in the first frequency band to produce an output of the bottom tracker; and
      
      a modulation tracker for subtracting the output of the top tracker and the output of the bottom tracker to generate a modulation index.
  - 4. The low-complexity and low-power voice activity detector of claim 3, wherein the top tracker is configured to:
    - decrease the output of the top tracker at a first rate if the estimated energy is no longer at a peak; and
      
      decrease the output of the top tracker at a second rate faster than the first rate if the estimated energy has not returned to the peak for a particular period of time.
  - 5. The low-complexity and low-power voice activity detector of claim 3, wherein the bottom tracker is configured to:
    - increase the output of the bottom tracker at a first rate if the estimated energy is at a quiet period; and
      
      increase the output of the bottom tracker at a second rate faster than the first rate if the estimated energy continued to be in the quiet period for a particular period of time.
  - 6. The low-complexity and low-power voice activity detector of claim 3, wherein the first channel further comprises:
    - a comparator for comparing the modulation index against a threshold; and
      
      a low pass filtering module for processing the output of the comparator.
  - 7. The low-complexity and low-power voice activity detector of claim 1, further comprising an ambient noise generator configured to artificially generate pre-event audio samples based on the first audio stream.
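
The channel structure recited in claims 3 through 6 (top tracker, bottom tracker, modulation index, comparator, low-pass filtering) can be sketched roughly as follows. This is an illustrative reading, not the patented implementation. All class names, rates, hold times, and thresholds are hypothetical, and the bottom tracker is modeled simply as a mirror image of the top tracker, which is a simplification of the wording of claim 5.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Tracker:
    """Follows energy peaks (direction=+1) or quiet floors (direction=-1)."""
    direction: int
    slow_rate: float = 0.02   # assumed first (slower) rate
    fast_rate: float = 0.2    # assumed second (faster) rate
    hold_frames: int = 25     # assumed period before the faster rate applies
    value: Optional[float] = None
    stale_frames: int = 0

    def update(self, energy: float) -> float:
        if self.value is None or (energy - self.value) * self.direction >= 0:
            # New peak (or new floor): snap to it and restart the hold timer.
            self.value = energy
            self.stale_frames = 0
        else:
            # Drift toward the current energy, faster once the hold expires.
            self.stale_frames += 1
            rate = self.fast_rate if self.stale_frames > self.hold_frames else self.slow_rate
            self.value += rate * (energy - self.value)
        return self.value


@dataclass
class Channel:
    """One detection channel fed with per-frame band energy estimates."""
    threshold: float = 0.5    # assumed modulation-index threshold
    smoothing: float = 0.5    # assumed one-pole low-pass coefficient
    top: Tracker = field(default_factory=lambda: Tracker(direction=+1))
    bottom: Tracker = field(default_factory=lambda: Tracker(direction=-1))
    smoothed: float = 0.0

    def update(self, energy: float) -> bool:
        # Modulation index: top tracker output minus bottom tracker output.
        modulation_index = self.top.update(energy) - self.bottom.update(energy)
        comparator_output = 1.0 if modulation_index > self.threshold else 0.0
        # Low-pass filter the comparator output to suppress isolated glitches.
        self.smoothed += self.smoothing * (comparator_output - self.smoothed)
        return self.smoothed > 0.5


# Usage: steady energy shows no modulation, alternating energy does.
steady = Channel()
steady_flags = [steady.update(1.0) for _ in range(50)]
modulated = Channel()
modulated_flags = [modulated.update(1.0 if n % 2 == 0 else 0.0) for n in range(50)]
```

Under these assumed parameters, a constant energy level never raises the modulation index, so the channel stays inactive, while energy that swings between loud and quiet frames drives the index above the threshold.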

8. A low-complexity and low-power detection apparatus for detecting an utterance of a pre-determined phrase, comprising:
- a first channel for processing a first audio stream and detecting activity in a first frequency band, wherein the first frequency band includes formant frequencies characteristic of a first type of speaker uttering a first vowel of the pre-determined phrase;
  
  a second channel for processing the first audio stream and detecting activity in a second frequency band, wherein the second frequency band includes formant frequencies characteristic of a second type of speaker different from the first type of speaker uttering the first vowel;
  
  a third channel for processing the first audio stream, detecting activity in a third frequency band, and rejecting wide band noise, wherein the third frequency band is substantially out-of-band with the first frequency band; and
  
  a first decision module to detect the utterance of the pre-determined phrase in the first audio stream if (1) one or both of the first channel and the second channel detect activity and (2) the third channel does not detect activity, and to not detect the utterance of the pre-determined phrase if the third channel detects activity;
  
  wherein the detection of the utterance of the pre-determined phrase by the first decision module triggers a process to be performed by a processor.
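
The decision rules of claims 1 and 8 differ only in how the first two channels are combined. A minimal sketch of both rules, assuming each channel has already been reduced to a boolean activity flag (the function names are hypothetical), might look like this:

```python
def voice_activity_present(first_active: bool, second_active: bool, third_active: bool) -> bool:
    """Claim 1 style rule: both formant channels active, out-of-band channel quiet."""
    return first_active and second_active and not third_active


def phrase_utterance_detected(first_active: bool, second_active: bool, third_active: bool) -> bool:
    """Claim 8 style rule: either speaker-type channel active, out-of-band channel quiet."""
    return (first_active or second_active) and not third_active
```

In both rules, activity in the third (out-of-band) channel vetoes the detection, which is how the claims describe rejecting wide band noise and reducing false positives.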

9. A method for low-complexity and low-power voice activity detection with reduced false positives, the method comprising:
- processing, in a first channel, a first audio stream and detecting sufficient variation in energy in a first frequency band, wherein the first frequency band includes a first group of formant frequencies characteristic of a first vowel;
  
  processing, in a second channel, the first audio stream and detecting sufficient variation in energy in a second frequency band, wherein the second frequency band includes a second group of formant frequencies characteristic of a second vowel;
  
  processing, in a third channel, the first audio stream and detecting activity in frequencies substantially out-of-band with the first frequency band, wherein the activity indicates wide band noise;
  
  determining that voice activity is present in the first audio stream if (1) both the first channel and the second channel detect sufficient variation in energy, and (2) the third channel detects insufficient activity;
  
  determining that voice activity is not present in the first audio stream if the third channel detects sufficient activity; and
  
  triggering a process to be performed by a processor in response to determining that voice activity is present.
- Dependent Claims (10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
- - 10. The method of claim 9, further comprising:
    - detecting sequential vowel sounds by applying a detection output of the first channel as a gate to a detection output of the second channel to ensure that detection of sufficient variation in energy in the second channel is preceded by detection of sufficient variation in energy in the first channel.
  - 11. The method of claim 10, wherein the gate has a time-out to ensure that the gate is temporary.
  - 12. The method of claim 11, wherein the time-out is weighted in time.
  - 13. The method of claim 9, further comprising:
    - detecting sequential vowels by, in response to detecting sufficient variation in energy in the first channel, adjusting a threshold parameter of the second channel.
  - 14. The method of claim 13, wherein adjusting the threshold parameter comprises:
    - temporarily relaxing the threshold parameter; and
      
      over time, tightening the threshold parameter as time passes further from the detection of sufficient variation in energy in the first channel.
  - 15. The method of claim 9, wherein:
    - processing in the first channel the first audio stream comprises filtering the first audio stream by a first biquad filter; and
      
      processing in the second channel the first audio stream comprises filtering the first audio stream by a second biquad filter.
  - 16. The method of claim 9, wherein:
    - processing in the first channel the first audio stream comprises filtering the first audio stream by a first Finite Impulse Response filter; and
      
      processing in the second channel the first audio stream comprises filtering the first audio stream by a second Finite Impulse Response filter;
      
      wherein the first and second Finite Impulse Response filters comprise taps which respond to (1) formant frequencies characteristic of the first and second vowel respectively, and (2) a timing relationship between the first and second vowel in a predetermined word or phrase.
  - 17. The method of claim 9, wherein processing the first audio stream in the first channel comprises:
    - generating a first filtered audio stream by passing frequencies in the first frequency band and attenuating frequencies outside of the first frequency band.
  - 18. The method of claim 9, wherein processing in the first channel comprises:
    - determining root mean squared values representing energy in the first frequency band of the first audio stream.
  - 19. The method of claim 9, wherein processing in the first channel comprises:
    - tracking peaks in energy in the first frequency band in a top tracker; and
      
      tracking quiet periods of energy in the first frequency band in a bottom tracker;
      
      wherein output values of the top tracker and the bottom tracker adapt to past behavior of the top tracker and bottom tracker.
  - 20. The method of claim 19, wherein processing in the first channel comprises:
    - determining a difference in output between the top tracker and the bottom tracker to obtain a modulation index; and
      
      comparing the modulation index against a threshold to detect the variation in energy.
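
The sequential-vowel ideas in claims 10 through 14 can be sketched under stated assumptions. The first part below models the temporary gate of claims 10 and 11: the second channel's detection counts only while a gate opened by an earlier first-channel detection has not timed out. The second part models the threshold adjustment of claims 13 and 14 as a linear return from a relaxed value to the base value. The names, frame counts, and the linear schedule are all hypothetical, and the time weighting of claim 12 is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SequentialGate:
    """Passes second-channel detections only shortly after a first-channel detection."""
    timeout_frames: int = 30   # assumed gate time-out, in frames
    frames_left: int = 0

    def update(self, first_active: bool, second_active: bool) -> bool:
        # The gate must have been opened by an earlier frame ("preceded by").
        gate_open = self.frames_left > 0
        if first_active:
            self.frames_left = self.timeout_frames   # open or re-open the gate
        elif gate_open:
            self.frames_left -= 1                     # count down toward time-out
        return second_active and gate_open


def second_channel_threshold(base: float, relaxed: float,
                             frames_since_first: Optional[int],
                             recovery_frames: int) -> float:
    """Threshold relaxed right after a first-channel detection, tightening over time."""
    if frames_since_first is None or frames_since_first >= recovery_frames:
        return base
    fraction = frames_since_first / recovery_frames
    return relaxed + fraction * (base - relaxed)


# Usage: the second vowel is accepted only inside the gate window.
gate = SequentialGate(timeout_frames=3)
too_early = gate.update(False, True)     # no first-channel detection yet
opened = gate.update(True, False)        # first channel opens the gate
in_window = gate.update(False, True)     # second channel shortly afterwards
gate.update(False, False)
gate.update(False, False)
too_late = gate.update(False, True)      # gate has timed out
```

Gating and threshold relaxation are alternative ways to favor a second vowel that follows a first one. Either could be placed between the per-channel detectors and the final decision step.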

Current Assignee: Analog Devices International Unlimited Company (Analog Devices, Inc.)
Original Assignee: Analog Devices Global Unlimited Company (Analog Devices, Inc.)
Inventors: Mortensen, Mikael M.; Berthelsen, Kim Spetzler; Adams, Robert; Milia, Andrew
Primary Examiner(s): Zhu, Richard Z
Application Number: US15/321,743
Publication Number: US 20170133041A1
Time in Patent Office: 1,477 Days
Field of Search: None
US Class Current
CPC Class Codes

G10L 15/02   Feature extraction for spee...

G10L 25/15   the extracted parameters be...

G10L 25/18   the extracted parameters be...

G10L 25/78   Detection of presence or ab...

G10L 25/81   for discriminating voice fr...

G10L 25/84   for discriminating voice fr...

G10L 25/87   Detection of discrete point...
