Low-complexity voice activity detection
First Claim
1. A low-complexity and low-power voice activity detector comprising:
a first channel for processing a first audio stream and detecting activity in a first frequency band, wherein the first frequency band includes a first group of formant frequencies characteristic of vowels;
a second channel for processing the first audio stream and detecting activity in a second frequency band, wherein the second frequency band includes a second group of formant frequencies characteristic of vowels;
a third channel for processing the first audio stream, detecting activity in a third frequency band, and reducing false positives, wherein the third frequency band is substantially out-of-band with the first frequency band; and
a first decision module to detect that voice activity is present in the first audio stream if (1) the first channel and the second channel both detect activity, and (2) the third channel does not detect activity, and to detect that voice activity is not present if the third channel detects activity;
wherein the detection that voice activity is present triggers one or more processes to be executed by a system.
Abstract
Many processes for audio signal processing can benefit from voice activity detection, which aims to detect the presence of speech as opposed to silence or noise. The present disclosure describes, among other things, leveraging energy-based features of voice and insights on first and second formant frequencies of vowels to provide a low-complexity and low-power voice activity detector. A pair of two channels is provided whereby each channel is configured to detect voice activity in respective frequency bands of interest. Simultaneous activity detected in both channels can be a sufficient condition for determining that voice is present. More channels or pairs of channels can be used to detect different types of voices to improve detection and/or to detect voices present in different audio streams.
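The two-channel idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not taken from the disclosure: the band edges (roughly where first and second vowel formants tend to fall), the threshold, the helper names, and the use of a naive DFT to measure band energy. A low-power design would likely use something far cheaper than a DFT; it stands in here only so the example is self-contained.

```python
import math

def band_energy(frame, sample_rate, f_lo, f_hi):
    """Sum of normalized squared DFT magnitudes for bins inside [f_lo, f_hi]."""
    n = len(frame)
    energy = 0.0
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        if f_lo <= freq <= f_hi:
            re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
            im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
            energy += (re * re + im * im) / (n * n)
    return energy

# Hypothetical band edges (Hz) and threshold, chosen only for this sketch.
F1_BAND = (300.0, 900.0)    # assumed range for first formants of vowels
F2_BAND = (1000.0, 2500.0)  # assumed range for second formants of vowels
THRESHOLD = 0.01

def channel_active(frame, sample_rate, band, threshold=THRESHOLD):
    """One channel: activity means band energy above a fixed threshold."""
    return band_energy(frame, sample_rate, band[0], band[1]) > threshold

def voice_detected(frame, sample_rate):
    """Simultaneous activity in both channels is treated as voice."""
    return (channel_active(frame, sample_rate, F1_BAND)
            and channel_active(frame, sample_rate, F2_BAND))

def tone(freqs, sample_rate=8000, n=256):
    """Test signal: a sum of unit-amplitude sine waves."""
    return [sum(math.sin(2 * math.pi * f * i / sample_rate) for f in freqs)
            for i in range(n)]
```

With these assumed settings, a frame containing tones in both bands, such as `tone([500.0, 1500.0])`, is flagged as voice, while a single 500 Hz tone or a silent frame is not.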
20 Claims
1. A low-complexity and low-power voice activity detector comprising:
a first channel for processing a first audio stream and detecting activity in a first frequency band, wherein the first frequency band includes a first group of formant frequencies characteristic of vowels;
a second channel for processing the first audio stream and detecting activity in a second frequency band, wherein the second frequency band includes a second group of formant frequencies characteristic of vowels;
a third channel for processing the first audio stream, detecting activity in a third frequency band, and reducing false positives, wherein the third frequency band is substantially out-of-band with the first frequency band; and
a first decision module to detect that voice activity is present in the first audio stream if (1) the first channel and the second channel both detect activity, and (2) the third channel does not detect activity, and to detect that voice activity is not present if the third channel detects activity;
wherein the detection that voice activity is present triggers one or more processes to be executed by a system.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A low-complexity and low-power detection apparatus for detecting an utterance of a pre-determined phrase, comprising:
a first channel for processing a first audio stream and detecting activity in a first frequency band, wherein the first frequency band includes formant frequencies characteristic of a first type of speaker uttering a first vowel of the pre-determined phrase;
a second channel for processing the first audio stream and detecting activity in a second frequency band, wherein the second frequency band includes formant frequencies characteristic of a second type of speaker different from the first type of speaker uttering the first vowel;
a third channel for processing the first audio stream, detecting activity in a third frequency band, and rejecting wide band noise, wherein the third frequency band is substantially out-of-band with the first frequency band; and
a first decision module to detect the utterance of the pre-determined phrase voice activity is present in the first audio stream if (1) one or both the first channel and the second channel detect activity and (2) the third channel does not detect activity, and not detect the utterance of the pre-determined phrase if the third channel detects activity;
wherein the detection of the utterance of the pre-determined phrase by the first decision module triggers a process to be performed by a processor.
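The decision modules of claims 1 and 8 differ only in how the first two channels are combined: claim 1 requires both to detect activity, while claim 8 accepts one or both. In each case activity in the third channel vetoes detection. A minimal sketch of the two rules, with hypothetical function names and boolean channel outputs assumed as inputs:

```python
from itertools import product

def voice_activity(ch1, ch2, ch3):
    """Claim 1 style rule: both formant channels active and the third channel quiet."""
    return ch1 and ch2 and not ch3

def phrase_utterance(ch1, ch2, ch3):
    """Claim 8 style rule: either speaker-type channel active and the third channel quiet."""
    return (ch1 or ch2) and not ch3

if __name__ == "__main__":
    # Print the full truth table for both rules.
    for ch1, ch2, ch3 in product([False, True], repeat=3):
        print(ch1, ch2, ch3,
              voice_activity(ch1, ch2, ch3),
              phrase_utterance(ch1, ch2, ch3))
```

The only input on which the two rules disagree is exactly one of the first two channels being active while the third is quiet.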
9. A method for low-complexity and low-power voice activity detection with reduced false positives, the method comprising:
processing, in a first channel, a first audio stream and detecting sufficient variation in energy in a first frequency band, wherein the first frequency band includes a first group of formant frequencies characteristic of a first vowel;
processing, in a second channel, the first audio stream and detecting sufficient variation in energy in a second frequency band, wherein the second frequency band includes a second group of formant frequencies characteristic of a second vowel;
processing, in a third channel, the first audio stream and detecting activity in frequencies substantially out-of-band with the first frequency band, wherein the activity indicates wide band noise;
determining that voice activity is present in the first audio stream if (1) both the first channel and the second channel detect sufficient variation in energy, and (2) the third channel detects insufficient activity;
determining that voice activity is not present in the first audio stream if the third channel detects sufficient activity; and
triggering a process to be performed by a processor in response to determining that voice activity is present.
Dependent claims: 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
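Claim 9 turns on "sufficient variation in energy" within a band rather than on absolute level. The claim text above does not say how that variation is measured, so the following is only one possible reading: track the per-frame band energy in decibels over a short sliding window and report activity when the spread between the highest and lowest level exceeds a threshold. The class name, window length, and 6 dB threshold are all assumptions made for this sketch.

```python
from collections import deque
import math

class EnergyVariationDetector:
    """Flags a band as active when its energy swings enough within a short window."""

    def __init__(self, window=8, min_swing_db=6.0):
        # window and min_swing_db are illustrative values, not taken from the claims.
        self.history = deque(maxlen=window)
        self.min_swing_db = min_swing_db

    def update(self, frame_energy):
        """Add one frame's band energy and return True if the variation is sufficient."""
        # A small offset avoids log10(0) on silent frames.
        level_db = 10.0 * math.log10(frame_energy + 1e-12)
        self.history.append(level_db)
        swing = max(self.history) - min(self.history)
        return swing >= self.min_swing_db
```

Under this reading a steady tone or stationary noise produces little swing and is ignored, while the rising and falling energy of speech in the band would register as activity. One such detector per channel could feed the determining steps of the claim.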
Specification