AUGMENTED MULTI-TIER CLASSIFIER FOR MULTI-MODAL VOICE ACTIVITY DETECTION
First Claim
1. A method comprising:
receiving, from a first classifier, a first voice activity indicator detected in a first modality for a human subject;
receiving, from a second classifier, a second voice activity indicator detected in a second modality for the human subject, wherein the first voice activity indicator and the second voice activity indicators are based on the human subject at a same time, and wherein the first modality and the second modality are different;
concatenating, via a third classifier, the first voice activity indicator and the second voice activity indicator with original features of the human subject, to yield a classifier output; and
determining voice activity based on the classifier output.
Abstract
Disclosed herein are systems, methods, and computer-readable storage media for detecting voice activity in a media signal in an augmented, multi-tier classifier architecture. A system configured to practice the method can receive, from a first classifier, a first voice activity indicator detected in a first modality for a human subject. Then, the system can receive, from a second classifier, a second voice activity indicator detected in a second modality for the human subject, wherein the first voice activity indicator and the second voice activity indicator are based on the human subject at a same time, and wherein the first modality and the second modality are different. The system can concatenate, via a third classifier, the first voice activity indicator and the second voice activity indicator with original features of the human subject, to yield a classifier output, and determine voice activity based on the classifier output.
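The multi-tier arrangement described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the choice of audio and visual as the two modalities, the logistic form of each classifier, every weight and feature value, and all function names (`make_linear_classifier`, `augmented_input`, `detect_voice_activity`) are hypothetical assumptions made only for illustration. The sketch also performs the concatenation just before calling the third classifier, which is one possible reading of "concatenating, via a third classifier".

```python
import math
from typing import Callable, List, Sequence

# A classifier here is any function mapping a feature vector to a score in (0, 1).
Classifier = Callable[[Sequence[float]], float]


def make_linear_classifier(weights: Sequence[float], bias: float) -> Classifier:
    """Build a toy logistic classifier (hypothetical stand-in for a trained model)."""
    def classify(features: Sequence[float]) -> float:
        if len(features) != len(weights):
            raise ValueError("feature vector length does not match weights")
        z = sum(w * f for w, f in zip(weights, features)) + bias
        return 1.0 / (1.0 + math.exp(-z))
    return classify


def augmented_input(audio_features: Sequence[float],
                    visual_features: Sequence[float],
                    audio_clf: Classifier,
                    visual_clf: Classifier) -> List[float]:
    """Concatenate both first-tier indicators with the original features."""
    first_indicator = audio_clf(audio_features)     # first modality (assumed: audio)
    second_indicator = visual_clf(visual_features)  # second modality (assumed: visual)
    return [first_indicator, second_indicator, *audio_features, *visual_features]


def detect_voice_activity(audio_features: Sequence[float],
                          visual_features: Sequence[float],
                          audio_clf: Classifier,
                          visual_clf: Classifier,
                          fusion_clf: Classifier,
                          threshold: float = 0.5) -> bool:
    """Second tier: score the augmented vector and threshold the classifier output."""
    x = augmented_input(audio_features, visual_features, audio_clf, visual_clf)
    return fusion_clf(x) >= threshold


# Hypothetical classifiers; all weights are made up for the example.
audio_clf = make_linear_classifier([2.0, -1.0], 0.0)
visual_clf = make_linear_classifier([3.0], -1.0)
# The fusion classifier takes 2 indicators + 2 audio features + 1 visual feature.
fusion_clf = make_linear_classifier([2.0, 2.0, 0.5, 0.0, 0.5], -2.5)

# Both feature vectors are taken from the same time instant for the same subject.
speaking = detect_voice_activity([1.5, 0.2], [0.9], audio_clf, visual_clf, fusion_clf)
silent = detect_voice_activity([0.0, 0.8], [0.0], audio_clf, visual_clf, fusion_clf)
```

The point of the augmentation is that the third classifier sees the original features alongside the two first-tier indicators, rather than the indicators alone, so it can weigh each modality's decision against the raw evidence that produced it.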
66 Citations
20 Claims
1. A method comprising:
receiving, from a first classifier, a first voice activity indicator detected in a first modality for a human subject;
receiving, from a second classifier, a second voice activity indicator detected in a second modality for the human subject, wherein the first voice activity indicator and the second voice activity indicators are based on the human subject at a same time, and wherein the first modality and the second modality are different;
concatenating, via a third classifier, the first voice activity indicator and the second voice activity indicator with original features of the human subject, to yield a classifier output; and
determining voice activity based on the classifier output.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system comprising:
a processor; and
a computer-readable medium having instructions which, when executed by the processor, cause the processor to perform operations comprising:
receiving, from a first classifier, a first voice activity indicator detected in a first modality for a human subject;
receiving, from a second classifier, a second voice activity indicator detected in a second modality for the human subject, wherein the first voice activity indicator and the second voice activity indicators are based on the human subject at a same time, and wherein the first modality and the second modality are different;
concatenating, via a third classifier, the first voice activity indicator and the second voice activity indicator with original features of the human subject, to yield a classifier output; and
determining voice activity based on the classifier output.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A computer-readable storage medium storing instructions, which cause a processor to perform operations comprising:
receiving, from a first classifier, first output generated based on a first modality of an input stream;
receiving, from a second classifier, second output generated based on a second modality of the input stream, wherein the first output and the second output are associated with a same time, and wherein the first modality and the second modality are different;
concatenating, via a third classifier, the first output and the second output with original features of the input stream, to yield a classifier output; and
determining whether a desired activity is present in the input stream based on the classifier output.
Dependent claims: 16, 17, 18, 19, 20
Specification