System and method for audio/video speaker detection
Abstract
A system and method for detecting speech utilizing audio and video inputs. In one aspect, the invention collects audio data generated from a microphone device. In another aspect, the invention collects video data and processes it to determine a mouth location for a given speaker. The audio and video are input into a time-delay neural network that processes the data to determine which target is speaking. The neural network processing is based upon a correlation between mouth movement detected in the video data and sounds detected by the microphone.
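The time-delay neural network named in the abstract is, in modern terms, a stack of 1-D convolutions over the time axis of the synchronized feature streams. Below is a minimal sketch in PyTorch, assuming one audio-energy value and one mouth-openness value per video frame; the layer sizes and context window are illustrative choices, since the patent does not fix an architecture beyond "time delay neural network".

```python
import torch
import torch.nn as nn

class SpeakingTDNN(nn.Module):
    """Illustrative time-delay neural network: 1-D convolutions over a
    synchronized 2-channel (audio energy, mouth openness) sequence.
    All sizes are assumptions, not taken from the patent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, 8, kernel_size=5, padding=2),  # local time context
            nn.ReLU(),
            nn.Conv1d(8, 8, kernel_size=5, padding=2),  # widens the receptive field
            nn.ReLU(),
            nn.Conv1d(8, 1, kernel_size=1),             # per-frame score
            nn.Sigmoid(),                               # P(speaking) per frame
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 2, time) -> (batch, time) speaking probabilities
        return self.net(x).squeeze(1)
```

Because the final layer emits one sigmoid score per frame, such a network could be trained with a per-frame binary cross-entropy loss against speaking/not-speaking labels, which matches the patent's framing of speech detection as a correlation of the two streams over a time window.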
Claims
1. A computer-implemented process for detecting speech, comprising the process actions of:

inputting associated audio and video training data containing a person's face that is periodically speaking; and

using said audio and video signals to train a time delay neural network to determine when a person is speaking, wherein said training comprises the following process actions:

computing audio features from said audio training data, wherein said audio feature is the energy over an audio frame;

computing video features from said video training signals, wherein said video feature is the degree to which said person's mouth is open or closed; and

correlating said audio features and video features to determine when a person is speaking.

Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
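Claim 1's audio feature is the energy over an audio frame. A minimal sketch of that computation follows; the function name and the 533-sample frame length (roughly one 30 fps video frame of 16 kHz audio, keeping the two streams aligned) are illustrative assumptions, not from the patent.

```python
import numpy as np

def frame_energy(samples: np.ndarray, frame_len: int = 533) -> np.ndarray:
    """Energy of each non-overlapping audio frame (sum of squared samples).

    frame_len is an assumption: ~533 samples approximates 1/30 s of
    16 kHz audio, i.e. one video frame's worth of audio.
    """
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Cast before squaring so integer input (e.g. int16 PCM) cannot overflow.
    return np.sum(frames.astype(np.float64) ** 2, axis=1)
```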
13. A computer-readable medium encoded with computer executable instructions for use in detecting when a person in a synchronized audio-video clip is speaking, said computer executable instructions comprising:

inputting one or more captured video and synchronized audio clips;

segmenting said audio and video clips to remove portions of said video and synchronized audio clips not needed in determining if a speaker in the captured video and synchronized audio clips is speaking;

extracting audio and video features in said captured video and synchronized audio clips to be used in determining if a speaker in the captured video and synchronized audio clips is speaking, wherein an audio feature is the energy over an audio frame and wherein said video feature is the openness of a person's mouth; and

training a Time Delay Neural Network to determine when a person is speaking using said extracted audio and video features.

Dependent claims: 14, 15, 16, 17, 18.
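Claim 13 extracts one audio feature and one video feature per synchronized frame before training. A sketch of how the two per-frame streams might be paired into a single sequence for the network; the energy normalization is an assumption the claim does not specify.

```python
import numpy as np

def build_training_sequence(audio_energy: np.ndarray,
                            mouth_openness: np.ndarray) -> np.ndarray:
    """Pair per-frame audio energy with the per-frame mouth-openness
    value (-1 closed .. +1 open) into one synchronized 2-channel sequence.

    Both inputs are assumed to be sampled at the video frame rate, which
    is what the segmentation step above would produce.
    """
    n = min(len(audio_energy), len(mouth_openness))
    # Scale energy so the two channels are comparable (an assumption;
    # the patent does not specify a normalization).
    e = audio_energy[:n] / (np.max(audio_energy[:n]) + 1e-12)
    return np.stack([e, mouth_openness[:n]], axis=0)  # shape (2, n)
```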
19. A computer-readable medium encoded with computer executable instructions for use in detecting when a person in a synchronized audio-video clip is speaking, said computer executable instructions comprising:

inputting one or more captured video and synchronized audio clips;

segmenting said audio and video clips to remove portions of said video and synchronized audio clips not needed in determining if a speaker in the captured video and synchronized audio clips is speaking;

extracting audio and video features in said captured video and synchronized audio clips to be used in determining if a speaker in the captured video and synchronized audio clips is speaking, wherein said audio feature is the energy over an audio frame and wherein said video feature is the openness of a person's mouth, comprising sub-instructions for extracting video features comprising:

using a face detector to locate a face in said video training signals;

using the geometry of a typical face to estimate the location of a mouth and extracting a mouth image, wherein said sub-instruction for using the geometry of a typical face to estimate the location of a mouth and extracting a mouth image comprises sub-instructions for:

using a generic model of a head, designating the mouth region to be centered at a given distance of the head height from the top of the head model;

designating the width of the mouth as a percentage of the height of the head;

designating the height of the mouth as a percentage of the height of the head model;

stabilizing the mouth image to remove any translational motion of the mouth caused by head movement;

using a Linear Discriminant Analysis (LDA) projection to determine if the mouth in the segmented mouth image is open or closed; and

designating values of mouth openness, wherein the values range from −1 for the mouth being closed to +1 for the mouth being open; and

training a Time Delay Neural Network to determine when a person is speaking using said extracted audio and video features.
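Claim 19 spells out the video-feature pipeline: a mouth box estimated from a generic head model, followed by an LDA projection that maps the mouth image to an openness value in [−1, +1]. The sketch below follows those claim steps; the geometric fractions, the tanh squashing, and the function names are illustrative placeholders, not values from the patent.

```python
import numpy as np

def mouth_region(face_top: int, face_left: int, head_h: int, head_w: int,
                 center_frac: float = 0.8, w_frac: float = 0.5,
                 h_frac: float = 0.25):
    """Estimate a mouth bounding box from a generic head model.

    The fractions are placeholders: the claim only says the mouth is
    centered at a given fraction of the head height from the top, with
    width and height given as percentages of the head height.
    """
    cy = face_top + int(center_frac * head_h)   # mouth center, vertical
    cx = face_left + head_w // 2                # mouth center, horizontal
    mw = int(w_frac * head_h)                   # mouth width
    mh = int(h_frac * head_h)                   # mouth height
    return cy - mh // 2, cx - mw // 2, mh, mw   # top, left, height, width

def mouth_openness(mouth_img: np.ndarray, lda_axis: np.ndarray,
                   threshold: float = 0.0) -> float:
    """Project a flattened, mean-centered (and stabilized) mouth image
    onto a trained LDA axis and map the score into [-1, +1]:
    -1 for closed, +1 for open.

    lda_axis would come from fitting LDA on labeled open/closed mouth
    crops; np.tanh is an illustrative squashing, not the patent's mapping.
    """
    score = float(mouth_img.ravel() @ lda_axis) - threshold
    return float(np.tanh(score))
```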
20. A system for detecting a speaker in a video segment that is synchronized with associated audio, the system comprising:

a general purpose computing device; and

a computer readable medium encoded with instructions capable of being executed by a computing device, wherein the computing device is directed by the program modules of the computer program to:

input one or more captured video and synchronized audio segments;

segment said audio and video segments to remove portions of said video and synchronized audio segments not needed in determining if a speaker in the captured video and synchronized audio segments is speaking;

extract audio and video features in said captured video and synchronized audio segments to be used in determining if a speaker in the captured video and synchronized audio segments is speaking, wherein said audio feature is the energy over an audio frame and said video feature is the openness of a person's mouth in said video and synchronized audio segments;

train a Time Delay Neural Network to determine when a person is speaking using said extracted audio and video features;

input a captured video and synchronized audio clip for which it is desired to detect a person speaking; and

use said trained Time Delay Neural Network to determine when a person is speaking in the captured video and synchronized audio segments for which it is desired to detect a person speaking.

Dependent claims: 21, 22, 23.
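Claim 20 assembles the pieces into a train-then-detect system. The glue-code sketch below reuses the hypothetical helpers from the earlier sketches (frame_energy, build_training_sequence, SpeakingTDNN); the optimizer, loss, and epoch count are assumptions, not details from the patent.

```python
import numpy as np
import torch

# Assumes frame_energy, build_training_sequence, and SpeakingTDNN from
# the earlier sketches are in scope; all are illustrative helpers.

def train_detector(audio: np.ndarray, openness: np.ndarray,
                   labels: np.ndarray, epochs: int = 50) -> "SpeakingTDNN":
    """Train on one clip. labels holds 1.0 for frames where the person
    is speaking, else 0.0 (per-frame supervision is an assumption)."""
    seq = build_training_sequence(frame_energy(audio), openness)
    x = torch.tensor(seq, dtype=torch.float32).unsqueeze(0)        # (1, 2, T)
    y = torch.tensor(labels[: x.shape[-1]],
                     dtype=torch.float32).unsqueeze(0)             # (1, T)
    model = SpeakingTDNN()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.BCELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return model
```

At detection time the same feature extraction would run on a new clip, and frames where the model output exceeds a chosen threshold would be reported as speech.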
24. A computer-implemented process for detecting speech in an audio-visual sequence wherein more than one person is speaking at a time, comprising the process actions of:

inputting associated audio and video training data containing more than one person's face, wherein each person is periodically speaking at the same time as the other person or persons; and

using said audio and video signals to train a time delay neural network to determine which person is speaking at a given time, wherein said training comprises the following process actions:

computing audio features from said audio training data, wherein said audio feature is the energy over an audio frame;

computing video features from said video training signals to determine whether a given person's mouth is open or closed; and

correlating said audio features and video features to determine when a given person is speaking.

Dependent claims: 25, 26, 27.
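Claim 24 extends detection to overlapping speakers: each tracked face gets its own audio-video correlation score, and the active speaker at each frame is whoever scores highest. A sketch of that per-frame selection follows; the 0.5 threshold is an illustrative assumption.

```python
import numpy as np

def active_speaker(scores_per_person: np.ndarray,
                   min_score: float = 0.5) -> np.ndarray:
    """Pick the active speaker per frame from per-person speaking scores
    (shape: n_people x n_frames, e.g. one TDNN output row per face).

    Returns the index of the best-correlated person in each frame, or
    -1 when nobody clears the threshold (threshold is an assumption).
    """
    n_frames = scores_per_person.shape[1]
    best = np.argmax(scores_per_person, axis=0)
    best_score = scores_per_person[best, np.arange(n_frames)]
    return np.where(best_score >= min_score, best, -1)
```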
28. A computer-implemented process for detecting speech, comprising the process actions of:

inputting associated audio and video training data containing a person's face that is periodically speaking; and

using said audio and video signals to train a statistical learning engine to determine when a person is speaking, wherein said training comprises the following process actions:

computing audio features from said audio training data, wherein said audio feature is the acoustical energy over an entire audio frame;

computing video features from said video training signals, wherein said video feature is the degree to which said person's mouth is open or closed; and

correlating said audio features and video features to determine when a person is speaking.

Dependent claims: 29, 30.
Specification