PRIMARY SPEAKER IDENTIFICATION FROM AUDIO AND VIDEO DATA
Abstract
An aspect provides a method, including: receiving image data from a visual sensor of an information handling device; receiving audio data from one or more microphones of the information handling device; identifying, using one or more processors, human speech in the audio data; identifying, using the one or more processors, a pattern of visual features in the image data associated with speaking; matching, using the one or more processors, the human speech in the audio data with the pattern of visual features in the image data associated with speaking; selecting, using the one or more processors, a primary speaker from among matched human speech; assigning control to the primary speaker; and performing one or more actions based on audio input of the primary speaker. Other aspects are described and claimed.
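The abstract does not say how human speech is identified in the audio data. A minimal sketch follows. It assumes mono samples at 16 kHz normalized to [-1, 1] and uses a frame-energy threshold as a stand-in voice activity detector. Note that this flags loud frames rather than specifically human speech, so a practical system would more likely use a trained speech detector. The function names, the 10 ms frame length and the threshold value are hypothetical.

```python
import math
from typing import List, Sequence

def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """RMS energy of consecutive, non-overlapping frames (a trailing partial frame is dropped)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return energies

def detect_speech_frames(samples: Sequence[float], frame_len: int = 160,
                         threshold: float = 0.02) -> List[bool]:
    """Flag frames whose RMS energy exceeds a fixed threshold (a crude voice activity detector)."""
    return [energy > threshold for energy in frame_energies(samples, frame_len)]

# 20 ms of silence followed by 20 ms of a 200 Hz tone, sampled at 16 kHz.
samples = [0.0] * 320 + [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(320)]
print(detect_speech_frames(samples))  # → [False, False, True, True]
```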
22 Claims
1. A method, comprising:
receiving image data from a visual sensor of an information handling device;
receiving audio data from one or more microphones of the information handling device;
identifying, using one or more processors, human speech in the audio data;
identifying, using the one or more processors, a pattern of visual features in the image data associated with speaking;
matching, using the one or more processors, the human speech in the audio data with the pattern of visual features in the image data associated with speaking;
selecting, using the one or more processors, a primary speaker from among matched human speech;
assigning control to the primary speaker; and
performing one or more actions based on audio input of the primary speaker.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10.
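One way to picture the matching step is offered here only as a sketch. First, express the detected speech activity at the video frame rate. Next, measure how far each detected face's mouth is open in every frame. Then treat a face as matched when the two series are strongly correlated. The mouth-openness input, the Pearson-correlation test and the 0.5 cut-off are all assumptions, not the matching technique recited in the claim. The face and lip-landmark detection that would produce the mouth-openness series is not shown.

```python
import math
from typing import Dict, Mapping, Sequence

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0.0 or spread_y == 0.0:
        return 0.0  # a flat signal carries no timing information to match on
    return cov / (spread_x * spread_y)

def match_speech_to_faces(speech_activity: Sequence[float],
                          mouth_openness: Mapping[str, Sequence[float]],
                          min_corr: float = 0.5) -> Dict[str, float]:
    """Keep faces whose mouth movement rises and falls with the detected speech.

    speech_activity: per-video-frame speech score (e.g. 1 = speech, 0 = silence).
    mouth_openness: per-face, per-video-frame lip-opening measure (hypothetical input).
    """
    matches = {}
    for face_id, series in mouth_openness.items():
        if len(series) != len(speech_activity):
            raise ValueError(f"face {face_id!r}: series length does not match audio")
        score = pearson(speech_activity, series)
        if score >= min_corr:
            matches[face_id] = score
    return matches

speech = [0, 0, 1, 1, 0, 1, 1, 0]
faces = {
    "A": [0.1, 0.1, 0.9, 0.8, 0.1, 0.9, 0.7, 0.1],  # lips move with the speech
    "B": [0.5] * 8,                                  # face visible, mouth still
    "C": [1, 1, 0, 0, 1, 0, 0, 1],                   # moves only when the audio is silent
}
print(match_speech_to_faces(speech, faces))  # only "A" is matched
```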
11. An information handling device, comprising:
a visual sensor;
one or more microphones;
one or more processors; and
a memory storing code executable by the one or more processors to:
receive image data from the visual sensor;
receive audio data from the one or more microphones;
identify human speech in the audio data;
identify a pattern of visual features in the image data associated with speaking;
match the human speech in the audio data with the pattern of visual features in the image data associated with speaking;
select a primary speaker from among matched human speech;
assign control to the primary speaker; and
perform one or more actions based on audio input of the primary speaker.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19.
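The claims reproduced here do not specify how the primary speaker is chosen when several speakers are matched, or how assigned control is enforced. The following minimal sketch makes two assumptions. First, each matched speaker already carries a numeric audio-visual agreement score, and the highest score wins. Second, "control" means that only the primary speaker's spoken commands trigger actions. The selection rule, the `ControlGate` class and the command names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Mapping, Optional

def select_primary_speaker(match_scores: Mapping[str, float]) -> Optional[str]:
    """Hypothetical rule: the matched speaker with the strongest audio-visual score wins."""
    if not match_scores:
        return None
    return max(match_scores, key=lambda speaker: match_scores[speaker])

@dataclass
class ControlGate:
    """Runs voice commands only when they come from the speaker holding control."""
    actions: Dict[str, Callable[[], str]]
    controller: Optional[str] = None

    def assign_control(self, speaker_id: Optional[str]) -> None:
        self.controller = speaker_id

    def handle(self, speaker_id: str, command: str) -> Optional[str]:
        if self.controller is None or speaker_id != self.controller:
            return None  # audio input from anyone else is ignored
        action = self.actions.get(command)
        return action() if action is not None else None

gate = ControlGate(actions={"play music": lambda: "playing music"})
gate.assign_control(select_primary_speaker({"A": 0.95, "C": 0.61}))
print(gate.handle("C", "play music"))  # None: C does not hold control
print(gate.handle("A", "play music"))  # playing music
```

Keeping the gate separate from the selection rule means a different selection policy (for example, longest continuous speaking time) could be substituted without changing how commands are filtered.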
20. A program product, comprising:
a computer readable storage medium storing instructions executable by one or more processors, the instructions comprising:
computer readable program code configured to receive image data from a visual sensor of an information handling device;
computer readable program code configured to receive audio data from one or more microphones of the information handling device;
computer readable program code configured to identify, using one or more processors, human speech in the audio data;
computer readable program code configured to identify, using the one or more processors, a pattern of visual features in the image data associated with speaking;
computer readable program code configured to match, using the one or more processors, the human speech in the audio data with the pattern of visual features in the image data associated with speaking;
computer readable program code configured to select, using the one or more processors, a primary speaker from among matched human speech;
computer readable program code configured to assign control to the primary speaker; and
computer readable program code configured to perform one or more actions based on audio input of the primary speaker.
21. An information handling device, comprising:
a visual sensor;
two or more microphones;
one or more processors; and
a memory storing code executable by the one or more processors to:
receive image data from the visual sensor;
receive audio data from the two or more microphones;
identify human speech in the audio data;
identify a pattern of visual features in the image data associated with speaking utilizing directional information in the audio data received to identify the pattern of visual features associated with speaking;
match the human speech in the audio data with the pattern of visual features in the video data associated with speaking;
identify matched human speech as a primary speaker; and
perform one or more actions based on the primary speaker identified.
Dependent claim: 22.
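Claim 21 uses directional information from two or more microphones to find the speaking face. It does not state how that direction is obtained. The sketch below is one illustration under stated assumptions. It estimates the time difference of arrival between two microphones by brute-force cross-correlation, which is a simple, swapped-in technique; more robust estimators such as GCC-PHAT exist. The delay is converted to a far-field azimuth and projected onto an image column with a pinhole camera model, and the detected face nearest that column is chosen. The sketch assumes that the microphones sit on a horizontal line with the camera, 0.2 m apart, sampled at 16 kHz. It also assumes that the camera axis is perpendicular to the microphone axis and that the "left" microphone is on the image's left side. All function names, constants and face positions are hypothetical.

```python
import math
import random
from typing import Mapping, Sequence

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at room temperature

def estimate_delay_samples(left: Sequence[float], right: Sequence[float], max_lag: int) -> int:
    """Lag (in samples) that maximizes the cross-correlation of the two channels.

    A positive result means the right channel lags the left one, i.e. the sound
    reached the left microphone first.
    """
    n = min(len(left), len(right))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(left[i] * right[i + lag] for i in range(n) if 0 <= i + lag < n)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def azimuth_radians(lag: int, sample_rate: float, mic_spacing_m: float) -> float:
    """Far-field direction of arrival; negative angles point toward the left microphone."""
    s = -SPEED_OF_SOUND_M_S * lag / (sample_rate * mic_spacing_m)
    return math.asin(max(-1.0, min(1.0, s)))  # clamp guards against noisy delays

def azimuth_to_column(theta: float, focal_px: float, center_x: float) -> float:
    """Project a horizontal angle onto an image column with a pinhole camera model."""
    return center_x + focal_px * math.tan(theta)

def face_in_audio_direction(face_columns: Mapping[str, float], column: float) -> str:
    """Pick the detected face whose horizontal center is closest to the sound direction."""
    return min(face_columns, key=lambda face: abs(face_columns[face] - column))

# Synthetic input: the same noise reaches the right microphone 2 samples later.
rng = random.Random(0)
source = [rng.uniform(-1.0, 1.0) for _ in range(402)]
left, right = source[2:], source[:400]
lag = estimate_delay_samples(left, right, max_lag=9)  # 0.2 m * 16 kHz / 343 m/s is about 9 samples
theta = azimuth_radians(lag, sample_rate=16000, mic_spacing_m=0.2)
column = azimuth_to_column(theta, focal_px=500.0, center_x=320.0)
print(lag, face_in_audio_direction({"A": 200.0, "B": 450.0}, column))
```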
Specification