Using speaker clustering to switch between different camera views in a video conference system
First Claim
1. A method comprising:
at a video conference endpoint including a microphone and a camera:
detecting a talker position of a talker based on audio detected by the microphone;
detecting one or more faces and face positions in video captured by the camera;
determining whether a detected talker position matches any detected face position;
if there is no match, performing speaker clustering across a speech segment in the detected audio and speech segments in previously detected audio;
if the speaker clustering indicates the talker is known, determining whether the detected talker position matches a previous closeup position of the talker, wherein the previous closeup position is a previously detected talker position that matches a previously detected face; and
based on results of the determining whether the detected talker position matches the previous closeup position of the talker, framing either a closeup camera view on the previous closeup position or a non-closeup camera view based on the detected talker position.
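The claimed method is a decision cascade: try to match the audio-derived talker position to a detected face; failing that, use speaker clustering to decide whether the talker is known and still at a stored closeup position. The sketch below illustrates that flow; all names, the `(x, y)` position representation, and the matching tolerance are illustrative assumptions, not taken from the patent.

```python
# Hypothetical sketch of the claimed view-selection logic.
# Positions are (x, y) pairs; the tolerance is an assumed threshold.
POSITION_TOLERANCE = 0.5  # meters; illustrative match threshold

def positions_match(a, b, tol=POSITION_TOLERANCE):
    """True if two positions coincide within the tolerance."""
    return abs(a[0] - b[0]) <= tol and abs(a[1] - b[1]) <= tol

def choose_view(talker_pos, face_positions, known_talker, previous_closeup_pos):
    """Return ('closeup', pos) or ('overview', pos) per the claimed cascade.

    talker_pos           -- position estimated from the microphone audio
    face_positions       -- positions of faces detected in the camera video
    known_talker         -- speaker clustering says this talker was heard before
    previous_closeup_pos -- stored closeup position for that talker, or None
    """
    # Talker position matches a detected face: frame a closeup there.
    if any(positions_match(talker_pos, f) for f in face_positions):
        return ("closeup", talker_pos)
    # No face match: rely on speaker clustering over past speech segments.
    # A known talker still at the stored closeup position gets that closeup.
    if known_talker and previous_closeup_pos is not None \
            and positions_match(talker_pos, previous_closeup_pos):
        return ("closeup", previous_closeup_pos)
    # Unknown talker, or a known talker who has moved: non-closeup view.
    return ("overview", talker_pos)
```

For example, a talker heard at `(1.0, 2.0)` with no visible face but a stored closeup at `(1.1, 2.2)` is reframed at the stored closeup, while a known talker detected far from any stored position falls back to the overview.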
Abstract
A video conference endpoint includes one or more cameras to capture video of different views and a microphone array to sense audio. One or more closeup views are defined. The endpoint detects faces in the captured video and active audio sources from the sensed audio. The endpoint detects any active talker whose detected face position coincides with a detected active audio source, and also uses speaker clustering to detect whether any active talker is associated with a previously stored closeup view. Based on whether an active talker is detected in any of the stored closeup views, the endpoint switches between capturing video of one of the closeup views and a best overview of the participants in the conference room.
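The abstract leans on speaker clustering to recognize a returning talker even when no face is currently visible. The patent does not specify a clustering method here, so the sketch below shows one common approach under assumed details: each speech segment is reduced to an embedding vector, and a segment is assigned to a known speaker when its cosine similarity to that speaker's centroid exceeds a threshold. All names and the threshold are hypothetical.

```python
import math

SIMILARITY_THRESHOLD = 0.8  # assumed; tune for the embedding model used

def cosine(u, v):
    """Cosine similarity of two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster_segment(embedding, known_speakers, threshold=SIMILARITY_THRESHOLD):
    """Match a new speech segment's embedding against known speakers.

    known_speakers -- dict mapping speaker id -> centroid embedding
    Returns the best-matching speaker id, or None if the talker is new.
    """
    best_id, best_sim = None, threshold
    for spk_id, centroid in known_speakers.items():
        sim = cosine(embedding, centroid)
        if sim >= best_sim:
            best_id, best_sim = spk_id, sim
    return best_id
```

A segment matched to a known speaker id can then be looked up against that speaker's stored closeup position; an unmatched segment means a new talker, for which no closeup exists yet.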
20 Claims
1. A method comprising:
at a video conference endpoint including a microphone and a camera:
detecting a talker position of a talker based on audio detected by the microphone;
detecting one or more faces and face positions in video captured by the camera;
determining whether a detected talker position matches any detected face position;
if there is no match, performing speaker clustering across a speech segment in the detected audio and speech segments in previously detected audio;
if the speaker clustering indicates the talker is known, determining whether the detected talker position matches a previous closeup position of the talker, wherein the previous closeup position is a previously detected talker position that matches a previously detected face; and
based on results of the determining whether the detected talker position matches the previous closeup position of the talker, framing either a closeup camera view on the previous closeup position or a non-closeup camera view based on the detected talker position.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
11. An apparatus comprising:
one or more cameras controllable to capture video of different views;
a microphone array to sense audio; and
a processor, coupled to the one or more cameras and the microphone array, to:
detect a talker position of a talker based on audio detected by the microphone;
detect one or more faces and face positions in video captured by the camera;
determine whether a detected talker position matches any detected face position;
if there is no match, perform speaker clustering across a speech segment in the detected audio and speech segments in previously detected audio;
if the speaker clustering indicates the talker is known, determine whether the detected talker position matches a previous closeup position of the talker, wherein the previous closeup position is a previously detected talker position that matches a previously detected face; and
based on results of determining whether the detected talker position matches the previous closeup position of the talker, frame either a closeup camera view on the previous closeup position or a non-closeup camera view based on the detected talker position.
- View Dependent Claims (12, 13, 14, 15)
16. A non-transitory processor readable medium storing instructions that, when executed by a processor, cause the processor to:
detect a talker position of a talker based on audio detected by one or more microphones;
detect one or more faces and face positions in video captured by one or more cameras;
determine whether the detected talker position matches any detected face position;
if there is no match, perform speaker clustering across a speech segment in the detected audio and speech segments in previously detected audio;
if the speaker clustering indicates the talker is known, determine whether the detected talker position matches a previous closeup position of the talker, wherein the previous closeup position is a previously detected talker position that matches a previously detected face; and
based on results of determining whether the detected talker position matches the previous closeup position of the talker, frame either a closeup camera view on the previous closeup position or a non-closeup camera view based on the detected talker position.
- View Dependent Claims (17, 18, 19, 20)