Speaker anticipation
First Claim
1. A method of anticipating a video switch to accommodate a new speaker in a video conference comprising:
analyzing, by at least one speaker anticipation model, a real time video stream captured by a camera local to a first videoconference endpoint;
predicting, by the at least one speaker anticipation model, that a new speaker is about to speak, the at least one speaker anticipation model trained by a guided learning dataset from historical video feeds derived from a series of labeled video frames from the historical video feeds of the new speaker, each of the labeled video frames including one of two different labels based on audio;
sending video of the new speaker to a conferencing server in response to a request for the video of the new speaker from the conferencing server; and
distributing, via the conferencing server, the video of the new speaker to a second videoconference endpoint, wherein each of the labeled video frames is default-labeled as a pre-speaking frame except for any of the labeled video frames including audio from a same endpoint, and each of the labeled video frames including the audio from the same endpoint is labeled as speaking frames.
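The two-label training scheme in the wherein clause above can be sketched as a minimal labeling routine: every frame defaults to the pre-speaking label, and only frames that coincide with audio from the same endpoint get the speaking label. The function and variable names are illustrative assumptions, not terms from the patent.

```python
def label_frames(frames, audio_frames):
    """Assign one of two audio-based labels to each historical video frame.

    frames: iterable of frame identifiers from the historical video feed.
    audio_frames: set of frame identifiers during which the same endpoint
        produced audio.
    """
    labels = {}
    for frame in frames:
        if frame in audio_frames:
            # Frames including audio from the same endpoint: speaking.
            labels[frame] = "speaking"
        else:
            # Default label for every other frame: pre-speaking.
            labels[frame] = "pre-speaking"
    return labels
```

Frames labeled this way form the guided learning dataset the claim describes; the pre-speaking frames are what let a model learn the visual cues that precede speech.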
Abstract
Systems and methods are disclosed for anticipating a video switch to accommodate a new speaker in a video conference. A real time video stream captured by a camera local to a first videoconference endpoint is analyzed according to at least one speaker anticipation model. The speaker anticipation model predicts that a new speaker is about to speak. Video of the anticipated new speaker is sent to the conferencing server in response to a request for the video of the anticipated new speaker from the conferencing server. Video of the anticipated new speaker is then distributed to at least a second videoconference endpoint.
15 Claims
1. A method of anticipating a video switch to accommodate a new speaker in a video conference comprising:
analyzing, by at least one speaker anticipation model, a real time video stream captured by a camera local to a first videoconference endpoint;
predicting, by the at least one speaker anticipation model, that a new speaker is about to speak, the at least one speaker anticipation model trained by a guided learning dataset from historical video feeds derived from a series of labeled video frames from the historical video feeds of the new speaker, each of the labeled video frames including one of two different labels based on audio;
sending video of the new speaker to a conferencing server in response to a request for the video of the new speaker from the conferencing server; and
distributing, via the conferencing server, the video of the new speaker to a second videoconference endpoint, wherein each of the labeled video frames is default-labeled as a pre-speaking frame except for any of the labeled video frames including audio from a same endpoint, and each of the labeled video frames including the audio from the same endpoint is labeled as speaking frames.
View Dependent Claims (2, 3, 4, 5, 6)
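The request/response flow in claim 1 can be sketched as a small simulation: the endpoint's anticipation model flags a likely new speaker, the conferencing server requests that endpoint's video, and the server distributes it to the other endpoints. This is a hypothetical sketch; the class names, the score threshold standing in for the anticipation model, and the return values are all illustrative assumptions.

```python
class ConferencingServer:
    """Hosts the meeting and routes video between endpoints."""

    def __init__(self):
        self.endpoints = []

    def register(self, endpoint):
        self.endpoints.append(endpoint)

    def on_prediction(self, endpoint):
        # In response to the prediction, request the video of the
        # anticipated new speaker from the predicting endpoint...
        video = endpoint.send_video()
        # ...and distribute it to every other (second) endpoint.
        return [(other.name, video)
                for other in self.endpoints if other is not endpoint]


class Endpoint:
    """A videoconference endpoint with a local camera and model."""

    def __init__(self, name, server):
        self.name = name
        self.server = server
        server.register(self)

    def analyze(self, anticipation_score, threshold=0.5):
        # Stand-in for the speaker anticipation model: a score above the
        # threshold counts as "a new speaker is about to speak".
        if anticipation_score > threshold:
            return self.server.on_prediction(self)
        return []

    def send_video(self):
        return f"video-from-{self.name}"
```

In this sketch, only a confident prediction triggers the server's request, which matches the claim's ordering: the endpoint sends video in response to the server's request, not unprompted.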
7. A videoconference system for anticipating a video switch to accommodate a new speaker in a video conference, the system comprising:
a processor;
a storage device in communication with the processor;
a videoconference server in communication with the processor; and
a camera in communication with the processor,
a first videoconference endpoint participating in a multi-endpoint meeting hosted by the videoconference server, the first videoconference endpoint configured, via the storage device and the processor, to:
analyze a real time video stream captured by the camera local to the first videoconference endpoint according to at least one speaker anticipation model;
predict, by the at least one speaker anticipation model, that a new speaker is about to speak to yield a prediction, the at least one speaker anticipation model trained by a guided learning dataset from historical video feeds derived from a series of labeled video frames from the historical video feeds of the new speaker, each of the labeled video frames including one of two different labels based on audio;
send video of the new speaker to the videoconference server in response to a request for the video of the new speaker from the videoconference server; and
distribute the video of the new speaker to a second videoconference endpoint, wherein each of the labeled video frames is default-labeled as a pre-speaking frame except for any of the labeled video frames including audio from a same endpoint, and each of the labeled video frames including the audio from the same endpoint is labeled as speaking frames.
View Dependent Claims (8, 9, 10, 11, 12)
13. A method of anticipating a video switch to accommodate a new speaker in a video conference, the method comprising the steps of:
analyzing a real time video stream captured by a camera local to a first videoconference endpoint according to at least one speaker anticipation model;
predicting, by the at least one speaker anticipation model, that a new speaker is about to speak, the at least one speaker anticipation model trained by a guided learning dataset from historical video feeds derived from a series of labeled video frames from the historical video feeds, each of the labeled video frames including one of two different labels based on audio;
sending video of the new speaker to a conferencing server in response to a request for the video of the new speaker from the conferencing server;
distributing the video of the new speaker to a second videoconference endpoint;
ranking a plurality of signals captured during a real time video conference according to a semantic representation model;
ignoring signals having a ranking below a threshold; and
sending a request for high resolution video to the first videoconference endpoint when the ranking of at least one of the plurality of signals is above the threshold, wherein each of the labeled video frames is default-labeled as a pre-speaking frame except for any of the labeled video frames including audio from a same endpoint, and each of the labeled video frames including the audio from the same endpoint is labeled as speaking frames.
View Dependent Claims (14, 15)
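The ranking steps added in claim 13 can be sketched as a filter: each captured signal is scored by a stand-in for the semantic representation model, signals ranked below the threshold are ignored, and a high-resolution video request is issued when at least one signal ranks above the threshold. The function signature and the use of a plain scoring callable are illustrative assumptions.

```python
def process_signals(signals, rank, threshold):
    """Decide whether to request high resolution video from the endpoint.

    signals: signals captured during the real time video conference.
    rank: callable mapping a signal to a ranking score (stand-in for the
        semantic representation model).
    threshold: rankings at or below this value are ignored.
    """
    ranked = [(signal, rank(signal)) for signal in signals]
    # Ignore signals having a ranking below the threshold.
    kept = [(signal, score) for signal, score in ranked if score > threshold]
    # Request high resolution video when any ranking is above the threshold.
    request_high_res = bool(kept)
    return kept, request_high_res
```

A usage sketch: `process_signals(["nod", "lean-in"], model.score, 0.5)` would return the above-threshold signals along with the flag that drives the server's request to the first videoconference endpoint.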