Video mixing using video speech detection
First Claim
1. A method for video conferencing, comprising:
receiving, at one or more computers from at least some remote clients from a plurality of remote clients, information representing a plurality of encoded media frames;
receiving, at the one or more computers from at least some of the plurality of remote clients, a plurality of video-based speech activity signals each associated with a respective media frame from the plurality of encoded media frames, wherein each video-based speech activity signal is a value indicative of whether the respective media frame is associated with a participant who is currently speaking;
selecting, at the one or more computers, at least some media frames from the plurality of encoded media frames based on the video-based speech activity signals;
decoding the selected media frames;
generating a mixed media stream by combining the decoded media frames; and
transmitting, from the one or more computers to at least some remote clients from the plurality of remote clients, the mixed media stream.
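The selecting step of the claim can be illustrated with a minimal sketch. The claim only requires that each signal be "a value indicative of whether" the participant is speaking; the numeric range from 0.0 to 1.0, the 0.5 threshold, and the names `EncodedFrame`, `client_id`, and `select_frames` are assumptions introduced here for illustration and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EncodedFrame:
    """One encoded media frame plus the speech activity signal sent with it."""
    client_id: str          # hypothetical identifier for the sending client
    payload: bytes          # encoded frame data, treated as opaque here
    speech_activity: float  # assumed range: 0.0 (silent) to 1.0 (speaking)


def select_frames(frames: List[EncodedFrame], threshold: float = 0.5) -> List[EncodedFrame]:
    """Keep only frames whose signal suggests the participant is speaking."""
    return [f for f in frames if f.speech_activity >= threshold]


frames = [
    EncodedFrame("alice", b"\x00\x01", 0.9),
    EncodedFrame("bob", b"\x00\x02", 0.1),
    EncodedFrame("carol", b"\x00\x03", 0.7),
]
selected = select_frames(frames)
print([f.client_id for f in selected])  # prints ['alice', 'carol']
```

Because selection reads only the signal value, it can run before any frame is decoded, which is the ordering the claim recites.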
Abstract
A method for video conferencing includes receiving, at one or more computers from at least some remote clients from a plurality of remote clients, information representing a plurality of media frames. The method also includes receiving, at the one or more computers from at least some of the plurality of remote clients, a plurality of video-based speech activity signals each associated with a respective media frame from the plurality of media frames. The method further includes selecting, at the one or more computers, at least some media frames from the plurality of media frames based on the video-based speech activity signals, decoding the selected media frames, generating a mixed media stream by combining the decoded media frames, and transmitting, from the one or more computers to at least some remote clients from the plurality of remote clients, the mixed media stream.
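The decode-and-mix portion of the method summarized above might look roughly like the following sketch. The abstract does not specify a codec or a layout, so everything concrete here is an assumption: `toy_decode` is a stand-in decoder with a made-up byte format, frames are tiny grayscale images, the speech activity signal is simplified to a boolean, and mixing is done by placing frames side by side.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of grayscale pixel values


def toy_decode(payload: bytes) -> Image:
    """Stand-in decoder (assumption): byte 0 is the width, the rest are pixels."""
    width = payload[0]
    pixels = list(payload[1:])
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


def mix_side_by_side(images: List[Image]) -> Image:
    """Combine decoded frames of equal height by placing them left to right."""
    if not images:
        return []
    height = len(images[0])
    if any(len(img) != height for img in images):
        raise ValueError("all frames must have the same height")
    return [sum((img[row] for img in images), []) for row in range(height)]


def mix_selected(frames: List[Tuple[bytes, bool]]) -> Image:
    """Decode only the frames flagged as speaking, then combine them."""
    decoded = [toy_decode(payload) for payload, speaking in frames if speaking]
    return mix_side_by_side(decoded)


received = [
    (bytes([2, 1, 1, 1, 1]), True),   # 2x2 frame, participant speaking
    (bytes([2, 5, 5, 5, 5]), False),  # not speaking, so never decoded
    (bytes([2, 9, 9, 9, 9]), True),
]
print(mix_selected(received))  # prints [[1, 1, 9, 9], [1, 1, 9, 9]]
```

In this sketch the unselected frame is skipped before decoding, so decoding cost scales with the number of selected frames rather than the number of received frames.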
26 Claims
1. A method for video conferencing, comprising:
receiving, at one or more computers from at least some remote clients from a plurality of remote clients, information representing a plurality of encoded media frames;
receiving, at the one or more computers from at least some of the plurality of remote clients, a plurality of video-based speech activity signals each associated with a respective media frame from the plurality of encoded media frames, wherein each video-based speech activity signal is a value indicative of whether the respective media frame is associated with a participant who is currently speaking;
selecting, at the one or more computers, at least some media frames from the plurality of encoded media frames based on the video-based speech activity signals;
decoding the selected media frames;
generating a mixed media stream by combining the decoded media frames; and
transmitting, from the one or more computers to at least some remote clients from the plurality of remote clients, the mixed media stream.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A method for video conferencing, comprising:
receiving, at one or more computers from at least some remote clients from a plurality of remote clients, information representing a plurality of encoded media frames;
receiving, at the one or more computers from at least some of the plurality of remote clients, a plurality of video-based speech activity signals each associated with a respective media frame from the plurality of encoded media frames, wherein each video-based speech activity signal is a value indicative of whether the respective media frame is associated with a participant who is currently speaking;
selecting, at the one or more computers, at least some media frames from the plurality of encoded media frames based on the video-based speech activity signals; and
transmitting, from the one or more computers to at least some remote clients from the plurality of remote clients, the selected media frames without decoding the selected media frames at the one or more computers.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
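Claim 10 differs from claim 1 in that the selected frames are forwarded still encoded. A minimal sketch of that relay behavior follows, assuming a boolean speaking flag and in-memory outboxes in place of a network transport. The names `Frame` and `relay_selected`, and the choice not to echo a client's own frame back to it, are illustrative assumptions rather than anything recited in the claim.

```python
from typing import Dict, List, Tuple

Frame = Tuple[str, bytes, bool]  # (sender id, encoded payload, speaking flag)


def relay_selected(frames: List[Frame], clients: List[str]) -> Dict[str, List[Tuple[str, bytes]]]:
    """Forward encoded payloads of speaking participants, untouched, to clients."""
    # Selection looks only at the flag; the payload bytes are never inspected.
    selected = [(sender, payload) for sender, payload, speaking in frames if speaking]
    outbox: Dict[str, List[Tuple[str, bytes]]] = {}
    for client in clients:
        # Skipping the sender's own frame is a design assumption of this sketch.
        outbox[client] = [(s, p) for s, p in selected if s != client]
    return outbox


frames = [("alice", b"A", True), ("bob", b"B", False)]
outbox = relay_selected(frames, ["alice", "bob"])
print(outbox)  # prints {'alice': [], 'bob': [('alice', b'A')]}
```

The payloads that reach each outbox are the same byte strings that were received, which is what "without decoding" amounts to in this simplified model.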
19. A video conferencing apparatus, comprising:
one or more computers configured to:
receive, from at least some remote clients from a plurality of remote clients, information representing a plurality of encoded media frames;
receive, from at least some of the plurality of remote clients, a plurality of video-based speech activity signals each associated with a respective encoded media frame from the plurality of media frames, wherein each video-based speech activity signal is a value indicative of whether the respective encoded media frame is associated with a participant who is currently speaking;
select, at the one or more computers, at least some media frames from the plurality of media frames based on the video-based speech activity signals;
decode the selected media frames;
generate a mixed media stream by combining the decoded media frames; and
transmit, from the one or more computers to at least some remote clients from the plurality of remote clients, the mixed media stream.
Dependent claims: 20, 21, 22
23. A non-transitory computer readable medium including program instructions executable by one or more processors that, when executed, cause the one or more processors to perform operations, the operations comprising:
receiving, at one or more computers from at least some remote clients from a plurality of remote clients, information representing a plurality of encoded media frames;
receiving, at the one or more computers from at least some of the plurality of remote clients, a plurality of video-based speech activity signals each associated with a respective encoded media frame from the plurality of media frames, wherein each video-based speech activity signal is a value indicative of whether the respective encoded media frame is associated with a participant who is currently speaking;
selecting, at the one or more computers, at least some media frames from the plurality of media frames based on the video-based speech activity signals;
decoding the selected media frames;
generating a mixed media stream by combining the decoded media frames; and
transmitting, from the one or more computers to at least some remote clients from the plurality of remote clients, the mixed media stream.
Dependent claims: 24, 25, 26
Specification