Video mixing using video speech detection

US 8,782,271 B1
Filed: 03/19/2012
Issued: 07/15/2014
Est. Priority Date: 03/19/2012
Status: Active Grant

Abstract

A method for video conferencing includes receiving, at one or more computers from at least some remote clients from a plurality of remote clients, information representing a plurality of media frames. The method also includes receiving, at the one or more computers from at least some of the plurality of remote clients, a plurality of video-based speech activity signals each associated with a respective media frame from the plurality of media frames. The method further includes selecting, at the one or more computers, at least some media frames from the plurality of media frames based on the video-based speech activity signals, decoding the selected media frames, generating a mixed media stream by combining the decoded media frames, and transmitting, from the one or more computers to at least some remote clients from the plurality of remote clients, the mixed media stream.
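
The select, decode, mix sequence described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `EncodedFrame` type, the 0.5 threshold, the limit of two streams, and the placeholder `decode` and `mix` functions that stand in for a real codec and compositor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedFrame:
    client_id: str       # remote client that sent the frame
    payload: bytes       # encoded media data, treated as opaque here
    speech_score: float  # video-based speech activity signal, assumed in [0, 1]


def select_frames(frames, threshold=0.5, max_streams=2):
    """Keep frames whose signal suggests an active speaker, highest score first."""
    active = [f for f in frames if f.speech_score >= threshold]
    active.sort(key=lambda f: f.speech_score, reverse=True)
    return active[:max_streams]


def decode(frame):
    """Placeholder decoder: a real system would invoke a video codec here."""
    return frame.payload.decode("ascii")


def mix(decoded_frames):
    """Placeholder mixer: joins the decoded frames into one composite string."""
    return "|".join(decoded_frames)


def conference_step(frames):
    """Select by speech activity, decode only the selection, then mix."""
    selected = select_frames(frames)
    return mix(decode(f) for f in selected)


frames = [
    EncodedFrame("alice", b"A1", 0.9),
    EncodedFrame("bob", b"B1", 0.1),
    EncodedFrame("carol", b"C1", 0.7),
]
mixed = conference_step(frames)
print(mixed)  # A1|C1
```

Because selection happens before decoding, the server in this sketch never decodes the frame from the participant judged not to be speaking.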

26 Claims

1. A method for video conferencing, comprising:
- receiving, at one or more computers from at least some remote clients from a plurality of remote clients, information representing a plurality of encoded media frames;
  
  receiving, at the one or more computers from at least some of the plurality of remote clients, a plurality of video-based speech activity signals each associated with a respective media frame from the plurality of encoded media frames, wherein each video-based speech activity signal is a value indicative of whether the respective media frame is associated with a participant who is currently speaking;
  
  selecting, at the one or more computers, at least some media frames from the plurality of encoded media frames based on the video-based speech activity signals;
  
  decoding the selected media frames;
  
  generating a mixed media stream by combining the decoded media frames; and
  
  transmitting, from the one or more computers to at least some remote clients from the plurality of remote clients, the mixed media stream.
- Dependent claims:
  - 2. The method for video conferencing of claim 1, wherein the information representing the plurality of encoded media frames includes video packets.
  - 3. The method for video conferencing of claim 2, wherein the video packets are Real-Time Transport Protocol (RTP) packets.
  - 4. The method for video conferencing of claim 2, wherein the video-based speech activity signals are each included in an extended packet header of one or more of the video packets.
  - 5. The method for video conferencing of claim 1, wherein the video-based speech activity signals are generated at respective ones of the remote clients based at least in part on a video component of the respective encoded media frame.
  - 6. The method for video conferencing of claim 1, wherein the video-based speech activity signals are generated at respective ones of the remote clients based at least in part on a video component of the respective encoded media frame using lip motion analysis.
  - 7. The method for video conferencing of claim 1, wherein the video-based speech activity signals are generated at respective ones of the remote clients based at least in part on a video component of the respective encoded media frame based on hand motion.
  - 8. The method for video conferencing of claim 1, wherein the video-based speech activity signals are generated at respective ones of the remote clients based at least in part on a video component of the respective media frame using lip motion analysis and based at least in part on an audio component of the respective encoded media frame using voice activity detection.
  - 9. The method for video conferencing of claim 1, wherein the value is at least one of a probability, a numeric value, or a Boolean value.
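
Claims 3 and 4 place the speech activity signal in an extended header of an RTP video packet. The sketch below shows one way that could look, using the generic header extension layout from RFC 3550 (X bit, 16-bit profile-defined identifier, 16-bit length in 32-bit words). The identifier `0x5341`, the single-byte 0 to 255 speech level, and the function names are hypothetical; the claims do not specify a wire format.

```python
import struct

SPEECH_EXT_PROFILE = 0x5341  # hypothetical "defined by profile" value, not from the patent


def build_packet(seq, timestamp, ssrc, speech_level, payload, payload_type=96):
    """Build an RTP packet with one 32-bit extension word holding speech_level (0-255)."""
    first = (2 << 6) | 0x10  # version 2, X (extension) bit set, no padding, no CSRCs
    header = struct.pack("!BBHII", first, payload_type & 0x7F, seq, timestamp, ssrc)
    # Extension header: profile id, length in 32-bit words, then 1 data byte + 3 pad bytes.
    ext = struct.pack("!HHB3x", SPEECH_EXT_PROFILE, 1, speech_level)
    return header + ext + payload


def parse_speech_level(packet):
    """Return the speech level from the header extension, or None if it is absent."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    first = packet[0]
    if first >> 6 != 2:
        raise ValueError("not RTP version 2")
    if not first & 0x10:
        return None  # X bit clear: no header extension present
    offset = 12 + 4 * (first & 0x0F)  # skip the CSRC list, if any
    if len(packet) < offset + 4:
        raise ValueError("truncated extension header")
    profile, words = struct.unpack_from("!HH", packet, offset)
    if profile != SPEECH_EXT_PROFILE or words < 1:
        return None  # some other extension, not the one assumed here
    if len(packet) < offset + 4 + 4 * words:
        raise ValueError("truncated extension data")
    return packet[offset + 4]


pkt = build_packet(seq=1, timestamp=1000, ssrc=0xDEADBEEF, speech_level=200, payload=b"frame")
print(parse_speech_level(pkt))  # 200
```

Reading the signal from the header lets a server decide whether a packet matters before it touches the encoded payload.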

10. A method for video conferencing, comprising:
- receiving, at one or more computers from at least some remote clients from a plurality of remote clients, information representing a plurality of encoded media frames;
  
  receiving, at the one or more computers from at least some of the plurality of remote clients, a plurality of video-based speech activity signals each associated with a respective media frame from the plurality of encoded media frames, wherein each video-based speech activity signal is a value indicative of whether the respective media frame is associated with a participant who is currently speaking;
  
  selecting, at the one or more computers, at least some media frames from the plurality of encoded media frames based on the video-based speech activity signals; and
  
  transmitting, from the one or more computers to at least some remote clients from the plurality of remote clients, the selected media frames without decoding the selected media frames at the one or more computers.
- Dependent claims:
  - 11. The method for video conferencing of claim 10, wherein the information representing the plurality of encoded media frames includes video packets.
  - 12. The method for video conferencing of claim 11, wherein the video packets are Real-Time Transport Protocol (RTP) packets.
  - 13. The method for video conferencing of claim 11, wherein the video-based speech activity signals are each included in an extended packet header of one or more of the video packets.
  - 14. The method for video conferencing of claim 10, wherein the video-based speech activity signals are generated at respective ones of the remote clients based at least in part on a video component of the respective encoded media frame.
  - 15. The method for video conferencing of claim 10, wherein the video-based speech activity signals are generated at respective ones of the remote clients based at least in part on a video component of the respective encoded media frame using lip motion analysis.
  - 16. The method for video conferencing of claim 10, wherein the video-based speech activity signals are generated at respective ones of the remote clients based at least in part on a video component of the respective encoded media frame based on hand motion.
  - 17. The method for video conferencing of claim 10, wherein the video-based speech activity signals are generated at respective ones of the remote clients based at least in part on a video component of the respective encoded media frame using lip motion analysis and based at least in part on an audio component of the respective encoded media frame using voice activity detection.
  - 18. The method for video conferencing of claim 10, wherein the value is at least one of a probability, a numeric value, or a Boolean value.
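
Claim 10 differs from claim 1 in that the selected frames are transmitted without being decoded at the server. A minimal sketch of that relay behavior follows. The tuple layout, the 0.5 threshold, and the choice not to send a client its own frame are assumptions for illustration only.

```python
def forward_selected(frames, recipients, threshold=0.5):
    """Map each recipient to the encoded payloads relayed to it.

    frames: iterable of (client_id, encoded_payload, speech_score) tuples.
    Payloads are passed through as-is; nothing is decoded or re-encoded.
    """
    selected = [(cid, payload) for cid, payload, score in frames if score >= threshold]
    # Assumption: a client is not sent its own frame back.
    return {r: [p for cid, p in selected if cid != r] for r in recipients}


frames = [
    ("alice", b"\x00\x01", 0.8),  # judged to be speaking
    ("bob", b"\x02", 0.2),        # judged to be silent
]
out = forward_selected(frames, ["alice", "bob", "carol"])
print(out)  # {'alice': [], 'bob': [b'\x00\x01'], 'carol': [b'\x00\x01']}
```

The forwarded payload is the same bytes object that arrived, which is the point of the sketch: selection uses only the accompanying signal, so the server can skip codec work entirely.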

19. A video conferencing apparatus, comprising:
- one or more computers configured to:
  
  receive, from at least some remote clients from a plurality of remote clients, information representing a plurality of encoded media frames;
  
  receive, from at least some of the plurality of remote clients, a plurality of video-based speech activity signals each associated with a respective encoded media frame from the plurality of media frames, wherein each video-based speech activity signal is a value indicative of whether the respective encoded media frame is associated with a participant who is currently speaking;
  
  select, at the one or more computers, at least some media frames from the plurality of media frames based on the video-based speech activity signals;
  
  decode the selected media frames;
  
  generate a mixed media stream by combining the decoded media frames; and
  
  transmit, from the one or more computers to at least some remote clients from the plurality of remote clients, the mixed media stream.
- Dependent claims:
  - 20. The video conferencing apparatus of claim 19, wherein the information representing the plurality of media frames includes Real-Time Transport Protocol (RTP) video packets, and the video-based speech activity signals are each included in an extended packet header of one or more of the RTP video packets.
  - 21. The video conferencing apparatus of claim 19, wherein the video-based speech activity signals are generated at respective ones of the remote clients based at least in part on a video component of the respective encoded media frame.
  - 22. The video conferencing apparatus of claim 19, wherein the value is at least one of a probability, a numeric value, or a Boolean value.
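
Claims 9, 18 and 22 allow the signal value to be a probability, a numeric value, or a Boolean value. One way a receiver might reduce all three to a single speaking decision is sketched below. The interpretation of floats as probabilities in [0, 1], of integers as levels compared against 128, and both default thresholds are assumptions rather than anything stated in the claims.

```python
def is_speaking(value, probability_threshold=0.5, numeric_threshold=128):
    """Reduce a speech activity value to a yes/no decision."""
    if isinstance(value, bool):  # checked first because bool is a subclass of int
        return value
    if isinstance(value, float):  # assumed to be a probability
        if not 0.0 <= value <= 1.0:
            raise ValueError("probability must lie in [0, 1]")
        return value >= probability_threshold
    if isinstance(value, int):  # assumed to be a numeric level, e.g. 0-255
        return value >= numeric_threshold
    raise TypeError(f"unsupported signal type: {type(value).__name__}")


print(is_speaking(True), is_speaking(0.7), is_speaking(10))  # True True False
```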

23. A non-transitory computer readable medium including program instructions executable by one or more processors that, when executed, cause the one or more processors to perform operations, the operations comprising:
- receiving, at one or more computers from at least some remote clients from a plurality of remote clients, information representing a plurality of encoded media frames;
  
  receiving, at the one or more computers from at least some of the plurality of remote clients, a plurality of video-based speech activity signals each associated with a respective encoded media frame from the plurality of media frames, wherein each video-based speech activity signal is a value indicative of whether the respective encoded media frame is associated with a participant who is currently speaking;
  
  selecting, at the one or more computers, at least some media frames from the plurality of media frames based on the video-based speech activity signals;
  
  decoding the selected media frames;
  
  generating a mixed media stream by combining the decoded media frames; and
  
  transmitting, from the one or more computers to at least some remote clients from the plurality of remote clients, the mixed media stream.
- Dependent claims:
  - 24. The non-transitory computer readable medium of claim 23, wherein the information representing the plurality of media frames includes Real-Time Transport Protocol (RTP) video packets, and the video-based speech activity signals are each included in an extended packet header of one or more of the RTP video packets.
  - 25. The non-transitory computer readable medium of claim 23, wherein the video-based speech activity signals are generated at respective ones of the remote clients based at least in part on a video component of the respective encoded media frame.
  - 26. The non-transitory computer readable medium of claim 23, wherein the value is at least one of a probability, a numeric value, or a Boolean value.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Ellner, Lars Henrik
Primary Examiner(s): Lim, Krisna
Application Number: US13/423,341
Time in Patent Office: 848 Days
Field of Search: 709/231
US Class Current: 709/231
CPC Class Codes:
H04L 12/1822   Conducting the conference, ...
H04L 65/00   Network arrangements, proto...
H04L 65/4053   without floor control
H04L 65/765   intermediate
H04M 3/567   Multimedia conference systems
H04N 7/147   Communication arrangements,...
H04N 7/15   Conference systems
