System and process for adding high frame-rate current speaker data to a low frame-rate video

US 7,362,350 B2
Filed: 04/30/2004
Issued: 04/22/2008
Est. Priority Date: 04/30/2004
Status: Expired due to Fees

First Claim

1. A computer-implemented process for facilitating the identification of a current speaker in each frame of a low frame-rate video, comprising using a computer to perform the following process actions:

obtaining audio and video of an event having multiple people in attendance;

transmitting the video of the event at a prescribed frame rate to a client computing device;

continuously transmitting the audio of the event to the client computing device;

tracking the movements of the attendees and recording their positions when each video frame is transmitted and their subsequent positions between the transmission of the video frames;

periodically identifying which of the attendees is currently speaking at a rate significantly faster than the prescribed video frame rate;

periodically generating an indicator which comprises the location of the attendee who is currently speaking as depicted in the last-transmitted video frame regardless of their current position;

transmitting each of said indicators to the client computing device for use in highlighting a region in the last-transmitted video frame depicting the attendee at the location specified in the last-transmitted indicator.

Abstract

A system and process for highlighting the current speaker on an on-going basis in each frame of a low frame-rate video of an event having multiple people in attendance, such as a video teleconference, is presented. In general, this is accomplished by periodically identifying an attendee that is currently speaking at a rate substantially faster than the video frame rate, and for each frame of the video updating the frame to highlight the attendee currently speaking. More particularly, an audio/visual (A/V) source provides separate video, audio, and current speaker data streams to a client computing device. The client device then uses these data streams to render and display the video and to periodically update the frame being displayed to highlight the current speaker depicted therein.

33 Claims

1. A computer-implemented process for facilitating the identification of a current speaker in each frame of a low frame-rate video, comprising using a computer to perform the following process actions:
- obtaining audio and video of an event having multiple people in attendance;
  
  transmitting the video of the event at a prescribed frame rate to a client computing device;
  
  continuously transmitting the audio of the event to the client computing device;
  
  tracking the movements of the attendees and recording their positions when each video frame is transmitted and their subsequent positions between the transmission of the video frames;
  
  periodically identifying which of the attendees is currently speaking at a rate significantly faster than the prescribed video frame rate;
  
  periodically generating an indicator which comprises the location of the attendee who is currently speaking as depicted in the last-transmitted video frame regardless of their current position;
  
  transmitting each of said indicators to the client computing device for use in highlighting a region in the last-transmitted video frame depicting the attendee at the location specified in the last-transmitted indicator.
  - 2. The process of claim 1, further comprising the process actions of:
    - each time an indicator is generated which comprises the location of the attendee who is currently speaking, generating a separate indicator for each attendee who is not currently speaking which comprises the location of the non-speaking attendee as depicted in the last-transmitted video frame regardless of the attendee's current position; and
      
      transmitting each of said indicators associated with non-speaking attendees to the client computing device for use in un-highlighting a region in the last-transmitted video frame depicting the non-speaking attendee at the location specified in the indicator whenever that region is highlighted based on a previously transmitted indicator specifying the attendee under consideration was then speaking.
  - 3. The process of claim 2, wherein the process action of periodically generating an indicator which comprises the location of the attendee who is currently speaking, comprises the action of generating said indicator at a prescribed interval.
  - 4. The process of claim 2, wherein the process action of periodically generating an indicator which comprises the location of the attendee who is currently speaking, comprises the actions of:
    - generating an indicator immediately after the generation of each video frame; and
      
      thereafter, generating indicators only when either an attendee who was not speaking at the time the last indicator was generated begins speaking or an attendee that was speaking at the time the last indicator was generated stops speaking.
  - 5. The process of claim 1, wherein the video, audio and indicators are transmitted directly to the client computing device.
  - 6. The process of claim 1, wherein the video, audio and indicators are transmitted to the client computing device via a computer network.
  - 7. The process of claim 2, wherein the process action of transmitting each of said indicators to the client computing device, comprises an action of producing a current speaker data stream which is separate from the audio and video data streams.
  - 8. The process of claim 7, wherein each indicator takes the form of a tuple comprising a location parameter which specifies the location in the last-transmitted video frame where an attendee associated with the tuple is depicted.
  - 9. The process of claim 8, wherein each tuple further comprises a speaker status parameter which specifies if the attendee associated with the tuple is currently speaking or not.
  - 10. The process of claim 8, wherein each tuple further comprises a time parameter which specifies the time the tuple was generated.
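
Claims 2, 4 and 8 to 10 can be read together as a small data model plus an emission policy. The following is a minimal sketch under stated assumptions, not the patented implementation: the names `SpeakerIndicator`, `ChangeDrivenGenerator`, `on_frame` and `on_detection`, the string attendee IDs, and the pixel-coordinate locations are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class SpeakerIndicator:
    """One indicator tuple (claims 8-10). Field names are hypothetical."""
    location: Point    # where the attendee appears in the last-transmitted frame
    speaking: bool     # speaker status parameter
    timestamp: float   # time the tuple was generated


class ChangeDrivenGenerator:
    """Emit indicators after each frame, then only on status changes (claim 4)."""

    def __init__(self) -> None:
        self._last_status: Dict[str, bool] = {}

    def on_frame(self, frame_locations: Dict[str, Point],
                 status: Dict[str, bool], now: float) -> List[SpeakerIndicator]:
        # Right after a video frame is generated: one tuple per attendee,
        # speaking or not (claim 2 covers the non-speaking tuples).
        self._last_status = dict(status)
        return [SpeakerIndicator(frame_locations[a], status[a], now)
                for a in sorted(status)]

    def on_detection(self, frame_locations: Dict[str, Point],
                     status: Dict[str, bool], now: float) -> List[SpeakerIndicator]:
        # At the faster detection rate: emit only for attendees who started
        # or stopped speaking since the last indicator was generated.
        changed = [a for a in sorted(status)
                   if status[a] != self._last_status.get(a)]
        self._last_status = dict(status)
        return [SpeakerIndicator(frame_locations[a], status[a], now)
                for a in changed]
```

In this sketch, repeated detections with an unchanged speaker produce no tuples at all, so the current speaker data stream stays small between frames.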

11. A system for facilitating the identification of a current speaker in each frame of a low frame-rate video, comprising:
- a general purpose computing device;
  
  at least one video camera;
  
  at least one microphone; and
  
  a computer program comprising program modules executable by the computing device, comprising, a video stream creation module which generates a data stream of video frames at a prescribed frame rate, an audio data stream creation module which generates a continuous stream of audio data;
  
  a current speaker detection module which, periodically identifies the current speaker among the persons depicted in each video frame of the video stream at a rate substantially faster than the video frame rate, and tracks the movements of the persons depicted in each video frame between the generation of said frames so as to equate their current location with their original location when the video frame was generated;
  
  a current speaker data module which generates a data stream comprising current speaker indicators, each of which specifies, the location of a person depicted in a video frame associated with the indicator, and whether the person whose location is specified is currently speaking or not.
  - 12. The system of claim 11, wherein the current speaker indicators each further specify the time the indicator was generated.
  - 13. The system of claim 11, wherein the computer program further comprises a video encoder module which encodes the video data stream for transfer to a client computing device or to storage for later transfer to the client computing device.
  - 14. The system of claim 11, wherein the computer program further comprises an audio encoder module which encodes the audio data stream for transfer to a client computing device or to storage for later transfer to the client computing device.
  - 15. The system of claim 11, wherein the computer program further comprises a sound source localization module which uses the audio data stream to identify the location of persons depicted in the video frames of the video data stream.
  - 16. The system of claim 15, wherein the current speaker detection module comprises sub-modules for using the video data stream and person location information generated by the sound source localization module to periodically identify the current speaker among the persons depicted in each video frame of the video stream at a rate substantially faster than the video frame rate, and track the movements of the persons depicted in each video frame between the generation of said frames so as to equate their current location with their original location when the video frame was generated.
  - 17. The system of claim 11, wherein the current speaker data module comprises a tuple generator sub-module which generates the data stream of indicators in the form of tuples, each of which comprises, a location parameter identifying the location of a person depicted in a video frame associated with the tuple as depicted in that frame, a speaker status parameter identifying whether the person whose location is specified by the location parameter of the tuple is currently speaking or not, and a time parameter specifying the time the tuple was generated.
  - 18. The system of claim 11, wherein the computer program further comprises a current speaker data stream compression module which compresses the current speaker data prior to transfer to the client computing device or storage.
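
The tracking requirement in claims 11, 15 and 16 (equating a person's current location with their location when the frame was generated) can be illustrated as follows. This is a sketch under assumptions: the class `FrameAnchoredTracker`, its method names, the string person IDs, and the nearest-neighbour match between a localized sound source and tracked positions are all hypothetical choices; the claims do not prescribe a matching rule.

```python
from typing import Dict, Optional, Tuple

Point = Tuple[int, int]


def _squared_distance(a: Point, b: Point) -> int:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


class FrameAnchoredTracker:
    """Maps a person's current position back to where the last frame shows them."""

    def __init__(self) -> None:
        self._frame_positions: Dict[str, Point] = {}    # positions when the last frame was generated
        self._current_positions: Dict[str, Point] = {}  # positions tracked since that frame

    def on_frame(self, positions: Dict[str, Point]) -> None:
        # A new video frame was generated: snapshot everyone's position.
        self._frame_positions = dict(positions)
        self._current_positions = dict(positions)

    def on_track_update(self, person_id: str, position: Point) -> None:
        # Movement observed between frames.
        self._current_positions[person_id] = position

    def nearest_person(self, sound_location: Point) -> Optional[str]:
        # Assumed matching rule: the tracked person closest to the sound source.
        if not self._current_positions:
            return None
        return min(self._current_positions,
                   key=lambda p: _squared_distance(self._current_positions[p],
                                                   sound_location))

    def indicator_location(self, person_id: str) -> Optional[Point]:
        # The location to put in the indicator: where the person is depicted
        # in the last frame, regardless of where they are now. Returns None
        # for someone who was not present when that frame was generated.
        return self._frame_positions.get(person_id)
```

The point of the snapshot is that the client only has the last frame to draw on, so an indicator carrying the speaker's live position could highlight empty space or the wrong person.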

19. A computer-implemented process for highlighting the current speaker in each frame of a low frame-rate video of an event having multiple people in attendance, comprising using a computer to perform the following process actions:
- obtaining the low frame-rate video of the event;
  
  obtaining a continuous audio stream of the event;
  
  obtaining periodically generated indicators, each of which comprises the location of the attendee who is currently speaking in a last-obtained video frame, wherein said indicators are available at a rate significantly faster than the video frame rate;
  
  for each indicator obtained which relates to the last-obtained video frame, highlighting a region in that video frame based on the location of the current speaker specified in the indicator under consideration, wherein said highlighting visually distinguishes a current speaker from all other attendees depicted in the last-obtained video frame.
  - 20. The process of claim 19, wherein the process action of highlighting a region in the last-obtained video frame based on the location of the current speaker specified in an indicator under consideration for each indicator obtained which relates to the last-obtained video frame, comprises the actions of:
    - determining if the indicator is valid;
      
      whenever it is determined the indicator under consideration is valid, determining if the indicator applies to the last-obtained video frame;
      
      whenever it is determined the indicator under consideration applies to the last-obtained video frame, highlighting a region of the last-obtained video frame which is associated with the location of the current speaker specified in the indicator under consideration.
  - 21. The process of claim 20, wherein the indicators further comprise data specifying the time the indicator was generated, and wherein the process action of determining if an indicator is valid, comprises the actions of:
    - ascertaining if the indicator specifies a generation time that is later than the last indicator considered prior to the current indicator; and
      
      whenever the indicator specifies a generation time that is later than the last indicator considered prior to the current indicator under consideration, designating the current indicator as a valid indicator.
  - 22. The process of claim 20, wherein the frame rate at which the video frames are transmitted is known to the client computing device, and wherein the indicators further comprise data specifying the time the indicator was generated and each video frame comprises a frame number indicating the order in which it was generated in comparison to the other video frames, and wherein the process action of determining if the indicator under consideration applies to the last-obtained video frame, comprises the actions of:
    - ascertaining whether the last-obtained video frame has the frame number of the next expected video frame and whether it was received at or after an expected arrival time based on the known video frame transmission rate, whenever the last-obtained video frame has the frame number of the next expected video frame and was received at or after the expected arrival time, determining whether the generation time of the indicator under consideration is later than the expected arrival time of the last-obtained video frame and prior to the expected arrival time of the next video frame;
      
      whenever it is determined that the generation time of the indicator under consideration is later than the expected arrival time of the last-obtained video frame and prior to the expected arrival time of the next video frame, designating that the indicator under consideration applies to the last-obtained video frame.
  - 23. The process of claim 20, wherein the indicators further comprise data specifying the video frame number to which it applies and each video frame comprises a frame number indicating the order in which it was generated in comparison to the other video frames, and wherein the process action of determining if the indicator under consideration applies to the last-obtained video frame, comprises the actions of:
    - ascertaining whether the last-obtained video frame has a frame number that matches the frame number specified by the indicator; and
      
      whenever it is determined that the last-obtained video frame has a frame number that matches the frame number specified by the indicator, designating that the indicator under consideration applies to the last-obtained video frame.
  - 24. The process of claim 20, further comprising the process actions of:
    - determining if the indicator under consideration applies to the next expected video frame;
      
      whenever the indicator under consideration applies to the next expected video frame, saving the indicator and waiting until the next video frame is obtained;
      
      when the next video frame is obtained, highlighting a region of that video frame which is associated with the location of the current speaker specified in the indicator.
  - 25. The process of claim 24, wherein the frame rate at which the video frames are transmitted is known to the client computing device, and wherein the indicators further comprise data specifying the time the indicator was generated and each video frame comprises a frame number indicating the order in which it was generated in comparison to the other video frames, and wherein the process action of determining if the indicator under consideration applies to the next expected video frame, comprises the actions of:
    - determining whether the generation time of the indicator under consideration is later than the expected arrival time of the next expected video frame and prior to the expected arrival time of the video frame after that;
      
      whenever it is determined that the generation time of the indicator under consideration is later than the expected arrival time of the next expected video frame and prior to the expected arrival time of the video frame after that, designating that the indicator under consideration applies to the next expected video frame.
  - 26. The process of claim 24, wherein the indicators further comprise data specifying the video frame number to which it applies and each video frame comprises a frame number indicating the order in which it was generated in comparison to the other video frames, and wherein the process action of determining if the indicator under consideration applies to the next expected video frame, comprises the actions of:
    - ascertaining whether the frame number specified by the indicator matches an expected frame number of the next expected video frame; and
      
      whenever it is determined that the frame number specified by the indicator matches the expected frame number of the next expected video frame, designating that the indicator under consideration applies to the next expected video frame.
  - 27. The process of claim 19, wherein the process action of highlighting a region in a video frame based on the location of the current speaker specified in an indicator, comprises the actions of:
    - identifying a region in the video frame, which the indicator under consideration is associated with, that has a prescribed size and shape and which has a prescribed geometric relationship to the specified location of the current speaker;
      
      modifying the appearance of all or a part of the region in a prescribed manner so as to visually distinguish a current speaker from all other attendees depicted in the video frame.
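
The client-side checks in claims 21, 22, 24 and 25 can be sketched as a small filter. The sketch assumes the time-based variant (generation times compared against expected frame arrival times) and a clock shared by indicator timestamps and arrival times; it omits the frame-number check of claim 22 and the frame-number variants of claims 23 and 26. The names `Indicator`, `IndicatorFilter` and the returned strings are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Indicator:
    location: Tuple[int, int]
    generated_at: float  # seconds, on the same clock as frame arrival times (assumption)


class IndicatorFilter:
    def __init__(self, frame_interval: float) -> None:
        self.frame_interval = frame_interval          # 1 / known video frame rate
        self._last_time: Optional[float] = None       # generation time of last valid indicator
        self._frame_arrival: Optional[float] = None   # expected arrival of last-obtained frame
        self._pending: List[Indicator] = []           # indicators saved for the next frame

    def on_frame(self, expected_arrival: float) -> List[Indicator]:
        """Register a newly obtained frame; return saved indicators that apply to it."""
        self._frame_arrival = expected_arrival
        end = expected_arrival + self.frame_interval
        ready = [i for i in self._pending if expected_arrival < i.generated_at < end]
        self._pending = []
        return ready

    def on_indicator(self, ind: Indicator) -> str:
        # Claim 21: valid only if generated later than the last indicator considered.
        if self._last_time is not None and ind.generated_at <= self._last_time:
            return "invalid"
        self._last_time = ind.generated_at
        if self._frame_arrival is None:
            return "dropped"
        next_arrival = self._frame_arrival + self.frame_interval
        # Claim 22: generated after the last frame's expected arrival and
        # before the next frame's expected arrival.
        if self._frame_arrival < ind.generated_at < next_arrival:
            return "apply"
        # Claims 24-25: belongs to the next expected frame, so save it and wait.
        if next_arrival < ind.generated_at < next_arrival + self.frame_interval:
            self._pending.append(ind)
            return "deferred"
        return "dropped"
```

A caller would highlight immediately on `"apply"`, and highlight the indicators returned by `on_frame` once the next frame has been obtained.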

28. A system for highlighting the current speaker in each frame of a low frame-rate video of an event having multiple people in attendance, comprising:
- a general purpose computing device;
  
  a computer program comprising program modules executable by the computing device, comprising,
  
  a video input module which obtains the low frame-rate video of the event,
  
  an audio input module which obtains a continuous audio stream of the event,
  
  a current speaker data input module which obtains periodically generated indicators, each of which comprises the location of the attendee depicted in the last-obtained video frame and indicates if that attendee is currently speaking or not, wherein said indicators are available at a rate significantly faster than the video frame rate, and
  
  a highlighting module which highlights a region in the last-obtained video frame that is associated with an attendee that an obtained indicator applicable to the last-obtained video frame specifies is currently speaking, based on the location of the attendee specified in that indicator, wherein said highlighting visually distinguishes a current speaker from all other attendees depicted in the last-obtained video frame that are not currently speaking.
  - 29. The system of claim 28, wherein the highlighting module comprises sub-modules which for each indicator obtained:
    - determines if the indicator is valid;
      
      whenever it is determined the indicator under consideration is valid, determines if the indicator applies to the last-obtained video frame;
      
      whenever it is determined the indicator under consideration applies to the last-obtained video frame, determines from the indicator under consideration if the attendee associated with that indicator is currently speaking or not;
      
      whenever it is determined the attendee associated with the indicator under consideration is speaking, determines if a region of the last-obtained video frame which is associated with the location of the current speaker specified in the indicator under consideration is highlighted or not; and
      
      whenever it is determined the region of the last-obtained video frame which is associated with the location of the current speaker specified in the indicator under consideration is not highlighted, highlights that region in a prescribed manner.
  - 30. The system of claim 29, wherein the highlighting module further comprises sub-modules which for each indicator obtained that it is determined the attendee associated with the indicator is not speaking:
    - determines if a region of the last-obtained video frame which is associated with the location of the non-speaking attendee specified in the indicator under consideration is highlighted or not; and
      
      whenever it is determined the region of the last-obtained video frame which is associated with the location of the non-speaking attendee is highlighted, un-highlights that region.
  - 31. The system of claim 29, which whenever it is determined the indicator under consideration is valid, further comprising the process actions of:
    - determining if the indicator under consideration applies to the next expected video frame;
      
      whenever the indicator under consideration applies to the next expected video frame, saving the indicator and waiting until the next video frame is obtained;
      
      when the next video frame is obtained, determining from the saved indicator if the attendee associated with that indicator is currently speaking or not;
      
      whenever it is determined the attendee associated with the saved indicator is speaking, determining if a region of the newly-obtained video frame which is associated with the location of the current speaker specified in the saved indicator is highlighted or not;
      
      whenever it is determined the region of the newly-obtained video frame which is associated with the location of the current speaker specified in the saved indicator is not highlighted, highlighting that region in a prescribed manner;
      
      whenever it is determined the attendee associated with the saved indicator is not speaking, determining if a region of the newly-obtained video frame which is associated with the location of the non-speaking attendee specified in the saved indicator is highlighted or not; and
      
      whenever it is determined the region of the newly-obtained video frame which is associated with the location of the non-speaking attendee is highlighted, un-highlighting that region.
  - 32. The system of claim 28, wherein the low frame-rate video is encoded, and wherein the computer program further comprises a module for decoding the video prior to the highlighting module processing it.
  - 33. The system of claim 28, wherein the audio is encoded, and wherein the computer program further comprises a module for decoding the audio.
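
The highlight and un-highlight decisions of claims 29 and 30, together with the region of claim 27, can be sketched as a small state machine. Everything concrete here is an assumption for illustration: the rectangle shape, its default size, centering it on the indicated location, and the names `region_for` and `HighlightState`. Actual drawing, and resetting state when a new frame replaces the old one, are left out.

```python
from typing import Set, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # left, top, right, bottom


def region_for(location: Point, width: int = 80, height: int = 100) -> Box:
    """Fixed-size rectangle centered on the location (size and centering are assumptions)."""
    x, y = location
    left, top = x - width // 2, y - height // 2
    return (left, top, left + width, top + height)


class HighlightState:
    """Tracks which regions of the displayed frame are currently highlighted."""

    def __init__(self) -> None:
        self.highlighted: Set[Box] = set()

    def apply(self, location: Point, speaking: bool) -> str:
        """Return the action the renderer should take for one applicable indicator."""
        box = region_for(location)
        if speaking and box not in self.highlighted:
            self.highlighted.add(box)       # claim 29: highlight if not already highlighted
            return "highlight"
        if not speaking and box in self.highlighted:
            self.highlighted.discard(box)   # claim 30: un-highlight a stale region
            return "unhighlight"
        return "no-op"
```

Checking the current state before acting means a stream of identical indicators does not cause repeated redraws of the same region.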

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Cutler, Ross
Primary Examiner(s)
Ramakrishnaiah; Melur

Application Number
US10/837,138
Publication Number
US 20050243166A1
Time in Patent Office
1,453 Days
Field of Search
348/14.01-14.09, 348/14.1, 348/14.11, 348/14.12, 348/14.13, 370/260, 370/261, 709/204, 715/753, 715/755
US Class Current
348/14.12
CPC Class Codes
H04N 7/147 Communication arrangements,...
H04N 7/15 Conference systems