Catch-up video buffering

US 9,462,230 B1
Filed: 03/31/2014
Issued: 10/04/2016
Est. Priority Date: 03/31/2014
Status: Active Grant

First Claim

Patent Images

1. A method, comprising:

receiving audio-visual (AV) data as part of a video conference call;

detecting a face of a participant of the video conference call using an imaging device associated with a device of the participant;

determining that the participant'"'"'s face is oriented in a direction toward a display associated with the device of the participant by applying image processing to a first image captured by a camera;

outputting live content from the video conference call to the display at substantially a same time as the first content is received based on determining that the participant'"'"'s face is oriented in the direction toward the display;

determining, at a first time, that the participant is no longer observing the conference call based on image processing of a second image captured by the camera failing to detect that the participant is facing the display;

determining, at a second time, that the participant is again observing the conference call by determining that the participant is facing the display, based on image processing of a third image captured by the camera, wherein the second time is after the first time;

storing content from the AV data after the first time;

performing speech-recognition processing on an audio portion of the AV data;

identifying one or more sections of the audio portion of the AV data, the one or more sections comprising one or more of;

silences, pauses, spoken filler words, non-lexical utterances, and false starts;

outputting stored content to the display, wherein the outputting occurs after the second time and wherein the stored content is output at an accelerated rate until the stored content reaches the live content at a third time, and wherein the one or more sections are omitted when the stored content is output at the accelerated rate; and

outputting the live content at a normal rate after the third time.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A system determines if someone watching a live video feed looks or moves away from a display screen, and when their attention is back on the display, provides an accelerated recap of the content that they missed. The video component of the feed may be shown as a series of selected still images or clips from the original feed, while audio and/or text captioning is output at an accelerated rate. The rate may be adaptively adjusted to maintain a consistent speed, and superfluous content may be omitted. When the recap catches up to the live feed, output returns to regular speed.

50 Citations

View as Search Results

18 Claims

1. A method, comprising:
- receiving audio-visual (AV) data as part of a video conference call;
  
  detecting a face of a participant of the video conference call using an imaging device associated with a device of the participant;
  
  determining that the participant'"'"'s face is oriented in a direction toward a display associated with the device of the participant by applying image processing to a first image captured by a camera;
  
  outputting live content from the video conference call to the display at substantially a same time as the first content is received based on determining that the participant'"'"'s face is oriented in the direction toward the display;
  
  determining, at a first time, that the participant is no longer observing the conference call based on image processing of a second image captured by the camera failing to detect that the participant is facing the display;
  
  determining, at a second time, that the participant is again observing the conference call by determining that the participant is facing the display, based on image processing of a third image captured by the camera, wherein the second time is after the first time;
  
  storing content from the AV data after the first time;
  
  performing speech-recognition processing on an audio portion of the AV data;
  
  identifying one or more sections of the audio portion of the AV data, the one or more sections comprising one or more of;
  
  silences, pauses, spoken filler words, non-lexical utterances, and false starts;
  
  outputting stored content to the display, wherein the outputting occurs after the second time and wherein the stored content is output at an accelerated rate until the stored content reaches the live content at a third time, and wherein the one or more sections are omitted when the stored content is output at the accelerated rate; and
  
  outputting the live content at a normal rate after the third time.
- View Dependent Claims (2, 3)
- - 2. The method of claim 1, further comprising:
    - selecting one-or-more frames from a video portion of the AV data received after the first time and before the third time based upon one or more of;
      
      relative maxima of motion activity in comparison to a range of antecedent and succedent video frames,motion activity exceeding a first threshold,relative maxima of color histogram changes in comparison to a range of antecedent and succedent video frames,color histogram changes exceeding a second threshold,appearance of a new object, anda failure to detect a previously detected object; and
      
      sequentially outputting each of the one-or-more selected frames as a still image or video clip via the display while the stored content is output at the accelerated rate,wherein the stored content outputted at the accelerated rate comprises one or more of an audio portion of the AV data or text related to the audio portion of the AV data.
  - 3. The method of claim 1, wherein determining that the participant'"'"'s face is oriented in the direction toward the display comprises:
    - determining a position of the display relative to the camera;
      
      identifying features of the participant'"'"'s face including at least one of a participant'"'"'s eye, eyes, nose, mouth, ear, or ears;
      
      determining a position of the identified features relative to the detected face; and
      
      determining an alignment of the identified feature in comparison to the position of the display.

4. A computing system, comprising:
- at least one processor;
  
  a data buffer; and
  
  at least one memory device including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor to;
  
  receive audio-visual (AV) data;
  
  output live content from the AV data at a first rate;
  
  detect, at a first time, a negative paying attention signature based upon processing of captured sensor data;
  
  determine, at a second time, a positive paying attention signature based upon processing of subsequent captured sensor data, wherein the second time is after the first time;
  
  store content after the first time;
  
  output the stored content after the second time, the stored content being output at an accelerated rate until the stored content coincides with the live content at a third time, wherein the third time is after the second time;
  
  perform speech processing on an audio portion of the AV data; and
  
  determine an adjusted accelerated rate based on the speech processing to maintain a consistent rate of output of the audio portion in terms of words-per-unit-of-time or phonemes-per-unit-of-time.
- View Dependent Claims (5, 6, 7, 8, 9, 10, 11)
- - 5. The computing system of claim 4, wherein the instructions further configure the at least one processor to:
    - output live content at the first rate after the third time.
  - 6. The computing system of claim 4, wherein the instructions further configure the at least one processor to:
    - detect a head using one or more cameras, the one or more cameras capturing the captured sensor data; and
      
      determine an orientation of the head relative to a display used to output the content at the first rate,wherein a state of the paying attention signature is based upon the direction of eyes on the head relative to the display.
  - 7. The computing system of claim 4, wherein the instructions further configure the at least one processor to:
    - monitor a user interface for input, the user interface capturing the captured sensor data,wherein receiving input from the user interface corresponds to a positive paying attention signature.
  - 8. The computing system of claim 4, wherein the instructions further configure the at least one processor to:
    - pause output of live video content between the first time and the second time; and
      
      output live audio content between the first time and the second time.
  - 9. The computing system of claim 4, wherein the instructions further configure the at least one processor to:
    - select one-or-more video frames received after the first time and before the third time based upon one or more of;
      
      relative maxima of motion activity in comparison to a range of antecedent and succedent video frames,motion activity exceeding a first threshold,relative maxima of color histogram changes in comparison to a range of antecedent and succedent video frames,color histogram changes exceeding a second threshold,appearance of a new object, anda failure to detect a previously detected object; and
      
      sequentially output each of the one-or-more selected frames as a still image or video clip via a display while the stored content is output at the accelerated rate,wherein the stored content outputted at the accelerated rate comprises one or more of an audio portion of the AV data or text related to the audio portion of the AV data.
  - 10. The computing system of claim 4, wherein the stored content that is output at the accelerated rate includes an audio portion of the AV data, the instructions further configuring the at least one processor to:
    - identify superfluous parts of the audio portion of the AV data based on the speech processing, the superfluous parts comprising one or more of;
      
      silences, pauses, spoken filler words, and non-lexical utterances; and
      
      omit the identified superfluous parts of the audio portion of the AV data from the stored content that is output at the accelerated rate.
  - 11. The computing system of claim 4, wherein the instructions further configure the at least one processor to:
    - receive input indicating to change the accelerated rate, including input from a first user interface to increase the accelerated rate and input from a second user interface to decrease the accelerated rate.

12. A non-transitory computer-readable storage medium storing processor-executable instructions for controlling a computing device, comprising program code to configure the computing device to:
- receive audio-visual (AV) data;
  
  output live content from the AV data at a first rate;
  
  detect, at a first time, a negative paying attention signature based upon processing of captured sensor data;
  
  determine, at a second time, a positive paying attention signature based upon processing of subsequent captured sensor data, wherein the second time is after the first time;
  
  store content after the first time;
  
  output the stored content after the second time, the stored content being output at an accelerated rate until the stored content coincides with the live content at a third time, wherein the third time is after the second timeselect one-or-more video frames received after the first time and before the third time based upon one or more of;
  
  relative maxima of motion activity in comparison to a range of antecedent and succedent video frames,motion activity exceeding a first threshold,relative maxima of color histogram changes in comparison to a range of antecedent and succedent video frames,color histogram changes exceeding a second threshold,appearance of a new object, anda failure to detect a previously detected object; and
  
  sequentially output each of the one-or-more selected frames as a still image or video clip via a display while the stored content is output at the accelerated rate,wherein the stored content outputted at the accelerated rate comprises one or more of an audio portion of the AV data or text related to the audio portion of the AV data.
- View Dependent Claims (13, 14, 15, 16, 17, 18)
- - 13. The non-transitory computer-readable storage medium of claim 12, wherein the program code further configures the computing device to:
    - output live content at the first rate after the third time.
  - 14. The non-transitory computer-readable storage medium of claim 12, wherein the program code further configures the computing device to:
    - detect a head using one or more cameras, the one or more cameras capturing the captured sensor data; and
      
      determine an orientation of the head relative to a display used to output the content at the first rate,wherein a state of the paying attention signature is based upon the direction of eyes on of the head relative to the display.
  - 15. The non-transitory computer-readable storage medium of claim 12, wherein the program code further configures the computing device to:
    - monitor a user interfaces for input, the user interface capturing the captured sensor data,wherein receiving input from the user interface corresponds to a positive paying attention signature.
  - 16. The non-transitory computer-readable storage medium of claim 12, wherein the stored content that is output at the accelerated rate includes an audio portion of the AV data, and wherein the program code further configures the computing device to:
    - perform speech processing on the audio portion of the AV data;
      
      identify superfluous parts of the audio portion of the AV data based on the speech processing, the superfluous parts comprising one or more of;
      
      silences, pauses, spoken filler words, and non-lexical utterances; and
      
      omit the identified superfluous parts of the audio portion of the AV data from the stored content that is output at the accelerated rate.
  - 17. The non-transitory computer-readable storage medium of claim 12, wherein the program code further configures the computing device to:
    - perform speech processing on an audio portion of the AV data; and
      
      adaptively adjust the accelerated rate based on the speech processing to maintain a consistent rate of output of the audio portion in terms of words-per-unit-of-time or phonemes-per-unit-of-time.
  - 18. The non-transitory computer-readable storage medium of claim 12, wherein the program code further configures the computing device to:
    - receive input indicating to change the accelerated rate, including input from a first user interface to increase the accelerated rate and input from a second user interface to decrease the accelerated rate.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors
Agrawal, Amit Kumar, Gray, Timothy Thomas, Tyagi, Ambrish
Primary Examiner(s)
El-Zoobi, Maria

Application Number

US14/230,047
Time in Patent Office

918 Days
Field of Search

None
US Class Current

1/1
CPC Class Codes

G06Q 50/20   Education

H04N 5/783   Adaptations for reproducing...

H04N 7/155   involving storage of or acc...

Catch-up video buffering

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

50 Citations

18 Claims

Specification

Solutions

Use Cases

Quick Links

Catch-up video buffering

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

50 Citations

18 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links