Method and apparatus for generating a condensed version of a video sequence including desired affordances

US 6,560,281 B1
Filed: 02/24/1998
Issued: 05/06/2003
Est. Priority Date: 02/24/1998
Status: Expired due to Fees

First Claim

1. A method for generating a condensed version of a video sequence suitable for publication as an annotated video comprising steps of:

storing the video sequence as a set of image frames;

stabilizing the image frames into a warped sequence of distinct and stationary scene changes wherein each scene change is comprised of an associated subset of the image frames;

generating a key frame for each scene change representative of the associated subset including generating a template image frame from the associated subset by median filtering of the associated subset and matching the template image to a closest one of the associated subset, wherein the closest one comprises the key frame;

comparing the key frame with the associated subset for identifying image frames including desired affordances; and

compiling the condensed version to comprise a set of key frames and desired affordance images.

8 Assignments

0 Petitions

Abstract

A method and apparatus analyze and annotate a technical talk typically illustrated with overhead slides, wherein the slides are recorded in a video sequence. The video sequence is condensed and digested into key video frames adaptable for annotation to time and audio sequence. The system comprises a recorder for recording a technical talk as a sequential set of video image frames. A stabilizing processor segregates the video image frames into a plurality of associated subsets, each corresponding to a distinct slide displayed at the talk, and median filters the subsets to generate a key frame representative of each of the subsets. A comparator compares the key frame with the associated subset to identify differences between the key frame and the associated subset, which comprise nuisances and affordances. A gesture recognizer locates, tracks and recognizes gestures occurring in the subset as gesture affordances and identifies a gesture video frame representative of the gesture affordance. An integrator compiles the key frames and gesture video frames as a digest of the video image frames, which can also be annotated with the time and audio sequence.
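
The comparator step described above can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists of integer intensities, and it only flags frames that differ noticeably from the key frame; separating nuisances from affordances would need the gesture recognizer, which is not modeled here. The function names and both thresholds (`pixel_threshold`, `min_changed`) are hypothetical and do not come from the patent.

```python
Frame = list[list[int]]  # grayscale image: rows of pixel intensities (assumed layout)


def changed_pixel_count(key_frame: Frame, frame: Frame, pixel_threshold: int = 30) -> int:
    """Count pixels whose absolute difference from the key frame exceeds the threshold."""
    return sum(
        1
        for key_row, row in zip(key_frame, frame)
        for k, p in zip(key_row, row)
        if abs(k - p) > pixel_threshold
    )


def flag_candidate_frames(
    key_frame: Frame,
    subset: list[Frame],
    pixel_threshold: int = 30,  # hypothetical per-pixel tolerance
    min_changed: int = 2,       # hypothetical minimum number of changed pixels
) -> list[int]:
    """Return indices of frames in the subset that differ enough to merit inspection."""
    return [
        i
        for i, frame in enumerate(subset)
        if changed_pixel_count(key_frame, frame, pixel_threshold) >= min_changed
    ]
```

Small differences, such as sensor noise, fall under the per-pixel tolerance and are ignored, while a hand or pointer covering part of the slide changes many pixels at once and causes the frame to be flagged.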

17 Claims

1. A method for generating a condensed version of a video sequence suitable for publication as an annotated video comprising steps of:
- storing the video sequence as a set of image frames;
  
  stabilizing the image frames into a warped sequence of distinct and stationary scene changes wherein each scene change is comprised of an associated subset of the image frames;
  
  generating a key frame for each scene change representative of the associated subset including generating a template image frame from the associated subset by median filtering of the associated subset and matching the template image to a closest one of the associated subset, wherein the closest one comprises the key frame;
  
  comparing the key frame with the associated subset for identifying image frames including desired affordances; and
  
  compiling the condensed version to comprise a set of key frames and desired affordance images.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The method as defined in claim 1 wherein the stabilizing comprises analyzing every two consecutive frames in the video sequence for estimating a global image motion between said consecutive frames.
  - 3. The method as defined in claim 2 wherein the estimating comprises a robust regression method.
  - 4. The method as defined in claim 3 wherein the estimating includes generating an error computation in excess of a predetermined limit and identifying the excessive error computation as one of the scene changes.
  - 5. The method as defined in claim 1 wherein said comparing comprises computing pixel-wise differences between the key frame and the associated subset.
  - 6. The method as defined in claim 5 wherein a one of the desired affordances comprises a pointing gesture and said computing the pixel-wise differences includes detecting an active boundary contour in the associated subset representative of the pointing gesture.
  - 7. The method as defined in claim 5 wherein said computing the pixel-wise differences comprises defining a predetermined vocabulary of the desired affordances.
  - 8. The method as defined in claim 1 wherein the desired affordances include a pointing gesture and the identifying comprises identifying the pointing gesture by forming a boundary edge for an image and detecting a contour in the boundary edge representative of the pointing gesture.
  - 9. The method as defined in claim 8 wherein detecting the contour includes computing a distance from an associated boundary edge position at the contour and an innermost position of the contour, a line of the distance being detected as a pointing direction of the pointing gesture.
  - 10. The method as defined in claim 1 further including deleting redundant image frames and nuisance variations from the video sequence.
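
The key-frame step of claim 1 (a median-filtered template matched to the closest frame of the subset) might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: frames are assumed to be already stabilized grayscale images stored as nested lists, and the sum of absolute differences used in `frame_distance` is an assumed distance measure, since the claim does not name one. All function names are hypothetical.

```python
from statistics import median

Frame = list[list[float]]  # grayscale image: rows of pixel intensities (assumed layout)


def median_template(frames: list[Frame]) -> Frame:
    """Build a template image by taking the per-pixel median across the subset."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [median(f[r][c] for f in frames) for c in range(cols)]
        for r in range(rows)
    ]


def frame_distance(a: Frame, b: Frame) -> float:
    """Sum of absolute pixel differences (assumed metric, not specified by the claim)."""
    return sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def select_key_frame(frames: list[Frame]) -> int:
    """Return the index of the frame in the subset closest to the median template."""
    template = median_template(frames)
    return min(range(len(frames)), key=lambda i: frame_distance(frames[i], template))
```

Because the median suppresses content that appears in only a minority of frames, a hand that briefly covers part of a slide tends not to appear in the template, so the selected key frame tends to be an unobstructed view of the slide.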

11. A method for forming a digested compilation of a sequential set of data frames including deleting redundant adjacent frames and nuisance variations therein, comprising:
- generating a warped sequence of the sequential set wherein significant changes in the sequential set are detected for segregating the sequential set into associated subsets;
  
  detecting a key frame representative of each of the associated subsets;
  
  comparing each key frame with each associated subset for detecting the nuisance variations and data frames having desired affordances, the detecting the nuisance variations including a word-wise comparison between the key frame and the associated subset; and
  
  integrating the key frames with the data frames having the desired affordances and thereby deleting the redundant data frames and the data frames having the nuisance variations, to form the digested compilation.
- Dependent Claims (12)
  - 12. The method as defined in claim 11 wherein the generating the warped sequence comprises detecting the significant changes by an error estimation calculation wherein the significant changes correspond to an error calculation greater than a predetermined limit.
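
The thresholding idea in claim 12 can be sketched as below. Note the substitution: the claims derive the error from a global motion estimate between consecutive frames, whereas this sketch uses the plain mean absolute difference between consecutive frames as a stand-in, which amounts to assuming zero camera motion. The names `mean_abs_error`, `segment_by_error`, and `error_limit` are hypothetical.

```python
Frame = list[list[int]]  # grayscale image: rows of pixel intensities (assumed layout)


def mean_abs_error(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference; a stand-in for the motion-estimation error."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def segment_by_error(frames: list[Frame], error_limit: float) -> list[list[int]]:
    """Split frame indices into subsets, starting a new subset when the error exceeds the limit."""
    if not frames:
        return []
    subsets = [[0]]
    for i in range(1, len(frames)):
        if mean_abs_error(frames[i - 1], frames[i]) > error_limit:
            subsets.append([i])    # significant change: begin a new subset
        else:
            subsets[-1].append(i)  # same scene: extend the current subset
    return subsets
```

Each returned subset is a run of consecutive frame indices, which corresponds to the associated subsets from which a key frame is later detected.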

13. A system for analyzing and annotating a technical talk recorded as a video and audio sequence for condensing the video sequence into a digest of key video frames annotated to time and the audio sequence, comprising:
- a recorder for recording the technical talk as a sequential set of video image frames;
  
  a stabilizing processor for segregating the video image frames into a plurality of associated subsets each corresponding to a distinct slide displayed at the talk and for median filtering of the subsets for generating a key frame representative of the subsets;
  
  a comparator for comparing the key frame with the associated subset to identify differences between the key frame and associated subset comprising nuisances and affordances;
  
  a gesture recognizer for locating, tracking and recognizing gestures occurring in the subset as a gesture affordance and for identifying a gesture video frame representative of the gesture affordance, wherein the gesture recognizer includes a vocabulary of desired gesture affordances and a comparator for comparing the subset with the key frame and matching the affordances located thereby with the vocabulary; and
  
  an integrator for compiling the key frames and the gesture video frames as a digest of the video image frames and for annotating the digest with the time and the audio sequence.
- Dependent Claims (14, 15, 16, 17)
  - 14. The system as defined in claim 13 wherein the stabilizing processor comprises means for analyzing every two consecutive frames in the video sequence for estimating a global image motion between said consecutive frames and for identifying the distinct slide by an estimation error in excess of a predetermined limit.
  - 15. The system as defined in claim 13 wherein desired gesture affordances comprise a pointing gesture.
  - 16. The system as defined in claim 13 wherein desired gesture affordances comprise writing or revealing gestures.
  - 17. The system as defined in claim 13 further including means for accessing the technical talk from a web page.
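
The integrator of claim 13 compiles key frames and gesture frames into a digest annotated with time. A minimal sketch of such a data structure follows, assuming that frames are identified by their index in the recording and that time is derived from a constant frame rate. The `DigestEntry` type, the `compile_digest` function, and the default of 30 frames per second are all hypothetical; audio annotation is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DigestEntry:
    frame_index: int     # position of the frame in the original recording
    time_seconds: float  # timestamp derived from the assumed frame rate
    kind: str            # "key" or "gesture"


def compile_digest(
    key_frames: list[int],
    gesture_frames: list[int],
    fps: float = 30.0,  # assumed constant frame rate
) -> list[DigestEntry]:
    """Merge key-frame and gesture-frame indices into one time-ordered digest."""
    entries = [DigestEntry(i, i / fps, "key") for i in key_frames]
    entries += [DigestEntry(i, i / fps, "gesture") for i in gesture_frames]
    return sorted(entries, key=lambda e: (e.frame_index, e.kind))
```

Keeping the timestamp on each entry is what would let a viewer jump from a digest frame to the matching point in the audio track.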

Current Assignee
Xerox Corporation (Xerox Holdings Corp.)
Original Assignee
Xerox Corporation (Xerox Holdings Corp.)
Inventors
Kimber, Donald G.; Black, Michael J.; Ju, Xuan; Minneman, Scott
Primary Examiner(s)
Kelley, Chris
Assistant Examiner(s)
An, Shawn S.

Application Number

US09/028,548
Time in Patent Office

1,897 Days
Field of Search

375/240, 386/96, 386/97, 386/106, 348/552, 348/169-172, 348/700, 348/699, 395/333, 395/334, 345/158, 345/156, 345/775, 382/103
US Class Current

375/240
CPC Class Codes

G06F 16/739   in form of a video summary,...

G06F 16/784   the detected or recognised ...

G06F 16/786   using motion, e.g. object m...

G06F 18/28   Determining representative ...
