Combined-media scene tracking for audio-video summarization

US 8,872,979 B2
Filed: 05/21/2002
Issued: 10/28/2014
Est. Priority Date: 05/21/2002
Status: Expired due to Fees

Abstract

Techniques are presented for analyzing audio-video segments, usually from multiple sources. A combined similarity measure is determined from text similarities and video similarities. The text and video similarities measure similarity between audio-video scenes for text and video, respectively. The combined similarity measure is then used to determine similar scenes in the audio-video segments. When the audio-video segments are from multiple audio-video sources, the similar scenes are common scenes in the audio-video segments. Similarities may be converted to or measured by distance. Distance matrices may be determined by using the similarity matrices. The text and video distance matrices are normalized before the combined similarity matrix is determined. Clustering is performed using distance values determined from the combined similarity matrix. Resulting clusters are examined and a cluster is considered to represent a common scene between two or more different audio-video segments when scenes in the cluster are similar.
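
The path the abstract describes, from similarity matrices to a combined distance matrix, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the conversion `d = 1 - s`, the scale-by-maximum normalization, the weight `alpha`, and every function name here are hypothetical choices made for the example.

```python
def similarity_to_distance(sim):
    """Convert a similarity matrix with values in [0, 1] to distances.

    Assumption: distance is taken as 1 - similarity.
    """
    return [[1.0 - s for s in row] for row in sim]


def normalize(dist):
    """Scale a distance matrix so that its largest entry equals 1.

    Assumption: the source only says the matrices are normalized,
    not how; dividing by the maximum is one simple option.
    """
    peak = max(max(row) for row in dist)
    if peak == 0:
        return [row[:] for row in dist]
    return [[d / peak for d in row] for row in dist]


def combine_linear(text_dist, video_dist, alpha=0.5):
    """Linearly combine two normalized distance matrices.

    alpha is a hypothetical weight on the text distance.
    """
    return [
        [alpha * t + (1.0 - alpha) * v for t, v in zip(t_row, v_row)]
        for t_row, v_row in zip(text_dist, video_dist)
    ]


# Made-up similarities between three scenes.
text_sim = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
video_sim = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.5], [0.3, 0.5, 1.0]]

text_dist = normalize(similarity_to_distance(text_sim))
video_dist = normalize(similarity_to_distance(video_sim))
combined = combine_linear(text_dist, video_dist)
```

Normalizing each matrix before combining keeps one medium from dominating simply because its raw distances span a wider range.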

19 Claims

1. A method, comprising:
- extracting first text from a first video segment and second text from a second video segment;
  
  identifying topics within the first text and the second text;
  
  comparing the first text with the second text to yield a textual comparison;
  
  generating, via a processor, a contextual similarity measure between the first video segment and the second video segment based on the first video segment, the second video segment, the textual comparison, and the topics;
  
  generating, via the processor, a visual similarity measure based on video similarity of the first video segment and the second video segment;
  
  based on the contextual similarity measure and the visual similarity measure, determining a combined similarity score, represented as a combined distance measure, for the first video segment and the second video segment, wherein the contextual similarity measure is converted to a normalized text distance and the visual similarity measure is converted to a normalized video distance; and
  
  outputting the combined similarity score.
- - 2. The method of claim 1, wherein the first video segment and the second video segment originate from a single audio-video source.
  - 3. The method of claim 1, wherein the first video segment and the second video segment originate from a plurality of separate audio-video sources.
  - 4. The method of claim 3, wherein identifying the topics comprises:
    - (i) comparing the first text with the second text to produce a text similarity value; and
      
      (ii) performing step (i) for each video segment of the plurality of separate audio-video sources with every other video segment of the plurality of separate audio-video sources.
  - 5. The method of claim 4, wherein identifying the topics further comprises:
    - (iii) setting the text similarity value corresponding to a video segment to a predetermined value;
      
      (iv) performing step (iii) until each text similarity value for every video segment has been set to the predetermined value.
  - 6. The method of claim 1, wherein extracting the first text and the second text comprises using text corresponding to the first video segment and the second video segment.
  - 7. The method of claim 1, wherein extracting the first text and the second text comprises obtaining the first text and the second text through speech-to-text conversion.
  - 8. The method of claim 1, wherein extracting the first text and the second text comprises obtaining the first text and the second text by accessing closed captioning data.
  - 9. The method of claim 1, wherein generating the visual similarity measure comprises using video corresponding to the first video segment and the second video segment.
  - 10. The method of claim 1, further comprising:
    - detecting scene changes in the first video segment and the second video segment; and
      
      marking each scene change detected in the first video segment and the second video segment.
  - 11. The method of claim 10, further comprising determining text corresponding to each scene change.
  - 12. The method of claim 1, further comprising determining, for each of the first video segment and the second video segment, an image as a key frame.
  - 13. The method of claim 1, wherein the normalized text distance is stored in a first matrix, the normalized video distance is stored in a second matrix, and the combined distance measure is stored in a third matrix.
  - 14. The method of claim 1, wherein determining the combined similarity score comprises linearly combining the normalized text distance and the normalized video distance to create the combined distance measure.
  - 15. The method of claim 1, wherein determining the combined similarity score comprises combining the normalized text distance and the normalized video distance through a non-linear equation to create the combined distance measure.
  - 16. The method of claim 15, further comprising:
    - clustering video segments based on the visual similarity measure and the contextual similarity measure; and
      
      determining clusters having audio-video scenes corresponding to a plurality of audio-video segments, whereby the audio-video scenes corresponding to a single cluster are considered similar portions of the plurality of audio-video segments.
  - 17. The method of claim 16, wherein clustering the video segments further comprises:
    - assigning each of the audio-video scenes to a unique cluster;
      
      determining, based on the combined distance measure, a minimum inter-cluster distance between two clusters of the clusters;
      
      when the minimum inter-cluster distance is not greater than a predefined distance, merging the two clusters and determining a new minimum inter-cluster distance; and
      
      when the minimum inter-cluster distance is greater than the predefined distance, stopping clustering.

18. A system comprising:
- a processor; and
  
  a non-transitory computer-readable storage device storing a computer program which, when executed by the processor, causes the processor to perform operations comprising:
  
  extracting first text from a first video segment and second text from a second video segment;
  
  identifying topics within the first text and the second text;
  
  comparing the first text with the second text to yield a textual comparison;
  
  generating a contextual similarity measure between the first video segment and the second video segment based on the first video segment, the second video segment, the textual comparison, and the topics;
  
  generating a visual similarity measure for the first video segment and the second video segment based on video similarity;
  
  based on the contextual similarity measure and the visual similarity measure, determining a combined similarity score, represented as a combined distance measure, for the first video segment and the second video segment, wherein the contextual similarity measure is converted to a normalized text distance and the visual similarity measure is converted to a normalized video distance; and
  
  outputting the combined similarity score.

19. A non-transitory computer-readable storage device storing a computer program which, when executed by a processor, causes the processor to perform operations comprising:
- extracting first text from a first video segment and second text from a second video segment;
  
  identifying topics within the first text and the second text;
  
  comparing the first text with the second text to yield a textual comparison;
  
  generating a contextual similarity measure between the first video segment and the second video segment based on the first video segment, the second video segment, the textual comparison, and the topics;
  
  generating a visual similarity measure based on video similarity of the first video segment and the second video segment;
  
  based on the contextual similarity measure and the visual similarity measure, determining a combined similarity score, represented as a combined distance measure, for the first video segment and the second video segment, wherein the contextual similarity measure is converted to a normalized text distance and the visual similarity measure is converted to a normalized video distance; and
  
  outputting the combined similarity score.
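
The step of comparing the text of one segment with the text of every other segment, recited in the claims above, might look like the following. The claims do not name a comparison metric; cosine similarity over word counts is an assumption made purely for illustration, and the function names and sample transcripts are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def text_similarity_matrix(texts):
    """Compare every segment's text with every other segment's text."""
    n = len(texts)
    return [
        [cosine_similarity(texts[i], texts[j]) for j in range(n)]
        for i in range(n)
    ]


# Made-up transcripts for three segments.
texts = [
    "stocks fell sharply today",
    "stocks fell again today",
    "heavy rain expected tomorrow",
]
sim = text_similarity_matrix(texts)
```

The first two transcripts share three of their four words and so score well above the unrelated third one, which shares none.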

Current Assignee: Avaya Incorporated
Original Assignee: Avaya Incorporated
Inventors: Bagga, Amit; Hu, Jianying; Zhong, Jialin
Primary Examiner(s): AN, SHAWN S
Application Number: US10/153,550
Publication Number: US 20030218696A1
Time in Patent Office: 4,543 Days
Field of Search: 348/700, 348/699, 348/907, 348/701, 382/225, 382/226, 382/229, 382/182, 382/221, 382/224, 382/209, 382/183, 382/184, 382/185, 382/176
US Class Current: 348/700
CPC Class Codes

G06F 16/7844: using original textual cont...
G06F 16/785: using colour or luminescence
G06F 18/23: Clustering techniques
G06F 3/018: Input/output arrangements f...
G06V 20/10: Terrestrial scenes scenes u...
G06V 20/41: Higher-level, semantic clus...
G06V 20/49: Segmenting video sequences,...
G06V 30/224: of printed characters havin...
G06V 30/413: Classification of content, ...
G11B 27/28: by using information signal...
H04N 21/4143: embedded in a Personal Comp...
H04N 21/435: Processing of additional da...
H04N 21/4394: involving operations for an...
H04N 21/44008: involving operations for an...
H04N 21/440236: by media transcoding, e.g. ...
H04N 21/4622: Retrieving content or addit...
H04N 21/84: Generation or processing of...
H04N 21/8456: by decomposing the content ...
H04N 21/8549: Creating video summaries, e...
H04N 5/073: for mutually locking plural...
H04N 5/144: Movement detection for vide...
H04N 5/145: Movement estimation for vid...
H04N 5/147: Scene change detection
H04N 5/44504: Circuit details of the addi...
