Combined-media scene tracking for audio-video summarization
Abstract
Techniques are presented for analyzing audio-video segments, usually from multiple sources. A combined similarity measure is determined from text similarities and video similarities. The text and video similarities measure similarity between audio-video scenes for text and video, respectively. The combined similarity measure is then used to determine similar scenes in the audio-video segments. When the audio-video segments are from multiple audio-video sources, the similar scenes are common scenes in the audio-video segments. Similarities may be converted to or measured by distance. Distance matrices may be determined by using the similarity matrices. The text and video distance matrices are normalized before the combined similarity matrix is determined. Clustering is performed using distance values determined from the combined similarity matrix. Resulting clusters are examined and a cluster is considered to represent a common scene between two or more different audio-video segments when scenes in the cluster are similar.
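The pipeline in the abstract (similarity matrices, distance matrices, normalization, combination, clustering, common-scene detection) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented method: the abstract does not give the similarity-to-distance conversion, the normalization, the weighting, or the clustering algorithm, so the choices here (distance as peak similarity minus similarity, division by the largest distance, a weighted average, and single-linkage grouping under a threshold) are placeholders. All function names and the `text_weight` and `threshold` parameters are hypothetical.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def similarity_to_distance(sim: Matrix) -> Matrix:
    """Assumed conversion: distance = largest similarity - similarity."""
    peak = max(max(row) for row in sim)
    return [[peak - s for s in row] for row in sim]


def normalize(dist: Matrix) -> Matrix:
    """Assumed normalization: divide by the largest distance, giving [0, 1]."""
    peak = max(max(row) for row in dist)
    if peak == 0:
        return [[0.0 for _ in row] for row in dist]
    return [[d / peak for d in row] for row in dist]


def combine(text_dist: Matrix, video_dist: Matrix, text_weight: float = 0.5) -> Matrix:
    """Assumed combination: weighted average of the two normalized distances."""
    return [
        [text_weight * t + (1.0 - text_weight) * v for t, v in zip(t_row, v_row)]
        for t_row, v_row in zip(text_dist, video_dist)
    ]


def cluster(dist: Matrix, threshold: float) -> List[List[int]]:
    """Single-linkage grouping: scenes closer than `threshold` share a cluster."""
    n = len(dist)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < threshold:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def common_scenes(clusters: List[List[int]], sources: Sequence[str]) -> List[List[int]]:
    """Keep clusters whose scenes come from more than one audio-video source."""
    return [c for c in clusters if len({sources[i] for i in c}) > 1]


# Three scenes: scenes 0 and 2 come from source "A", scene 1 from source "B".
text_sim = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
video_sim = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.1], [0.0, 0.1, 1.0]]
sources = ["A", "B", "A"]

combined = combine(
    normalize(similarity_to_distance(text_sim)),
    normalize(similarity_to_distance(video_sim)),
)
clusters = cluster(combined, threshold=0.5)
common = common_scenes(clusters, sources)
```

With these toy matrices, scenes 0 and 1 fall into one cluster and come from different sources, so that cluster is reported as a common scene. Scene 2 remains alone.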
19 Claims
1. A method, comprising:
- extracting first text from a first video segment and second text from a second video segment;
- identifying topics within the first text and the second text;
- comparing the first text with the second text to yield a textual comparison;
- generating, via a processor, a contextual similarity measure between the first video segment and the second video segment based on the first video segment, the second video segment, the textual comparison, and the topics;
- generating, via the processor, a visual similarity measure based on video similarity of the first video segment and the second video segment;
- based on the contextual similarity measure and the visual similarity measure, determining a combined similarity score, represented as a combined distance measure, for the first video segment and the second video segment, wherein the contextual similarity measure is converted to a normalized text distance and the visual similarity measure is converted to a normalized video distance; and
- outputting the combined similarity score.
18. A system comprising:
- a processor; and
- a non-transitory computer-readable storage device storing a computer program which, when executed by the processor, causes the processor to perform operations comprising;
  - extracting first text from a first video segment and second text from a second video segment;
  - identifying topics within the first text and the second text;
  - comparing the first text with the second text to yield a textual comparison;
  - generating a contextual similarity measure between the first video segment and the second video segment based on the first video segment, the second video segment, the textual comparison, and the topics;
  - generating a visual similarity measure for the first video segment and the second video segment based on video similarity;
  - based on the contextual similarity measure and the visual similarity measure, determining a combined similarity score, represented as a combined distance measure, for the first video segment and the second video segment, wherein the contextual similarity measure is converted to a normalized text distance and the visual similarity measure is converted to a normalized video distance; and
  - outputting the combined similarity score.
19. A non-transitory computer-readable storage device storing a computer program which, when executed by a processor, causes the processor to perform operations comprising:
- extracting first text from a first video segment and second text from a second video segment;
- identifying topics within the first text and the second text;
- comparing the first text with the second text to yield a textual comparison;
- generating a contextual similarity measure between the first video segment and the second video segment based on the first video segment, the second video segment, the textual comparison, and the topics;
- generating a visual similarity measure based on video similarity of the first video segment and the second video segment;
- based on the contextual similarity measure and the visual similarity measure, determining a combined similarity score, represented as a combined distance measure, for the first video segment and the second video segment, wherein the contextual similarity measure is converted to a normalized text distance and the visual similarity measure is converted to a normalized video distance; and
- outputting the combined similarity score.
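The steps recited in the independent claims can be illustrated for a single pair of segments. This is only a sketch under loud assumptions and not the claimed implementation: the claims do not say how text is extracted, how topics are identified, or how the measures are computed. Here the segment text is assumed to be already available (for example from captions), topics are approximated by the most frequent non-stopword terms, the textual comparison is a cosine similarity over word counts, the visual similarity is a histogram intersection over hypothetical color histograms, and each similarity in [0, 1] is converted to a distance as one minus the similarity. Unlike the claims, this contextual measure uses only the textual comparison and the topics, not the segments themselves. Every name and weight below is hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Sequence, Set

# Tiny illustrative stopword list (an assumption, not from the source).
STOPWORDS = frozenset({"the", "a", "an", "of", "in", "on", "and", "to", "is", "was", "for"})


def tokens(text: str) -> List[str]:
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def topics(text: str, k: int = 3) -> Set[str]:
    """Assumed topic identification: the k most frequent remaining terms."""
    return {word for word, _ in Counter(tokens(text)).most_common(k)}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def contextual_similarity(text1: str, text2: str) -> float:
    """Assumed blend: half textual comparison, half topic overlap (Jaccard)."""
    textual = cosine(Counter(tokens(text1)), Counter(tokens(text2)))
    t1, t2 = topics(text1), topics(text2)
    union = t1 | t2
    overlap = len(t1 & t2) / len(union) if union else 0.0
    return 0.5 * textual + 0.5 * overlap


def visual_similarity(hist1: Sequence[float], hist2: Sequence[float]) -> float:
    """Histogram intersection after scaling each histogram to sum to 1."""
    s1, s2 = sum(hist1), sum(hist2)
    if not s1 or not s2:
        return 0.0
    return sum(min(x / s1, y / s2) for x, y in zip(hist1, hist2))


def combined_distance(text1: str, text2: str,
                      hist1: Sequence[float], hist2: Sequence[float],
                      text_weight: float = 0.5) -> float:
    """Convert both similarities to distances in [0, 1] and average them."""
    text_distance = 1.0 - contextual_similarity(text1, text2)
    video_distance = 1.0 - visual_similarity(hist1, hist2)
    return text_weight * text_distance + (1.0 - text_weight) * video_distance


same = combined_distance("Flood hits the city", "Flood hits the city", [1, 2, 3], [1, 2, 3])
different = combined_distance("flood city", "election vote", [1, 0], [0, 1])
partial = combined_distance("flood hits city centre", "flood damages city roads", [4, 1, 1], [3, 2, 1])
```

Under these assumptions, identical segments give a combined distance of approximately 0, segments with no shared words and no overlapping histogram mass give 1, and partially matching segments fall in between.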
Specification