Creating audio-centric, image-centric, and integrated audio-visual summaries
Abstract
Systems and methods create high-quality audio-centric, image-centric, and integrated audio-visual summaries by seamlessly integrating image, audio, and text features extracted from input video. Integrated summarization may be employed when strict synchronization of audio and image content is not required. Video programming which requires synchronization of the audio content and the image content may be summarized using either an audio-centric or an image-centric approach. Both a machine learning-based approach and an alternative, heuristics-based approach are disclosed. Numerous probabilistic methods may be employed with the machine learning-based approach, such as naïve Bayes, decision trees, neural networks, and maximum entropy. To create an integrated audio-visual summary using the alternative, heuristics-based approach, a maximum-bipartite-matching approach is disclosed by way of example.
184 Citations
78 Claims
1. A method of creating an audio-centric audio-visual summary of a video program, said video program having an audio track and an image track, said method comprising:
selecting a length of time Lsum of said audio-visual summary;
examining said audio track and image track;
identifying one or more audio segments from said audio track based on one or more predetermined audio, image, speech, and text characteristics which relate to desired content of said audio-visual summary, wherein said identifying is performed in accordance with a machine learning method which relies on previously-generated experience-based learning data to provide, for each of said audio segments in said video program, a probability that a given audio segment is suitable for inclusion in said audio-visual summary;
adding said audio segments to said audio-visual summary;
performing said identifying and adding in descending order of said probability until the length of time Lsum is reached; and
selecting only one or more image segments corresponding to the one or more identified audio segments, so as to yield a high degree of synchronization between said one or more audio segments and said one or more image segments.
Dependent Claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39
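The selection loop recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Segment` class, the `prob` field (standing in for the output of some previously trained classifier), and the choice to let the last segment overshoot Lsum rather than trimming it are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the program
    end: float
    prob: float   # assumed classifier output: suitability for the summary

    @property
    def duration(self) -> float:
        return self.end - self.start


def select_audio_centric(segments, l_sum):
    """Add audio segments in descending probability until l_sum is reached."""
    chosen, total = [], 0.0
    for seg in sorted(segments, key=lambda s: s.prob, reverse=True):
        if total >= l_sum:
            break
        chosen.append(seg)
        total += seg.duration  # assumption: the last segment may overshoot l_sum
    # Restore program order for playback.
    chosen.sort(key=lambda s: s.start)
    # Audio-centric: the image track reuses exactly the same time intervals,
    # which keeps audio and images synchronized.
    image_intervals = [(s.start, s.end) for s in chosen]
    return chosen, image_intervals
```

For example, with four 10-second segments and `l_sum = 20`, the two most probable segments are kept and the image intervals simply mirror them.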
12. A method of creating an image-centric audio-visual summary of a video program, said video program having an audio track and an image track, said method comprising:
selecting a length of time Lsum of said audio-visual summary;
examining said image track and audio track of said video program;
identifying one or more image segments from said image track based on one or more predetermined image, audio, speech, and text characteristics which relate to desired content of said audio-visual summary, wherein said identifying is performed in accordance with a machine learning method which relies on previously-generated experience-based learning data to provide, for each of said image segments in said video program, a probability that a given image segment is suitable for inclusion in said audio-visual summary;
adding said one or more image segments to said audio-visual summary;
performing said identifying and adding in descending order of said probability until the length of time Lsum is reached; and
selecting only one or more audio segments corresponding to the one or more identified image segments, so as to yield a high degree of synchronization between said one or more image segments and said one or more audio segments.
26. A method of creating an integrated audio-visual summary of a video program, said video program having an audio track and a video track, said method comprising:
selecting a length of time Lsum of said audio-visual summary;
selecting a minimum playback time Lmin for each of said image segments to be included in the audio-visual summary;
creating an audio summary by selecting one or more desired audio segments until the audio-visual summary length Lsum is reached, said selecting being determined in accordance with a machine learning method which relies on previously-generated experience-based learning data to provide, for each of said audio segments in said video program, a probability that a given audio segment is suitable for inclusion in said audio-visual summary;
computing, for each of said image segments, a probability that a given image segment is suitable for inclusion in said audio-visual summary in accordance with said machine learning method;
for each of said audio segments that are selected, examining a corresponding image segment to see whether a resulting audio segment/image segment pair meets a predefined alignment requirement;
if the resulting audio segment/image segment pair meets the predefined alignment requirement, aligning the audio segment and the image segment in the pair from their respective beginnings for said minimum playback time Lmin to define a first alignment point;
repeating said examining and aligning to identify all of said alignment points;
dividing said length of said audio-visual summary into a plurality of partitions, each of said partitions having a time period either:
starting from a beginning of said audio-visual summary and ending at the first alignment point; or
starting from an end of the image segment at one alignment point, and ending at a next alignment point; or
starting from an end of the image segment at a last alignment point and ending at the end of said audio-visual summary; and
for each of said partitions, adding further image segments in accordance with the following:
identifying a set of image segments that fall into the time period of that partition;
determining a number of image segments that can be inserted into said partition;
determining a length of the identified image segments to be inserted;
selecting said number of the identified image segments in descending order of said probability that a given image segment is suitable for insertion in said audio-visual summary; and
from each of the selected image segments, collecting a section from its respective beginning for said time length and adding all the collected sections in ascending time order into said partition.
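The partitioning and filling steps of claim 26 can be sketched as follows. This is a hedged illustration only: the function names, the `(segment_start, probability)` tuple layout, and the rule that a partition of length T holds at most floor(T / Lmin) image sections of equal length are assumptions, since the claim does not fix how the number and length of inserted segments are determined.

```python
def partitions(alignment_points, l_min, l_sum):
    """Return the (start, end) gaps of the summary timeline that lie
    between aligned image segments, each of which occupies l_min seconds."""
    parts, cursor = [], 0.0
    for t in sorted(alignment_points):
        if t > cursor:
            parts.append((cursor, t))
        cursor = t + l_min  # end of the image segment at this alignment point
    if cursor < l_sum:
        parts.append((cursor, l_sum))
    return parts


def fill_partition(part, candidates, l_min):
    """candidates: list of (segment_start, probability) for image segments
    that fall into this partition. Returns source intervals to insert."""
    start, end = part
    length = end - start
    # Assumption: as many sections as fit while each plays at least l_min.
    n = min(len(candidates), int(length // l_min))
    if n == 0:
        return []
    section = length / n
    # Pick the n most probable segments, then restore ascending time order.
    top = sorted(candidates, key=lambda c: c[1], reverse=True)[:n]
    top.sort(key=lambda c: c[0])
    # Take a section of equal length from the beginning of each segment.
    return [(seg_start, seg_start + section) for seg_start, _ in top]
```

The sketch does not check that a selected image segment is actually as long as the section taken from it; a fuller version would need that check.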
40. A method of creating an audio-centric audio-visual summary of a video program, said video program having an audio track and an image track, said method comprising:
selecting a length of time Lsum of said audio-visual summary;
examining said audio track and image track;
identifying one or more audio segments from said audio track based on one or more predetermined audio, image, speech, and text characteristics which relate to desired content of said audio-visual summary, wherein said identifying is performed in accordance with a predetermined set of heuristic rules to provide, for each of said audio segments in said video program, a ranking so as to determine whether a given audio segment is suitable for inclusion in said audio-visual summary;
adding said audio segments to said audio-visual summary;
performing said identifying and adding in descending order of said ranking of audio segments until the length of time Lsum is reached; and
selecting only one or more image segments corresponding to the one or more identified audio segments, so as to yield a high degree of synchronization between said one or more audio segments and said one or more image segments.
Dependent Claims: 41, 42, 43, 44, 45, 46, 47, 48, 49
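Claim 40 replaces the learned probability with a ranking produced by heuristic rules. A minimal sketch of one way such a ranking could be computed is shown below; the specific rules, weights, and feature names (`has_speech`, `keyword_hits`, `is_silence`) are hypothetical and do not come from the patent.

```python
def heuristic_score(features, rules):
    """Sum the weights of every rule whose predicate fires for a segment."""
    return sum(weight for predicate, weight in rules if predicate(features))


# Hypothetical rule set: (predicate over a feature dict, weight).
RULES = [
    (lambda f: f.get("has_speech", False), 2.0),
    (lambda f: f.get("keyword_hits", 0) > 0, 1.5),
    (lambda f: f.get("is_silence", False), -3.0),
]


def rank_and_select(segments, rules, l_sum):
    """segments: list of (start, end, features). Returns chosen (start, end)
    intervals in program order, added in descending rank until l_sum is reached."""
    ranked = sorted(segments, key=lambda s: heuristic_score(s[2], rules), reverse=True)
    chosen, total = [], 0.0
    for start, end, _ in ranked:
        if total >= l_sum:
            break
        chosen.append((start, end))
        total += end - start
    return sorted(chosen)
```

Keeping the rules as data rather than hard-coded branches makes it easy to swap in a different rule set without touching the selection loop.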
50. A method of creating an image-centric audio-visual summary of a video program, said video program having an audio track and an image track, said method comprising:
selecting a length of time Lsum of said summary;
examining said image track and audio track;
identifying one or more image segments from said image track based on one or more predetermined image, audio, speech, and text characteristics which relate to desired content of said audio-visual summary, wherein said identifying is performed in accordance with a predetermined set of heuristic rules to provide, for each of said image segments in said video program, a ranking so as to determine whether a given image segment is suitable for inclusion in said audio-visual summary;
adding said one or more image segments to said audio-visual summary;
performing said identifying and adding in descending order of said ranking until the length of time Lsum is reached; and
selecting only one or more audio segments corresponding to the one or more identified image segments, so as to yield a high degree of synchronization between said one or more image segments and said one or more audio segments.
Dependent Claims: 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
64. A method of creating an integrated audio-visual summary of a video program, said video program having an audio track and a video track, said method comprising:
selecting a length Lsum of said audio-visual summary;
selecting a minimum playback time Lmin for each of a plurality of image segments to be included in the audio-visual summary;
creating an audio summary by selecting one or more desired audio segments, said selecting being determined in accordance with a predetermined set of heuristic rules to provide, for each of said audio segments in said video program, a ranking to determine whether a given audio segment is suitable for inclusion in said video summary;
performing said selecting in descending order of said ranking of audio segments until said audio-visual summary length is reached;
grouping said image segments of said video program into a plurality of frame clusters based on a visual similarity and a dynamic level of said image segments, wherein each frame cluster comprises at least one of said image segments, with all the image segments within a given frame cluster being visually similar to one another;
for each of said audio segments that are selected, examining a corresponding image segment to see whether a resulting audio segment/image segment pair meets a predefined alignment requirement;
if the resulting audio segment/image segment pair meets the predefined alignment requirement, aligning the audio segment and the image segment in the pair from their respective beginnings for said minimum playback time Lmin to define a first alignment point;
repeating said examining and aligning to identify all of said alignment points;
dividing said length of said audio-visual summary into a plurality of partitions, each of said partitions having a time period either:
starting from a beginning of said audio-visual summary and ending at the first alignment point; or
starting from an end of the image segment at one alignment point, and ending at a next alignment point; or
starting from an end of the image segment at a last alignment point and ending at the end of said audio-visual summary; and
dividing each of said partitions into a plurality of time slots, each of said time slots having a length equal to said minimum playback time Lmin;
assigning said frame clusters to fill said time slots of each of said partitions based on the following:
assigning each frame cluster to only one time slot; and
maintaining a time order of all image segments in the audio-visual summary;
wherein said assigning said frame clusters to fill said time slots is performed in accordance with a best matching between said frame clusters and said time slots.
Dependent Claims: 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78
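The abstract names maximum bipartite matching as an example of the "best matching" between frame clusters and time slots. The sketch below uses the standard augmenting-path algorithm for maximum bipartite matching; the `edges` input (which slots each cluster is allowed to fill) is an assumed representation, and the time-order constraint recited in the claim is not enforced here and would need a separate step.

```python
def max_bipartite_matching(edges, n_slots):
    """edges[c] lists the slot indices that cluster c may fill.
    Returns {slot: cluster}, with each cluster in at most one slot."""
    slot_owner = [None] * n_slots

    def try_assign(cluster, seen):
        # Try each admissible slot; if it is taken, try to move its
        # current owner to another slot (an augmenting path).
        for slot in edges[cluster]:
            if slot in seen:
                continue
            seen.add(slot)
            if slot_owner[slot] is None or try_assign(slot_owner[slot], seen):
                slot_owner[slot] = cluster
                return True
        return False

    for cluster in range(len(edges)):
        try_assign(cluster, set())
    return {s: c for s, c in enumerate(slot_owner) if c is not None}
```

In this representation, an edge from cluster `c` to slot `s` might mean that the cluster contains an image segment whose time falls inside the partition holding that slot; that interpretation is an assumption of the sketch.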
Specification