System and method for semantic video segmentation based on joint audiovisual and text analysis
Abstract
System and method for partitioning a video into a series of semantic units where each semantic unit relates to a generally complete thematic topic. A computer implemented method for partitioning a video into a series of semantic units wherein each semantic unit relates to a theme or a topic, comprises dividing a video into a plurality of homogeneous segments, analyzing audio and visual content of the video, extracting a plurality of keywords from the speech content of each of the plurality of homogeneous segments of the video, and detecting and merging a plurality of groups of semantically related and temporally adjacent homogeneous segments into a series of semantic units in accordance with the results of both the audio and visual analysis and the keyword extraction. The present invention can be applied to generate important table-of-contents as well as index tables for videos to facilitate efficient video topic searching and browsing.
16 Claims
1. A method for partitioning a video sequence, comprising:
dividing a video sequence into a plurality of segments;
generating a transcript of speech content of the video sequence, wherein the transcript comprises a plurality of words and identifies temporal locations of the words in the video sequence;
selecting a plurality of keywords from the plurality of words in the transcript;
selecting a set of keywords from the plurality of keywords, wherein the keywords in the set of keywords are related to each other by meanings of the keywords;
determining a distribution of occurrences across the plurality of segments of the keywords in the set of keywords;
selecting a group of segments from the plurality of segments using the distribution, wherein the segments in the group of segments are temporally adjacent and the group of segments corresponds to a peak of the occurrences across the plurality of segments of the keywords in the set of keywords; and
forming a partition of the video sequence from the group of segments;
wherein generating the transcript of speech content of the video sequence comprises
generating the transcript from audio content of the video sequence using automatic speech recognition,
determining whether the transcript generated from the audio content is satisfactory,
responsive to a determination that the transcript generated from the audio content is not satisfactory, determining whether the video sequence has closed caption, and
responsive to a determination that the video sequence has closed caption, generating the transcript from the closed caption.
Dependent claims: 2, 3, 4, 5
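The keyword-distribution and peak-selection steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `Segment`, `keyword_distribution`, `peak_group`, and the `min_fraction` threshold are hypothetical. The claim only requires that the selected group be temporally adjacent and correspond to a peak, so the rule used here (grow outward from the highest-count segment while neighbours stay above a fraction of the peak) is one assumed way of satisfying it. The set of semantically related keywords is taken as given.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds (exclusive)


def keyword_distribution(segments, timed_words, keyword_set):
    """Count, for each segment, the occurrences of keywords from the related set.

    timed_words is a list of (word, time_in_seconds) pairs, i.e. a transcript
    that records the temporal location of each word.
    """
    counts = [0] * len(segments)
    for word, t in timed_words:
        if word.lower() not in keyword_set:
            continue
        for i, seg in enumerate(segments):
            if seg.start <= t < seg.end:
                counts[i] += 1
                break
    return counts


def peak_group(counts, min_fraction=0.5):
    """Return (first, last) indices of the adjacent run of segments around the peak.

    Assumed rule: start at the segment with the highest count and extend in both
    directions while the neighbouring count is at least min_fraction of the peak.
    Returns None when no keyword occurs at all.
    """
    if not counts or max(counts) == 0:
        return None
    peak = counts.index(max(counts))
    threshold = min_fraction * counts[peak]
    first = last = peak
    while first > 0 and counts[first - 1] >= threshold:
        first -= 1
    while last < len(counts) - 1 and counts[last + 1] >= threshold:
        last += 1
    return first, last


if __name__ == "__main__":
    segments = [Segment(0, 10), Segment(10, 20), Segment(20, 30), Segment(30, 40)]
    transcript = [("hello", 1.0), ("budget", 12.0), ("tax", 14.5), ("budget", 21.0),
                  ("tax", 23.0), ("deficit", 27.0), ("weather", 33.0)]
    related = {"budget", "tax", "deficit"}
    counts = keyword_distribution(segments, transcript, related)
    print(counts)              # [0, 2, 3, 0]
    print(peak_group(counts))  # (1, 2)
```

In this toy input the second and third segments form the group, and a partition would be formed from them.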
6. A method for partitioning a video sequence, comprising:
dividing a video sequence into a plurality of segments;
selecting a group of segments from the plurality of segments, wherein the segments in the group of segments are temporally adjacent;
forming a partition of the video sequence from the group of segments;
denoting an end segment, wherein the end segment is a one of the plurality of segments in the group of segments that is located at an end of the group of segments;
determining whether an audio content of any of the plurality of segments around the end segment includes only music or only silence;
responsive to a determination that the audio content of the any of the plurality of segments around the end segment includes only music or only silence, selecting a one of the any of the plurality of segments around the end segment that includes only music or only silence as a boundary for the partition; and
responsive to a determination that the audio content of the any of the plurality of segments around the end segment does not include only music or only silence, locating a one of the plurality of segments around the end segment having visual content including a narrator shot and selecting the one of the plurality of segments having visual content including a narrator shot as the boundary for the partition.
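The boundary rule of claim 6 can be read as a two-stage lookup, sketched below in Python under stated assumptions. The `SegmentInfo` type, the audio labels, the `window` parameter that defines "around the end segment", and the nearest-first search order are all hypothetical. The claim does not say how audio is classified, how narrator shots are detected, or how wide the neighbourhood is.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SegmentInfo:
    audio_class: str         # assumed labels: "speech", "music", "silence", "mixed"
    has_narrator_shot: bool  # assumed output of a separate visual analysis step


def choose_boundary(segments: Sequence[SegmentInfo], end_index: int,
                    window: int = 2) -> Optional[int]:
    """Pick a boundary segment near the end segment of a group.

    Stage 1: prefer a nearby segment whose audio is only music or only silence.
    Stage 2: otherwise fall back to a nearby segment containing a narrator shot.
    Returns the chosen segment index, or None if neither kind is found.
    """
    lo = max(0, end_index - window)
    hi = min(len(segments) - 1, end_index + window)
    # Assumption: examine closer segments first; ties go to the earlier segment.
    candidates = sorted(range(lo, hi + 1), key=lambda i: (abs(i - end_index), i))
    for i in candidates:
        if segments[i].audio_class in ("music", "silence"):
            return i
    for i in candidates:
        if segments[i].has_narrator_shot:
            return i
    return None
```

For example, with five segments where only index 3 is labelled "silence" and the end segment is index 2, `choose_boundary` returns 3. If no neighbour is music or silence but index 1 has a narrator shot, it returns 1.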
7. An apparatus, comprising:
a processing unit, wherein the processing unit is configured to:
divide a video sequence into a plurality of segments;
generate a transcript of speech content of the video sequence, wherein the transcript comprises a plurality of words and identifies temporal locations of the words in the video sequence;
select a plurality of keywords from the plurality of words in the transcript;
select a set of keywords from the plurality of keywords, wherein the keywords in the set of keywords are related to each other by meanings of the keywords;
determine a distribution of occurrences across the plurality of segments of the keywords in the set of keywords;
select a group of segments from the plurality of segments using the distribution, wherein the segments in the group of segments are temporally adjacent and the group of segments corresponds to a peak of the occurrences across the plurality of segments of the keywords in the set of keywords;
form a partition of the video sequence from the group of segments;
generate the transcript of speech content of the video sequence from audio content of the video sequence using automatic speech recognition;
determine whether the transcript generated from the audio content is satisfactory;
determine whether the video sequence has closed caption, responsive to a determination that the transcript generated from the audio content is not satisfactory; and
generate the transcript from the closed caption, responsive to a determination that the video sequence has closed caption.
Dependent claims: 8, 9, 10
11. An apparatus, comprising:
a processing unit, wherein the processing unit is configured to:
divide a video sequence into a plurality of segments;
select a group of segments from the plurality of segments, wherein the segments in the group of segments are temporally adjacent;
form a partition of the video sequence from the group of segments;
denote an end segment, wherein the end segment is a one of the plurality of segments in the group of segments that is located at an end of the group of segments;
determine whether an audio content of any of the plurality of segments around the end segment includes only music or only silence;
select a one of the any of the plurality of segments around the end segment that includes only music or only silence as a boundary for the partition, responsive to a determination that the audio content of the any of the plurality of segments around the end segment includes only music or only silence;
locate a one of the plurality of segments around the end segment having visual content including a narrator shot; and
select the one of the plurality of segments having visual content including a narrator shot as the boundary for the partition, responsive to a determination that the audio content of the any of the plurality of segments around the end segment does not include only music or only silence.
12. A computer program product for partitioning a video sequence, comprising:
a non-transitory computer readable storage medium;
first program instructions to divide a video sequence into a plurality of segments;
second program instructions to generate a transcript of speech content of the video sequence, wherein the transcript comprises a plurality of words and identifies temporal locations of the words in the video sequence;
third program instructions to select a plurality of keywords from the plurality of words in the transcript;
fourth program instructions to select a set of keywords from the plurality of keywords, wherein the keywords in the set of keywords are related to each other by meanings of the keywords;
fifth program instructions to determine a distribution of occurrences across the plurality of segments of the keywords in the set of keywords;
sixth program instructions to select a group of segments from the plurality of segments using the distribution, wherein the segments in the group of segments are temporally adjacent and the group of segments corresponds to a peak of the occurrences across the plurality of segments of the keywords in the set of keywords;
seventh program instructions to form a partition of the video sequence from the group of segments; and
wherein the first, second, third, fourth, fifth, sixth, and seventh program instructions are stored on the non-transitory computer readable storage medium,
wherein the second program instructions comprise program instructions to:
generate the transcript of speech content of the video sequence from audio content of the video sequence using automatic speech recognition;
determine whether the transcript generated from the audio content is satisfactory;
determine whether the video sequence has closed caption, responsive to a determination that the transcript generated from the audio content is not satisfactory; and
generate the transcript from the closed caption, responsive to a determination that the video sequence has closed caption.
Dependent claims: 13, 14, 15
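The transcript-generation fallback recited in the second program instructions (speech recognition first, closed caption only if the recognized transcript is unsatisfactory) might be organized as in the following Python sketch. The function `generate_transcript` and its three callable parameters are hypothetical placeholders. The claim does not name a speech recognizer, define "satisfactory", or say what happens when the transcript is unsatisfactory and no closed caption exists. Returning the speech-recognition output in that last case is an assumption made here.

```python
from typing import Callable, List, Optional, Tuple

# A transcript is modelled as (word, time_in_seconds) pairs.
Transcript = List[Tuple[str, float]]


def generate_transcript(
    run_asr: Callable[[], Transcript],
    is_satisfactory: Callable[[Transcript], bool],
    read_closed_caption: Callable[[], Optional[Transcript]],
) -> Transcript:
    """Generate a timed transcript, falling back to closed caption when needed.

    run_asr:             placeholder for an automatic speech recognition step.
    is_satisfactory:     placeholder quality check on the recognized transcript.
    read_closed_caption: returns the caption transcript, or None when the video
                         sequence has no closed caption.
    """
    transcript = run_asr()
    if is_satisfactory(transcript):
        return transcript
    caption = read_closed_caption()
    if caption is not None:
        return caption
    # Assumption: with no closed caption available, keep the ASR output.
    return transcript
```

Passing the three steps in as callables keeps the control flow of the claim visible without committing to any particular recognizer or caption decoder.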
16. A computer program product for partitioning a video sequence, comprising:
a non-transitory computer readable storage medium;
first program instructions to divide a video sequence into a plurality of segments;
second program instructions to select a group of segments from the plurality of segments, wherein the segments in the group of segments are temporally adjacent;
third program instructions to form a partition of the video sequence from the group of segments;
fourth program instructions to denote an end segment, wherein the end segment is a one of the plurality of segments in the group of segments that is located at an end of the group of segments;
fifth program instructions to determine whether an audio content of any of the plurality of segments around the end segment includes only music or only silence;
sixth program instructions to select a one of the any of the plurality of segments around the end segment that includes only music or only silence as a boundary for the partition, responsive to a determination that the audio content of the any of the plurality of segments around the end segment includes only music or only silence;
seventh program instructions to locate a one of the plurality of segments around the end segment having visual content including a narrator shot;
eighth program instructions to select the one of the plurality of segments having visual content including a narrator shot as the boundary for the partition, responsive to a determination that the audio content of the any of the plurality of segments around the end segment does not include only music or only silence; and
wherein the first, second, third, fourth, fifth, sixth, seventh, and eighth program instructions are stored on the non-transitory computer readable storage medium.
Specification