
System and method for semantic video segmentation based on joint audiovisual and text analysis

  • US 7,382,933 B2
  • Filed: 08/24/2005
  • Issued: 06/03/2008
  • Est. Priority Date: 08/24/2005
  • Status: Expired due to Fees
First Claim

1. A computer implemented method for partitioning a video into a series of semantic units wherein each semantic unit relates to a thematic topic, the method comprising:

  • dividing the video into a plurality of homogeneous segments, wherein dividing the video into the plurality of homogeneous segments further comprises:

    generating a color histogram for each video frame of the video;

    identifying abrupt video content changes based on a detection of abrupt histogram change in neighboring video frames; and

    dividing the video into the plurality of homogeneous segments based on the abrupt video content changes identified in the video;
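The histogram-based division step can be illustrated with a minimal sketch. The claim does not specify the histogram resolution, the distance measure, or the threshold, so the 4-bins-per-channel RGB histogram, the L1 distance, the `threshold` value, and all function names here are assumptions made for illustration. Frames are modeled as plain lists of `(r, g, b)` pixels.

```python
def color_histogram(frame, bins_per_channel=4):
    """Normalized joint RGB histogram with bins_per_channel**3 bins (assumed layout)."""
    hist = [0.0] * (bins_per_channel ** 3)
    step = 256 // bins_per_channel
    for r, g, b in frame:
        idx = ((r // step) * bins_per_channel ** 2
               + (g // step) * bins_per_channel
               + (b // step))
        hist[idx] += 1.0
    total = len(frame) or 1
    return [count / total for count in hist]


def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms, in the range [0, 2]."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def split_into_segments(frames, threshold=0.5):
    """Return (start, end) frame-index pairs, end exclusive.

    A cut is placed wherever neighboring frames differ by more than
    `threshold` (a hypothetical value, not taken from the patent).
    """
    if not frames:
        return []
    hists = [color_histogram(f) for f in frames]
    cuts = [i for i in range(1, len(hists))
            if histogram_distance(hists[i - 1], hists[i]) > threshold]
    starts = [0] + cuts
    ends = cuts + [len(frames)]
    return list(zip(starts, ends))
```

Comparing only neighboring frames keeps the sketch faithful to the "abrupt histogram change" wording; gradual transitions such as fades would need a different test.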

    analyzing audio and visual content of the video, wherein analyzing the audio and the visual content of the video further comprises:

    classifying an audio track of the video to provide a sequence of audio segments having distinct semantic sound labels;

    classifying the visual content of each of the plurality of homogeneous segments to provide a plurality of homogeneous segments having semantic visual labels; and

    integrating the sequence of audio segments having distinct semantic sound labels and the plurality of homogeneous segments having semantic visual labels to provide homogeneous segments having semantic audio and visual labels;
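One plausible reading of the integration step is that each homogeneous segment inherits the sound label that overlaps it longest in time. The claim does not state the integration rule, so the maximum-overlap rule, the tuple layout, and the label strings in this sketch are assumptions.

```python
from collections import defaultdict


def integrate_labels(visual_segments, audio_segments):
    """Attach an audio label to every visually labeled segment.

    visual_segments: list of (start, end, visual_label)
    audio_segments:  list of (start, end, sound_label)
    The audio label with the greatest total temporal overlap wins
    (an assumed rule); None is used when nothing overlaps.
    """
    result = []
    for v_start, v_end, v_label in visual_segments:
        overlap = defaultdict(float)
        for a_start, a_end, a_label in audio_segments:
            shared = min(v_end, a_end) - max(v_start, a_start)
            if shared > 0:
                overlap[a_label] += shared
        audio_label = max(overlap, key=overlap.get) if overlap else None
        result.append({"start": v_start, "end": v_end,
                       "visual": v_label, "audio": audio_label})
    return result
```

The output keeps the visual segmentation as the time base, which matches the claim's phrasing that the result is a set of homogeneous segments carrying both kinds of label.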

    extracting a plurality of keywords from speech content of each of the plurality of homogeneous segments of the video, wherein extracting the plurality of keywords from the speech content further comprises:

    recognizing the speech content of the video directly using automatic speech recognition techniques;

    generating a time-stamped transcript of the speech content to form directly recognized speech content by attaching time-stamps to at least one of words, phrases, or sentences; and

    responsive to a determination that the directly recognized speech content is unsatisfactory, indirectly obtaining a speech transcript of the video, wherein indirectly obtaining the speech transcript of the video further comprises;

    extracting the plurality of keywords from the time-stamped transcript generated for each of the plurality of homogeneous segments of the video, wherein extracting the plurality of keywords from the time-stamped transcript further comprises:

    recognizing content words or phrases in the time-stamped transcript;

    calculating domain specificity and cohesion of the content words or phrases recognized based on their statistical information; and

    selecting highly cohesive domain-specific content words as the plurality of keywords;
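The keyword-selection sub-steps can be sketched as a pair of scores and a filter. The claim says only that domain specificity and cohesion are computed from "statistical information", so both formulas here are assumptions: specificity is taken as the term's relative frequency in the transcript divided by its relative frequency in a hypothetical general-purpose background corpus, and cohesion of a multi-word phrase is its count divided by the mean count of its component words. The thresholds are likewise illustrative.

```python
from collections import Counter


def domain_specificity(term_count, total_terms, background_freq, floor=1e-6):
    """Ratio of in-transcript frequency to background frequency (assumed measure)."""
    return (term_count / total_terms) / max(background_freq, floor)


def cohesion(phrase, phrase_counts, word_counts):
    """How often the words of a phrase occur together (assumed measure).

    Single words are treated as fully cohesive.
    """
    words = phrase.split()
    if len(words) == 1:
        return 1.0
    mean_word_count = sum(word_counts[w] for w in words) / len(words)
    return phrase_counts[phrase] / mean_word_count if mean_word_count else 0.0


def select_keywords(candidates, words, background,
                    min_specificity=5.0, min_cohesion=0.5):
    """Keep candidates that are both domain-specific and cohesive.

    candidates: occurrences of recognized content words or phrases
    words:      all word tokens of the transcript
    background: term -> relative frequency in a general corpus (hypothetical)
    """
    phrase_counts = Counter(candidates)
    word_counts = Counter(words)
    total = len(words) or 1
    keywords = []
    for term, count in phrase_counts.items():
        spec = domain_specificity(count, total, background.get(term, 0.0))
        coh = cohesion(term, phrase_counts, word_counts)
        if spec >= min_specificity and coh >= min_cohesion:
            keywords.append(term)
    return sorted(keywords)
```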

    detecting and merging a plurality of groups of semantically related and temporally adjacent homogeneous segments into a series of semantic units in accordance with results of both the audio and the visual analysis and the plurality of keywords from the speech content extraction, wherein detecting and merging a plurality of groups of semantically related and temporally adjacent homogeneous segments into the series of semantic units further comprises:

    grouping the plurality of keywords extracted into a collection of synonym sets;

    determining a distribution pattern across the plurality of homogeneous segments for each of the synonym sets by building a keyword occurrence histogram for each of the synonym sets, wherein the synonym sets comprise words of identical or similar meanings; and

    grouping the plurality of homogeneous segments which are temporally adjacent and semantically related into a semantic unit of the series of semantic units based on the distribution pattern by locating boundaries of the semantic unit where thematic topics change using results from the audio and the visual content analysis, wherein locating the boundaries of the semantic unit where the thematic topics change further comprises:

    finding the plurality of homogeneous segments that will be grouped into the semantic unit;

    denoting leading and trailing homogeneous segments of the semantic unit;

    locating silence and music homogeneous segments around the leading and trailing homogeneous segments of the semantic unit;

    locating narrator homogeneous segments; and

    designating the silence and music homogeneous segments as the boundaries, or, if the silence and music homogeneous segments are not present, designating the narrator homogeneous segments as separators of the semantic unit; and
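The boundary-designation rule (prefer silence or music segments, otherwise fall back to narrator segments) can be sketched as a small search around the leading and trailing segments of a unit. The segment dictionaries, the label strings, and the `window` parameter bounding how far "around" the unit the search looks are assumptions, not details given in the claim.

```python
def find_separator(segments, index, direction, window=2):
    """Look up to `window` segments away from `index` for a separator.

    direction is -1 (before the leading segment) or +1 (after the
    trailing segment). Silence or music segments are preferred; a
    narrator segment is used only if none is found. Returns the index
    of the chosen segment, or None.
    """
    candidates = []
    pos = index + direction
    while 0 <= pos < len(segments) and len(candidates) < window:
        candidates.append(pos)
        pos += direction
    for pos in candidates:
        if segments[pos]["audio"] in ("silence", "music"):
            return pos
    for pos in candidates:
        if segments[pos]["visual"] == "narrator":
            return pos
    return None


def unit_boundaries(segments, leading, trailing, window=2):
    """Return (boundary before the leading segment, boundary after the trailing one)."""
    return (find_separator(segments, leading, -1, window),
            find_separator(segments, trailing, +1, window))
```

Running the two preference passes separately ensures a nearby narrator segment never outranks a slightly more distant silence or music segment, which follows the ordering of the claim's fallback clause.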

    deriving a theme or a topic for each of the series of semantic units.
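The synonym-set histogram and grouping steps of the claim can be sketched as follows. The claim does not say how the distribution pattern is turned into groups or how a theme is derived, so this sketch makes two loud assumptions: adjacent segments are merged when they share the same dominant synonym set, and the name of that set stands in for the unit's theme. The synonym sets themselves are supplied by hand here; a real system would need a lexical resource to build them.

```python
def keyword_histograms(segment_keywords, synonym_sets):
    """Build one occurrence histogram per synonym set.

    segment_keywords: list with one keyword list per homogeneous segment
    synonym_sets:     set name -> set of words with similar meaning
    Returns set name -> list of counts, one entry per segment.
    """
    lookup = {word: name
              for name, words in synonym_sets.items()
              for word in words}
    hists = {name: [0] * len(segment_keywords) for name in synonym_sets}
    for i, keywords in enumerate(segment_keywords):
        for keyword in keywords:
            name = lookup.get(keyword)
            if name is not None:
                hists[name][i] += 1
    return hists


def dominant_set(hists, i):
    """Name of the synonym set with the most occurrences in segment i, or None."""
    best = max(hists, key=lambda name: hists[name][i], default=None)
    return best if best is not None and hists[best][i] > 0 else None


def group_segments(hists, n_segments):
    """Merge adjacent segments that share a dominant synonym set (assumed rule).

    Returns (start, end, theme) triples with end exclusive.
    """
    units = []
    for i in range(n_segments):
        theme = dominant_set(hists, i)
        if units and units[-1][2] == theme:
            units[-1] = (units[-1][0], i + 1, theme)
        else:
            units.append((i, i + 1, theme))
    return units
```

This keyword-only grouping leaves out the audio and visual cues that the claim also uses to place unit boundaries, so it should be read as one ingredient rather than the full merging step.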
