System and method for semantic video segmentation based on joint audiovisual and text analysis

US 20070055695A1
Filed: 08/24/2005
Published: 03/08/2007
Est. Priority Date: 08/24/2005
Status: Active Grant

Abstract

A system and method for partitioning a video into a series of semantic units, where each semantic unit relates to a generally complete thematic topic. The computer-implemented method comprises dividing a video into a plurality of homogeneous segments, analyzing audio and visual content of the video, extracting a plurality of keywords from the speech content of each of the plurality of homogeneous segments of the video, and detecting and merging a plurality of groups of semantically related and temporally adjacent homogeneous segments into a series of semantic units in accordance with the results of both the audio and visual analysis and the keyword extraction. The invention can be applied to generate tables of contents as well as index tables for videos, facilitating efficient video topic searching and browsing.
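
As an illustration of the table-of-contents application mentioned in the abstract, the following is a minimal Python sketch that turns a list of semantic units into table-of-contents lines. The input shape (a start time in seconds plus the unit's keywords), the function names, and the choice of the most frequent keyword as the entry title are all assumptions made for this example; the source does not specify how entries are titled or formatted.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def table_of_contents(units: Sequence[Tuple[float, Sequence[str]]]) -> List[str]:
    """Build one table-of-contents line per semantic unit.

    Each unit is (start_seconds, keywords). The title is the most frequent
    keyword of the unit; this titling rule is an assumption of the sketch.
    """
    lines = []
    for number, (start, keywords) in enumerate(units, start=1):
        counts = Counter(keywords)
        title = counts.most_common(1)[0][0] if counts else "untitled"
        lines.append(f"{number}. [{format_timestamp(start)}] {title}")
    return lines


if __name__ == "__main__":
    # Hypothetical semantic units, purely for demonstration.
    demo_units = [
        (0.0, ["processor", "cache", "processor"]),
        (754.0, ["database"]),
    ]
    for line in table_of_contents(demo_units):
        print(line)
```

An index table could be built from the same data by inverting it, mapping each keyword to the start times of the units in which it occurs.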

58 Citations

20 Claims

1. A computer-implemented method for partitioning a video into a series of semantic units wherein each semantic unit relates to a thematic topic, the method comprising:
- dividing a video into a plurality of homogeneous segments;
  
  analyzing audio and visual content of the video;
  
  extracting a plurality of keywords from speech content of each of the plurality of homogeneous segments of the video; and
  
  detecting and merging a plurality of groups of semantically related and temporally adjacent homogeneous segments into a series of semantic units in accordance with results of both the audio and visual analysis and the keyword extraction.
  - 2. The computer-implemented method according to claim 1, wherein dividing a video into a plurality of homogeneous segments comprises:
    - gathering color data from the video;
      
      generating a color histogram for each video frame of the video, and identifying abrupt video content changes based on the detection of abrupt histogram change in neighboring video frames; and
      
      dividing the video into the plurality of homogeneous segments based on the identified abrupt content changes in the video.
  - 3. The computer-implemented method according to claim 1, wherein analyzing audio and visual content of the video comprises:
    - classifying an audio track of the video to provide a sequence of audio segments having distinct semantic sound labels;
      
      classifying the video content of each of the plurality of homogeneous segments to provide the plurality of homogeneous segments having semantic visual labels; and
      
      integrating the sequence of audio segments having distinct semantic sound labels and the plurality of homogeneous segments having semantic visual labels to provide homogeneous segments having semantic audio and visual labels.
  - 4. The computer-implemented method according to claim 1, wherein extracting a plurality of keywords from speech content of each of the plurality of homogeneous segments of the video, comprises:
    - recognizing the speech content of the video;
      
      generating a time-stamped transcript of the recognized speech content; and
      
      extracting the plurality of keywords from the generated time-stamped transcript for each of the plurality of homogeneous segments of the video.
  - 5. The computer-implemented method according to claim 4, wherein:
    - recognizing the speech content of the video comprises recognizing the speech content of the video directly using automatic speech recognition techniques; and
      
      wherein generating a time-stamped transcript of the recognized speech content comprises generating a time-stamped transcript of the directly recognized speech content by attaching time-stamps to words, phrases, or sentences; and
      
      wherein the method further comprises:
      
      determining if the time-stamped transcript generated from the directly recognized speech content is satisfactory; and
      
      if the time-stamped transcript generated from the directly recognized speech content is not satisfactory, indirectly obtaining the speech transcript of the video.
  - 6. The computer-implemented method according to claim 5, wherein indirectly obtaining the speech transcript of the video, comprises:
    - determining whether the video has closed caption;
      
      if the video has closed caption, obtaining speech content using closed caption extraction; and
      
      if the video does not have closed caption, manually converting spoken words of the video to text; and
      
      attaching time-stamps to words, phrases, or sentences to generate the time-stamped transcript.
  - 7. The computer-implemented method according to claim 4, wherein extracting the plurality of keywords from the generated time-stamped transcript, comprises:
    - recognizing content words or phrases in the transcript;
      
      calculating domain specificity and cohesion of the recognized content words or phrases based on their statistical information; and
      
      selecting highly cohesive domain-specific content words as keywords.
  - 8. The computer-implemented method according to claim 1, wherein detecting and merging a plurality of groups of semantically related and temporally adjacent homogeneous segments into a series of semantic units in accordance with results of both the audio and visual analysis and the keyword extraction, comprises:
    - grouping all extracted keywords into a collection of synonym sets;
      
      determining a distribution pattern across all homogeneous segments for each synonym set which contains a sufficient number of words; and
      
      grouping all homogeneous segments which are temporally adjacent and semantically related into a semantic unit based on the determined distribution pattern.
  - 9. The computer-implemented method according to claim 8, wherein each synonym set comprises words of identical or similar meanings, wherein words of identical or similar meanings include abbreviations of words, alternative spellings of words, orthographical variations of words and words which belong to a same semantic category.
  - 10. The computer-implemented method according to claim 8, wherein determining a distribution pattern across all homogeneous segments for each synonym set which contains a sufficient number of words, comprises:
    - building a keyword occurrence histogram for each synonym set.
  - 11. The computer-implemented method according to claim 8, wherein grouping all homogeneous segments which are temporally adjacent and semantically related into a semantic unit based on the determined distribution pattern, comprises:
    - locating boundaries of the semantic units where thematic topics change using the audio and visual analysis results.
  - 12. The computer-implemented method according to claim 11, wherein locating boundaries of the semantic units where thematic topics change using audio and visual analysis results, comprises:
    - finding the homogeneous segments that will be grouped into one semantic unit, and denoting leading and trailing homogeneous segments of the semantic unit; and
      
      locating silence and music homogeneous segments around the leading and trailing homogeneous segments, and designating the silence and music homogeneous segments as the boundaries, or, if silence and music homogeneous segments are not present, locating narrator homogeneous segments and assigning the narrator homogeneous segments as separators of semantic units.
  - 13. The computer-implemented method according to claim 1, and further comprising:
    - deriving a theme or a topic for each of the semantic units.
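
The segmentation step of claim 2 (per-frame color histograms, abrupt-change detection between neighboring frames, and splitting at the detected changes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: frames are modeled as plain sequences of RGB tuples, the histogram uses 4 bins per channel, the distance is an L1 difference, and the threshold of 0.5 is arbitrary. None of these choices comes from the source.

```python
from typing import List, Optional, Sequence, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each in 0..255
Frame = Sequence[Pixel]        # assumption: a frame is a flat pixel sequence


def color_histogram(frame: Frame, bins_per_channel: int = 4) -> List[float]:
    """Normalized joint RGB histogram with bins_per_channel**3 bins."""
    step = 256 // bins_per_channel
    top = bins_per_channel - 1
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in frame:
        rb, gb, bb = min(r // step, top), min(g // step, top), min(b // step, top)
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    total = float(len(frame)) or 1.0
    return [count / total for count in hist]


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """L1 distance between two normalized histograms (range 0.0 to 2.0)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_cuts(frames: Sequence[Frame], threshold: float = 0.5) -> List[int]:
    """Return indices of frames whose histogram differs abruptly from the previous frame."""
    cuts: List[int] = []
    previous: Optional[List[float]] = None
    for index, frame in enumerate(frames):
        current = color_histogram(frame)
        if previous is not None and histogram_distance(previous, current) > threshold:
            cuts.append(index)
        previous = current
    return cuts


def split_into_segments(num_frames: int, cuts: Sequence[int]) -> List[Tuple[int, int]]:
    """Turn cut indices into half-open (start, end) frame ranges."""
    if num_frames == 0:
        return []
    bounds = [0] + list(cuts) + [num_frames]
    return [(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]


if __name__ == "__main__":
    # Three all-red frames followed by two all-blue frames (synthetic data).
    frames = [[(255, 0, 0)] * 4 for _ in range(3)] + [[(0, 0, 255)] * 4 for _ in range(2)]
    cuts = detect_cuts(frames)
    print(split_into_segments(len(frames), cuts))
```

A fixed threshold only catches abrupt changes, which matches the wording of claim 2; gradual transitions such as fades would need a different detector.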

14. A system for partitioning a video into a series of semantic units wherein each semantic unit relates to a thematic topic, the system comprising:
- a video segmenting unit for dividing a video into a plurality of homogeneous segments;
  
  an audio and visual analyzing unit for analyzing audio and visual content of the video;
  
  a keyword extracting unit for extracting a plurality of keywords from speech content of each of the plurality of homogeneous segments of the video; and
  
  a detecting and merging unit for detecting and merging a plurality of groups of semantically related and temporally adjacent homogeneous segments into a series of semantic units in accordance with results of both the audio and visual analysis and the keyword extraction.
  - 15. The system according to claim 14, wherein the keyword extracting unit comprises:
    - a speech recognition unit for directly recognizing speech content of the video;
      
      a closed caption extraction unit for indirectly obtaining speech content of the video; and
      
      a manual transcription mode for converting spoken words to text.
  - 16. The system according to claim 14, wherein the detecting and merging unit comprises:
    - a mechanism for grouping all extracted keywords into a collection of synonym sets;
      
      a mechanism for determining a distribution pattern across all homogeneous segments for each synonym set which contains a sufficient number of words; and
      
      a mechanism for grouping all homogeneous segments, which are temporally adjacent and semantically related, into a semantic unit based on the determined distribution pattern.
  - 17. The system according to claim 16, wherein the detecting and merging unit further comprises:
    - a mechanism for locating boundaries of the semantic units.
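
The three mechanisms of claim 16 (synonym-set grouping, a per-set distribution pattern across segments, and grouping of adjacent related segments) can be sketched as below. The claims do not state the merging rule, so this sketch assumes one: two neighboring segments are treated as related when at least one sufficiently large synonym set occurs in both. The `synonym_of` mapping, the `min_words` cutoff, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set, Tuple


def occurrence_histograms(
    segment_keywords: Sequence[Sequence[str]],
    synonym_of: Dict[str, str],
) -> Dict[str, List[int]]:
    """For each synonym set, count keyword occurrences in every segment."""
    histograms: Dict[str, List[int]] = {}
    for index, keywords in enumerate(segment_keywords):
        for word in keywords:
            key = word.lower()
            set_id = synonym_of.get(key, key)  # unmapped words form their own set
            histograms.setdefault(set_id, [0] * len(segment_keywords))[index] += 1
    return histograms


def eligible_sets(synonym_of: Dict[str, str], min_words: int = 2) -> Set[str]:
    """Synonym sets that contain at least min_words member words."""
    sizes = Counter(synonym_of.values())
    return {set_id for set_id, size in sizes.items() if size >= min_words}


def group_segments(
    segment_keywords: Sequence[Sequence[str]],
    synonym_of: Dict[str, str],
    min_words: int = 2,
) -> List[Tuple[int, int]]:
    """Merge adjacent segments into units, returned as inclusive (first, last) index pairs.

    Assumed rule: neighbors are related if some eligible synonym set
    occurs in both of them.
    """
    count = len(segment_keywords)
    if count == 0:
        return []
    histograms = occurrence_histograms(segment_keywords, synonym_of)
    keep = eligible_sets(synonym_of, min_words)
    units: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, count):
        related = any(
            hist[i - 1] > 0 and hist[i] > 0
            for set_id, hist in histograms.items()
            if set_id in keep
        )
        if not related:
            units.append((start, i - 1))
            start = i
    units.append((start, count - 1))
    return units


if __name__ == "__main__":
    # Hypothetical synonym mapping: member word -> synonym set identifier.
    synonym_of = {
        "cpu": "processor", "processor": "processor", "chip": "processor",
        "db": "database", "database": "database",
    }
    segments = [["CPU", "cache"], ["processor"], ["chip"], ["db"], ["database"]]
    print(group_segments(segments, synonym_of))
```

The boundary-locating mechanism of claim 17 would then refine the edges of each unit using the audio and visual labels; that step is outside this sketch.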

18. A computer program product for partitioning a video into a series of semantic units wherein each semantic unit relates to a thematic topic, the computer program product comprising:
- a computer usable medium having computer usable program code embodied therein;
  
  computer usable program code configured to divide a video into a plurality of homogeneous segments;
  
  computer usable program code configured to analyze audio and visual content of the video;
  
  computer usable program code configured to extract a plurality of keywords from speech content of each of the plurality of homogeneous segments of the video; and
  
  computer usable program code configured to detect and merge a plurality of groups of semantically related and temporally adjacent homogeneous segments into a series of semantic units in accordance with results of both the audio and visual analysis and the keyword extraction.
  - 19. The computer program product according to claim 18, wherein the computer usable program code configured to extract a plurality of keywords from speech content of each of the plurality of homogeneous segments of the video, comprises:
    - computer usable program code configured to recognize the speech content of the incoming video sequence;
      
      computer usable program code configured to generate a time-stamped transcript of the recognized speech content; and
      
      computer usable program code configured to extract the list of keywords for each homogeneous segment from the generated transcript.
  - 20. The computer program product according to claim 18, wherein the computer usable program code configured to detect and merge a plurality of groups of semantically related and temporally adjacent homogeneous segments into a series of semantic units in accordance with results of both the audio and visual analysis and the keyword extraction, comprises:
    - computer usable program code configured to group all extracted keywords into a collection of synonym sets;
      
      computer usable program code configured to determine a distribution pattern across all homogeneous segments for each synonym set which contains a sufficient number of words; and
      
      computer usable program code configured to group all homogeneous segments, which are temporally adjacent and semantically related, into a semantic unit based on the determined distribution pattern by locating boundaries of the semantic units where thematic topics change using the audio and visual analysis results.
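
The last step of claim 19, extracting keywords for each homogeneous segment from a time-stamped transcript, can be illustrated with the sketch below. It assumes word-level time-stamps and segments given by their sorted start times. The keyword selection here is deliberately a stand-in: it keeps unique non-stopword tokens, whereas claim 7 describes selection by domain specificity and cohesion, which this sketch does not implement. The stopword list and function names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "and", "to", "is", "in", "we", "this", "that", "it"}
)


def words_per_segment(
    transcript: Sequence[Tuple[float, str]],
    segment_starts: Sequence[float],
) -> List[List[str]]:
    """Assign each (time, word) pair to the segment whose time range contains it.

    segment_starts must be sorted ascending; segment k covers
    [segment_starts[k], segment_starts[k + 1]). Words time-stamped before
    the first segment start are dropped.
    """
    buckets: List[List[str]] = [[] for _ in segment_starts]
    for time, word in transcript:
        k = bisect_right(segment_starts, time) - 1
        if k >= 0:
            buckets[k].append(word)
    return buckets


def keywords_per_segment(
    transcript: Sequence[Tuple[float, str]],
    segment_starts: Sequence[float],
) -> List[List[str]]:
    """Unique non-stopword tokens per segment, in order of first occurrence."""
    result: List[List[str]] = []
    for words in words_per_segment(transcript, segment_starts):
        kept: List[str] = []
        for word in words:
            token = word.lower().strip(".,;:!?")
            if token and token not in STOPWORDS and token not in kept:
                kept.append(token)
        result.append(kept)
    return result


if __name__ == "__main__":
    # Hypothetical word-level transcript: (seconds, word).
    transcript = [
        (0.5, "The"), (1.0, "processor"), (2.2, "cache"),
        (5.1, "The"), (5.6, "database"), (6.0, "database."),
    ]
    print(keywords_per_segment(transcript, [0.0, 5.0]))
```

Keeping the time-stamps at word level makes the assignment a simple binary search; phrase-level or sentence-level time-stamps, which the claims also allow, would need a rule for units that straddle a segment boundary.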

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Chitra Dorai, Ying Li, Youngja Park

Granted Patent: US 7,382,933 B2
US Class Current: 1/1
CPC Class Codes

G06F 16/7834 using audio features

G06F 16/7844 using original textual cont...
