System and method for semantic video segmentation based on joint audiovisual and text analysis

US 8,121,432 B2
Filed: 03/25/2008
Issued: 02/21/2012
Est. Priority Date: 08/24/2005
Status: Expired due to Fees

Abstract

System and method for partitioning a video into a series of semantic units, where each semantic unit relates to a generally complete thematic topic. A computer-implemented method for partitioning a video into a series of semantic units, wherein each semantic unit relates to a theme or a topic, comprises dividing a video into a plurality of homogeneous segments, analyzing audio and visual content of the video, extracting a plurality of keywords from the speech content of each of the plurality of homogeneous segments of the video, and detecting and merging a plurality of groups of semantically related and temporally adjacent homogeneous segments into a series of semantic units in accordance with the results of both the audio and visual analysis and the keyword extraction. The present invention can be applied to generate tables of contents as well as index tables for videos to facilitate efficient video topic searching and browsing.

16 Claims

1. A method for partitioning a video sequence, comprising:
- dividing a video sequence into a plurality of segments;
  
  generating a transcript of speech content of the video sequence, wherein the transcript comprises a plurality of words and identifies temporal locations of the words in the video sequence;
  
  selecting a plurality of keywords from the plurality of words in the transcript;
  
  selecting a set of keywords from the plurality of keywords, wherein the keywords in the set of keywords are related to each other by meanings of the keywords;
  
  determining a distribution of occurrences across the plurality of segments of the keywords in the set of keywords;
  
  selecting a group of segments from the plurality of segments using the distribution, wherein the segments in the group of segments are temporally adjacent and the group of segments corresponds to a peak of the occurrences across the plurality of segments of the keywords in the set of keywords; and
  
  forming a partition of the video sequence from the group of segments;
  
  wherein generating the transcript of speech content of the video sequence comprises generating the transcript from audio content of the video sequence using automatic speech recognition, determining whether the transcript generated from the audio content is satisfactory, responsive to a determination that the transcript generated from the audio content is not satisfactory, determining whether the video sequence has closed caption, and responsive to a determination that the video sequence has closed caption, generating the transcript from the closed caption.
  - 2. The method of claim 1, wherein the video sequence comprises frames and wherein dividing the video sequence into the plurality of segments comprises:
    - obtaining color data for the frames;
      
      identifying from the color data temporal locations of abrupt color changes in the video sequence, wherein the locations of abrupt color changes correspond to abrupt color changes between adjacent ones of the frames; and
      
      dividing the video sequence into the plurality of segments at the locations of abrupt color changes.
  - 3. The method of claim 1, further comprising:
    - responsive to a determination that the video sequence does not have closed caption, manually generating the transcript from the audio content.
  - 4. The method of claim 1, wherein each of the keywords is a type of keyword selected from the group of types of keywords consisting of a single word and a phrase.
  - 5. The method of claim 1, further comprising:
    - generating a sound label for each of the plurality of segments, wherein the sound label indicates a class of sound in audio content of a corresponding segment;
      
      generating a visual label for each of the plurality of segments, wherein the visual label indicates a class of visual content of the corresponding segment; and
      
      selecting a one of the plurality of segments as a boundary for the partition using a one of the sound label or the visual label for the selected one of the plurality of segments.
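
The keyword steps of claim 1 (counting occurrences of a set of related keywords per segment, then picking the temporally adjacent segments around the peak) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the `(word, time)` transcript format, and the `min_fraction` rule for growing the group around the peak are all assumptions made for illustration, and only single-word keywords are handled.

```python
def keyword_distribution(segments, transcript, keyword_set):
    """Count occurrences of the related keywords in each segment.

    segments:    list of (start, end) times in seconds, in temporal order
    transcript:  list of (word, time) pairs, i.e. words with temporal locations
    keyword_set: set of lowercase keywords assumed to be related by meaning
    """
    counts = [0] * len(segments)
    for word, t in transcript:
        if word.lower() not in keyword_set:
            continue
        for i, (start, end) in enumerate(segments):
            if start <= t < end:
                counts[i] += 1
                break
    return counts


def peak_group(counts, min_fraction=0.5):
    """Return (first, last) indices of adjacent segments around the peak.

    Assumption: the group grows outward from the highest-count segment while
    each neighbor has at least min_fraction of the peak count.
    """
    if not counts or max(counts) == 0:
        return None
    peak = max(counts)
    center = counts.index(peak)
    threshold = min_fraction * peak
    first = last = center
    while first > 0 and counts[first - 1] >= threshold:
        first -= 1
    while last < len(counts) - 1 and counts[last + 1] >= threshold:
        last += 1
    return first, last


# Hypothetical data: five 10-second segments and a short timed transcript.
segments = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
transcript = [("intro", 1), ("budget", 12), ("tax", 14), ("budget", 22),
              ("revenue", 25), ("tax", 27), ("budget", 33), ("weather", 45)]
related = {"budget", "tax", "revenue"}

counts = keyword_distribution(segments, transcript, related)
group = peak_group(counts)
```

With this data the counts are `[0, 2, 3, 1, 0]` and the selected group is segments 1 through 2, so the partition would span 10 s to 30 s.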

6. A method for partitioning a video sequence, comprising:
- dividing a video sequence into a plurality of segments;
  
  selecting a group of segments from the plurality of segments, wherein the segments in the group of segments are temporally adjacent;
  
  forming a partition of the video sequence from the group of segments;
  
  denoting an end segment, wherein the end segment is a one of the plurality of segments in the group of segments that is located at an end of the group of segments;
  
  determining whether an audio content of any of the plurality of segments around the end segment includes only music or only silence;
  
  responsive to a determination that the audio content of the any of the plurality of segments around the end segment includes only music or only silence, selecting a one of the any of the plurality of segments around the end segment that includes only music or only silence as a boundary for the partition; and
  
  responsive to a determination that the audio content of the any of the plurality of segments around the end segment does not include only music or only silence, locating a one of the plurality of segments around the end segment having visual content including a narrator shot and selecting the one of the plurality of segments having visual content including a narrator shot as the boundary for the partition.
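
The boundary rule of claim 6 can be sketched as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the label values (`"music"`, `"silence"`, `"narrator"`), the `window` size that defines "around the end segment", the nearest-first search order, and the fallback to the end segment itself when neither cue is found are all hypothetical choices.

```python
def refine_boundary(labels, end_index, window=2):
    """Pick a boundary segment near the end segment of a group.

    labels:    one dict per segment with hypothetical keys 'sound' and 'visual'
    end_index: index of the segment at the end of the selected group
    window:    how many segments on each side count as "around" (assumption)
    """
    lo = max(0, end_index - window)
    hi = min(len(labels) - 1, end_index + window)
    # Search nearest segments first (assumption); ties go to the earlier one.
    nearby = sorted(range(lo, hi + 1), key=lambda i: (abs(i - end_index), i))

    # First preference: a segment whose audio is only music or only silence.
    for i in nearby:
        if labels[i]["sound"] in ("music", "silence"):
            return i
    # Otherwise: a segment whose visual content includes a narrator shot.
    for i in nearby:
        if labels[i]["visual"] == "narrator":
            return i
    # Neither cue found: keep the end segment (assumption, not in the claim).
    return end_index


labels = [
    {"sound": "speech", "visual": "slide"},
    {"sound": "speech", "visual": "narrator"},
    {"sound": "speech", "visual": "slide"},
    {"sound": "music", "visual": "slide"},
    {"sound": "speech", "visual": "slide"},
]
boundary = refine_boundary(labels, end_index=2)
```

Here the music-only segment at index 3 is chosen; if no nearby segment were music or silence, the narrator shot at index 1 would be chosen instead.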

7. An apparatus, comprising:
- a processing unit, wherein the processing unit is configured to:
  
  divide a video sequence into a plurality of segments;
  
  generate a transcript of speech content of the video sequence, wherein the transcript comprises a plurality of words and identifies temporal locations of the words in the video sequence;
  
  select a plurality of keywords from the plurality of words in the transcript;
  
  select a set of keywords from the plurality of keywords, wherein the keywords in the set of keywords are related to each other by meanings of the keywords;
  
  determine a distribution of occurrences across the plurality of segments of the keywords in the set of keywords;
  
  select a group of segments from the plurality of segments using the distribution, wherein the segments in the group of segments are temporally adjacent and the group of segments corresponds to a peak of the occurrences across the plurality of segments of the keywords in the set of keywords;
  
  form a partition of the video sequence from the group of segments;
  
  generate the transcript of speech content of the video sequence from audio content of the video sequence using automatic speech recognition;
  
  determine whether the transcript generated from the audio content is satisfactory;
  
  determine whether the video sequence has closed caption, responsive to a determination that the transcript generated from the audio content is not satisfactory; and
  
  generate the transcript from the closed caption, responsive to a determination that the video sequence has closed caption.
  - 8. The apparatus of claim 7, wherein the video sequence comprises frames and wherein the processing unit is configured to:
    - obtain color data for the frames;
      
      identify from the color data temporal locations of abrupt color changes in the video sequence, wherein the locations of abrupt color changes correspond to abrupt color changes between adjacent ones of the frames; and
      
      divide the video sequence into the plurality of segments at the locations of abrupt color changes.
  - 9. The apparatus of claim 7, wherein each of the keywords is a type of keyword selected from the group of types of keywords consisting of a single word and a phrase.
  - 10. The apparatus of claim 7, wherein the processing unit is further configured to:
    - generate a sound label for each of the plurality of segments, wherein the sound label indicates a class of sound in audio content of a corresponding segment;
      
      generate a visual label for each of the plurality of segments, wherein the visual label indicates a class of visual content of the corresponding segment; and
      
      select a one of the plurality of segments as a boundary for the partition using a one of the sound label or the visual label for the selected one of the plurality of segments.
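
Claims 2 and 8 divide the video at abrupt color changes between adjacent frames. One common way to detect such changes is to compare coarse color histograms of consecutive frames; the sketch below assumes that technique. The claims do not specify a histogram, a distance measure, or a threshold, so `color_histogram`, `split_at_color_changes`, the L1 distance, and the default `threshold` are all illustrative assumptions.

```python
def color_histogram(frame, bins=4):
    """Normalized coarse RGB histogram of a frame.

    frame: iterable of (r, g, b) pixels with channel values 0-255
    """
    step = 256 // bins
    hist = [0.0] * (bins ** 3)
    n = 0
    for r, g, b in frame:
        # Quantize each channel, clamping so every value maps to a valid bin.
        qr, qg, qb = (min(v // step, bins - 1) for v in (r, g, b))
        hist[qr * bins * bins + qg * bins + qb] += 1
        n += 1
    return [h / n for h in hist] if n else hist


def split_at_color_changes(frames, threshold=0.5):
    """Return inclusive (first_frame, last_frame) index pairs, one per segment.

    A cut is placed between adjacent frames whose histograms differ by more
    than threshold in L1 distance (range 0 to 2 for normalized histograms).
    """
    if not frames:
        return []
    hists = [color_histogram(f) for f in frames]
    segments = []
    start = 0
    for i in range(1, len(frames)):
        diff = sum(abs(a - b) for a, b in zip(hists[i - 1], hists[i]))
        if diff > threshold:
            segments.append((start, i - 1))
            start = i
    segments.append((start, len(frames) - 1))
    return segments


# Hypothetical 4-pixel frames: two red frames followed by three blue frames.
red = [(255, 0, 0)] * 4
blue = [(0, 0, 255)] * 4
shots = split_at_color_changes([red, red, blue, blue, blue])
```

The abrupt change from red to blue yields two segments, frames 0 to 1 and frames 2 to 4.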

11. An apparatus, comprising:
- a processing unit, wherein the processing unit is configured to:
  
  divide a video sequence into a plurality of segments;
  
  select a group of segments from the plurality of segments, wherein the segments in the group of segments are temporally adjacent;
  
  form a partition of the video sequence from the group of segments;
  
  denote an end segment, wherein the end segment is a one of the plurality of segments in the group of segments that is located at an end of the group of segments;
  
  determine whether an audio content of any of the plurality of segments around the end segment includes only music or only silence;
  
  select a one of the any of the plurality of segments around the end segment that includes only music or only silence as a boundary for the partition, responsive to a determination that the audio content of the any of the plurality of segments around the end segment includes only music or only silence;
  
  locate a one of the plurality of segments around the end segment having visual content including a narrator shot; and
  
  select the one of the plurality of segments having visual content including a narrator shot as the boundary for the partition, responsive to a determination that the audio content of the any of the plurality of segments around the end segment does not include only music or only silence.

12. A computer program product for partitioning a video sequence, comprising:
- a non-transitory computer readable storage medium;
  
  first program instructions to divide a video sequence into a plurality of segments;
  
  second program instructions to generate a transcript of speech content of the video sequence, wherein the transcript comprises a plurality of words and identifies temporal locations of the words in the video sequence;
  
  third program instructions to select a plurality of keywords from the plurality of words in the transcript;
  
  fourth program instructions to select a set of keywords from the plurality of keywords, wherein the keywords in the set of keywords are related to each other by meanings of the keywords;
  
  fifth program instructions to determine a distribution of occurrences across the plurality of segments of the keywords in the set of keywords;
  
  sixth program instructions to select a group of segments from the plurality of segments using the distribution, wherein the segments in the group of segments are temporally adjacent and the group of segments corresponds to a peak of the occurrences across the plurality of segments of the keywords in the set of keywords;
  
  seventh program instructions to form a partition of the video sequence from the group of segments; and
  
  wherein the first, second, third, fourth, fifth, sixth, and seventh program instructions are stored on the non-transitory computer readable storage medium, wherein the second program instructions comprise program instructions to:
  
  generate the transcript of speech content of the video sequence from audio content of the video sequence using automatic speech recognition;
  
  determine whether the transcript generated from the audio content is satisfactory;
  
  determine whether the video sequence has closed caption, responsive to a determination that the transcript generated from the audio content is not satisfactory; and
  
  generate the transcript from the closed caption, responsive to a determination that the video sequence has closed caption.
  - 13. The computer program product of claim 12, wherein the video sequence comprises frames and wherein the first program instructions comprise program instructions to:
    - obtain color data for the frames;
      
      identify from the color data temporal locations of abrupt color changes in the video sequence, wherein the locations of abrupt color changes correspond to abrupt color changes between adjacent ones of the frames; and
      
      divide the video sequence into the plurality of segments at the locations of abrupt color changes.
  - 14. The computer program product of claim 12, wherein each of the keywords is a type of keyword selected from the group of types of keywords consisting of a single word and a phrase.
  - 15. The computer program product of claim 12, further comprising:
    - eighth program instructions to generate a sound label for each of the plurality of segments, wherein the sound label indicates a class of sound in audio content of a corresponding segment;
      
      ninth program instructions to generate a visual label for each of the plurality of segments, wherein the visual label indicates a class of visual content of the corresponding segment;
      
      tenth program instructions to select a one of the plurality of segments as a boundary for the partition using a one of the sound label or the visual label for the selected one of the plurality of segments; and
      
      wherein the eighth, ninth, and tenth program instructions are stored on the non-transitory computer readable storage medium.
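
The transcript-generation order recited in claims 1, 7, and 12 (automatic speech recognition first, then closed caption if the result is not satisfactory) can be sketched as a simple fallback chain. The recognizer, the quality check, and the caption reader are passed in as callables because the claims do not name any specific components; every name below is hypothetical. The optional manual step reflects method claim 3, and returning the unsatisfactory ASR output as a last resort is an assumption.

```python
def generate_transcript(video, run_asr, is_satisfactory, read_captions,
                        transcribe_manually=None):
    """Return (transcript, source) following the claimed fallback order.

    run_asr(video)             -> transcript from the audio content
    is_satisfactory(transcript) -> bool quality check (criteria are an assumption)
    read_captions(video)       -> transcript, or None if no closed caption
    transcribe_manually(video) -> transcript produced by a person (claim 3)
    """
    transcript = run_asr(video)
    if is_satisfactory(transcript):
        return transcript, "asr"

    captions = read_captions(video)
    if captions is not None:
        return captions, "closed_caption"

    if transcribe_manually is not None:
        return transcribe_manually(video), "manual"

    # Assumption: with no other source, fall back to the ASR output as-is.
    return transcript, "asr_unsatisfactory"


# Hypothetical stand-ins for real components.
result = generate_transcript(
    "lecture.mpg",
    run_asr=lambda v: [("helo", 0.0)],
    is_satisfactory=lambda t: False,
    read_captions=lambda v: [("hello", 0.0)],
)
```

In this example the ASR output is judged unsatisfactory and the video has closed caption, so the caption-derived transcript is returned.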

16. A computer program product for partitioning a video sequence, comprising:
- a non-transitory computer readable storage medium;
  
  first program instructions to divide a video sequence into a plurality of segments;
  
  second program instructions to select a group of segments from the plurality of segments, wherein the segments in the group of segments are temporally adjacent;
  
  third program instructions to form a partition of the video sequence from the group of segments;
  
  fourth program instructions to denote an end segment, wherein the end segment is a one of the plurality of segments in the group of segments that is located at an end of the group of segments;
  
  fifth program instructions to determine whether an audio content of any of the plurality of segments around the end segment includes only music or only silence;
  
  sixth program instructions to select a one of the any of the plurality of segments around the end segment that includes only music or only silence as a boundary for the partition, responsive to a determination that the audio content of the any of the plurality of segments around the end segment includes only music or only silence;
  
  seventh program instructions to locate a one of the plurality of segments around the end segment having visual content including a narrator shot;
  
  eighth program instructions to select the one of the plurality of segments having visual content including a narrator shot as the boundary for the partition, responsive to a determination that the audio content of the any of the plurality of segments around the end segment does not include only music or only silence; and
  
  wherein the first, second, third, fourth, fifth, sixth, seventh, and eighth program instructions are stored on the non-transitory computer readable storage medium.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Dorai, Chitra; Li, Ying; Park, Youngja
Primary Examiner(s): Cunningham, Gregory F

Application Number: US12/055,023
Publication Number: US 20080175556A1
Time in Patent Office: 1,428 Days
Field of Search: 382/276, 704/E15.024, 706/55, 707/739, 707/794, 707/E17.098, 715/723, 717/143, 725/45
US Class Current: 382/276
CPC Class Codes:
G06F 16/7834 using audio features
G06F 16/7844 using original textual cont...