System and method for automated multimedia content indexing and retrieval
Abstract
The invention provides a system and method for automatically indexing and retrieving multimedia content. The method may include separating a multimedia data stream into audio, visual and text components, segmenting the audio, visual and text components based on semantic differences, identifying at least one target speaker using the audio and visual components, identifying a topic of the multimedia event using the segmented text and topic category models, generating a summary of the multimedia event based on the audio, visual and text components, the identified topic and the identified target speaker, and generating a multimedia description of the multimedia event based on the identified target speaker, the identified topic, and the generated summary.
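The indexing pipeline summarized in the abstract can be sketched end to end. The sketch below is illustrative only: the multimedia stream is modeled as a dict, and the segmentation and topic functions are trivial stand-ins (blank-line splitting and keyword lookup) for the semantic segmentation and topic category models the patent describes.

```python
# Toy sketch of the pipeline in the abstract. All names are illustrative;
# a real system would operate on demultiplexed audio/video streams, not dicts.

def separate_components(stream):
    """Split a multimedia 'stream' (here a dict) into its three components."""
    return stream["audio"], stream["visual"], stream["text"]

def segment_text(text):
    """Stand-in for semantic segmentation: split on blank lines."""
    return [blk.strip() for blk in text.split("\n\n") if blk.strip()]

def classify_topic(block, topic_models):
    """Stand-in for topic category models: simple keyword lookup."""
    for topic, keywords in topic_models.items():
        if any(k in block.lower() for k in keywords):
            return topic
    return "unknown"

def index_event(stream, topic_models):
    """Run the (simplified) pipeline: separate, segment, categorize, summarize."""
    audio, visual, text = separate_components(stream)
    blocks = segment_text(text)
    topics = [classify_topic(b, topic_models) for b in blocks]
    summary = blocks[0][:60] if blocks else ""   # crude lead-sentence summary
    return {"topics": topics, "summary": summary}

stream = {
    "audio": b"...", "visual": b"...",
    "text": "Markets rallied today on strong earnings.\n\nIn sports, the home team won.",
}
models = {"finance": ["markets", "earnings"], "sports": ["team", "sports"]}
print(index_event(stream, models))
```

Speaker identification and audio feature extraction, which the claims cover in detail, are omitted from this sketch.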
8 Claims
1. A method for automatically indexing and retrieving a multimedia event, comprising:
separating a multimedia data stream into audio, visual and text components;
segmenting the audio, visual and text components of the multimedia data stream based on semantic differences;
identifying at least one target speaker using the audio and visual components;
identifying semantic boundaries of text for at least one of the identified target speakers to generate semantically coherent text blocks;
generating a summary of multimedia content based on the audio, visual and text components, the semantically coherent text blocks and the identified target speaker;
deriving a topic for each of the semantically coherent text blocks based on a set of topic category models;
generating a multimedia description of the multimedia event based on the identified target speaker, the semantically coherent text blocks, the identified topic, and the generated summary; and
extracting audio features from the audio component of the multimedia data stream, the audio features being at least one of frame-level and clip-level features, wherein the frame-level features in three subbands are at least one of volume, zero-crossing rate, pitch period, frequency centroid, frequency bandwidth, and energy ratios.
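Several of the frame-level features named in this claim (volume, zero-crossing rate, frequency centroid, frequency bandwidth) have standard signal-processing definitions. The sketch below computes them for a single audio frame with NumPy; the frame length, sample rate, and normalization are assumptions, and the three-subband split and pitch-period estimation are omitted.

```python
# Hedged sketch of frame-level audio features named in the claim.
# Window size and normalization are assumptions, not from the patent.

import numpy as np

def frame_features(frame, sr):
    """Compute a few frame-level features for one audio frame."""
    volume = float(np.sqrt(np.mean(frame ** 2)))               # RMS energy
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)  # zero-crossing rate
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = power.sum() or 1.0
    centroid = float((freqs * power).sum() / total)            # frequency centroid
    bandwidth = float(np.sqrt(((freqs - centroid) ** 2 * power).sum() / total))
    return {"volume": volume, "zcr": zcr,
            "centroid": centroid, "bandwidth": bandwidth}

sr = 8000
t = np.arange(512) / sr
tone = np.sin(2 * np.pi * 1000 * t)   # 1 kHz test tone
print(frame_features(tone, sr))
```

For the pure 1 kHz tone, the centroid lands near 1000 Hz and the RMS volume near 0.707, as expected for a unit-amplitude sinusoid.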
2. A method for automatically indexing and retrieving a multimedia event, comprising:
separating a multimedia data stream into audio, visual and text components;
segmenting the audio, visual and text components of the multimedia data stream based on semantic differences;
identifying at least one target speaker using the audio and visual components;
identifying semantic boundaries of text for at least one of the identified target speakers to generate semantically coherent text blocks;
generating a summary of multimedia content based on the audio, visual and text components, the semantically coherent text blocks and the identified target speaker;
deriving a topic for each of the semantically coherent text blocks based on a set of topic category models;
generating a multimedia description of the multimedia event based on the identified target speaker, the semantically coherent text blocks, the identified topic, and the generated summary; and
extracting audio features from the audio component of the multimedia data stream, the audio features being at least one of frame-level and clip-level features, wherein clip-level features are classified as at least one of time-domain features and frequency-domain features. (Dependent claims 3 and 4 not shown.)
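Clip-level features are typically statistics of frame-level features accumulated over a clip. The sketch below illustrates one plausible split into the time-domain group (volume and zero-crossing statistics) and the frequency-domain group (spectral-centroid statistics); the specific statistics chosen (mean, standard deviation) are assumptions, not taken from the patent.

```python
# Sketch of clip-level features as aggregates of frame-level ones,
# grouped into the time-domain and frequency-domain classes the claim names.

import numpy as np

def clip_features(signal, sr, frame_len=256):
    """Aggregate frame-level features over non-overlapping frames of a clip."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    volumes = [np.sqrt(np.mean(f ** 2)) for f in frames]
    zcrs = [np.mean(np.abs(np.diff(np.sign(f)))) / 2 for f in frames]
    centroids = []
    for f in frames:
        power = np.abs(np.fft.rfft(f)) ** 2
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
        centroids.append((freqs * power).sum() / (power.sum() or 1.0))
    return {
        # time-domain clip features
        "mean_volume": float(np.mean(volumes)),
        "volume_std": float(np.std(volumes)),
        "mean_zcr": float(np.mean(zcrs)),
        # frequency-domain clip features
        "mean_centroid": float(np.mean(centroids)),
    }

sr = 8000
t = np.arange(sr) / sr                 # one second of audio
clip = np.sin(2 * np.pi * 500 * t)     # steady 500 Hz tone
print(clip_features(clip, sr))
```

A steady tone yields a near-zero volume standard deviation; speech or music would show much larger frame-to-frame variation, which is what makes such statistics useful for segmentation.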
5. A system that automatically indexes and retrieves a multimedia event, comprising:
a multimedia data stream separation unit that separates a multimedia data stream into audio, visual and text components;
a data stream component segmentation unit that segments the audio, visual and text components of the multimedia data stream based on semantic differences;
a target speaker detection unit that identifies at least one target speaker using the audio and visual components;
a content segmentation unit that identifies semantic boundaries of text, for at least one of the identified target speakers, to generate semantically coherent text blocks;
a summary generator that generates a summary of multimedia content based on the audio, visual and text components, the semantically coherent text blocks and the identified target speaker;
a topic categorization unit that derives a topic for each of the semantically coherent text blocks based on a set of topic category models;
a multimedia description generator that generates a multimedia description of the multimedia event based on the identified target speaker, the semantically coherent text blocks, the identified topic, and the generated summary; and
a feature extraction unit that extracts audio features from the audio component of the multimedia data stream, the audio features being at least one of frame-level and clip-level features, wherein the frame-level features in three subbands are at least one of volume, zero-crossing rate, pitch period, frequency centroid, frequency bandwidth, and energy ratios. (Dependent claims 6, 7, and 8 not shown.)
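The system claim mirrors the method claim as a composition of units. The toy sketch below shows how two of those units might be wired together as classes; only the separation and topic categorization units are stubbed, and all class, method, and keyword names are illustrative, not from the patent.

```python
# Illustrative composition of two of the claim's units as classes.
# Real units (speaker detection, summarization, etc.) are out of scope here.

class SeparationUnit:
    """Stand-in for the multimedia data stream separation unit."""
    def separate(self, stream):
        return stream["audio"], stream["visual"], stream["text"]

class TopicCategorizationUnit:
    """Stand-in for the topic categorization unit: keyword lookup."""
    def __init__(self, models):
        self.models = models
    def derive(self, block):
        for topic, keywords in self.models.items():
            if any(k in block.lower() for k in keywords):
                return topic
        return "unknown"

class IndexingSystem:
    """Wires the units together, as the system claim composes them."""
    def __init__(self, models):
        self.separator = SeparationUnit()
        self.topics = TopicCategorizationUnit(models)
    def describe(self, stream):
        _, _, text = self.separator.separate(stream)
        blocks = [b for b in text.split("\n\n") if b.strip()]
        return [self.topics.derive(b) for b in blocks]

system = IndexingSystem({"weather": ["rain", "forecast"]})
print(system.describe({"audio": b"", "visual": b"",
                       "text": "The forecast calls for rain.\n\nOther news."}))
```

Structuring the pipeline as separately replaceable units, as the claim does, lets each stage (speaker detection, summarization, topic models) be upgraded independently.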
Specification