System and method for automated multimedia content indexing and retrieval
Abstract
The invention provides a system and method for automatically indexing and retrieving multimedia content. The method may include separating a multimedia data stream into audio, visual and text components, segmenting the audio, visual and text components based on semantic differences, identifying at least one target speaker using the audio and visual components, identifying a topic of the multimedia event using the segmented text and topic category models, generating a summary of the multimedia event based on the audio, visual and text components, the identified topic and the identified target speaker, and generating a multimedia description of the multimedia event based on the identified target speaker, the identified topic, and the generated summary.
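The indexing pipeline summarized in the abstract can be sketched end to end. The sketch below is illustrative only: the multimedia stream is modeled as a dict, and the segmentation and topic functions are trivial stand-ins (blank-line splitting and keyword lookup) for the semantic segmentation and topic category models the patent describes.

```python
# Toy sketch of the pipeline in the abstract. All names are illustrative;
# a real system would operate on demultiplexed audio/video streams, not dicts.

def separate_components(stream):
    """Split a multimedia 'stream' (here a dict) into its three components."""
    return stream["audio"], stream["visual"], stream["text"]

def segment_text(text):
    """Stand-in for semantic segmentation: split on blank lines."""
    return [blk.strip() for blk in text.split("\n\n") if blk.strip()]

def classify_topic(block, topic_models):
    """Stand-in for topic category models: simple keyword lookup."""
    for topic, keywords in topic_models.items():
        if any(k in block.lower() for k in keywords):
            return topic
    return "unknown"

def index_event(stream, topic_models):
    """Run the (simplified) pipeline: separate, segment, categorize, summarize."""
    audio, visual, text = separate_components(stream)
    blocks = segment_text(text)
    topics = [classify_topic(b, topic_models) for b in blocks]
    summary = blocks[0][:60] if blocks else ""   # crude lead-sentence summary
    return {"topics": topics, "summary": summary}

stream = {
    "audio": b"...", "visual": b"...",
    "text": "Markets rallied today on strong earnings.\n\nIn sports, the home team won.",
}
models = {"finance": ["markets", "earnings"], "sports": ["team", "sports"]}
print(index_event(stream, models))
```

Speaker identification and audio feature extraction, which the claims cover in detail, are omitted from this sketch.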
8 Claims
1. A method for automatically indexing and retrieving a multimedia event, comprising:
separating a multimedia data stream into audio, visual and text components;
segmenting the audio, visual and text components of the multimedia data stream based on semantic differences;
identifying at least one target speaker using the audio and visual components;
identifying semantic boundaries of text for at least one of the identified target speakers to generate semantically coherent text blocks;
generating a summary of multimedia content based on the audio, visual and text components, the semantically coherent text blocks and the identified target speaker;
deriving a topic for each of the semantically coherent text blocks based on a set of topic category models;
generating a multimedia description of the multimedia event based on the identified target speaker, the semantically coherent text blocks, the identified topic, and the generated summary; and
extracting audio features from the audio component of the multimedia data stream, the audio features being at least one of frame-level and clip-level features, wherein the frame-level features in three subbands are at least one of volume, zero-crossing rate, pitch period, frequency centroid, frequency bandwidth, and energy ratios.
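Several of the frame-level features named in this claim (volume, zero-crossing rate, frequency centroid, frequency bandwidth) have standard signal-processing definitions. The sketch below computes them for a single audio frame with NumPy; the frame length, sample rate, and normalization are assumptions, and the three-subband split and pitch-period estimation are omitted.

```python
# Hedged sketch of frame-level audio features named in the claim.
# Window size and normalization are assumptions, not from the patent.

import numpy as np

def frame_features(frame, sr):
    """Compute a few frame-level features for one audio frame."""
    volume = float(np.sqrt(np.mean(frame ** 2)))               # RMS energy
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)  # zero-crossing rate
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = power.sum() or 1.0
    centroid = float((freqs * power).sum() / total)            # frequency centroid
    bandwidth = float(np.sqrt(((freqs - centroid) ** 2 * power).sum() / total))
    return {"volume": volume, "zcr": zcr,
            "centroid": centroid, "bandwidth": bandwidth}

sr = 8000
t = np.arange(512) / sr
tone = np.sin(2 * np.pi * 1000 * t)   # 1 kHz test tone
print(frame_features(tone, sr))
```

For the pure 1 kHz tone, the centroid lands near 1000 Hz and the RMS volume near 0.707, as expected for a unit-amplitude sinusoid.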
2. A method for automatically indexing and retrieving a multimedia event, comprising:
separating a multimedia data stream into audio, visual and text components;
segmenting the audio, visual and text components of the multimedia data stream based on semantic differences;
identifying at least one target speaker using the audio and visual components;
identifying semantic boundaries of text for at least one of the identified target speakers to generate semantically coherent text blocks;
generating a summary of multimedia content based on the audio, visual and text components, the semantically coherent text blocks and the identified target speaker;
deriving a topic for each of the semantically coherent text blocks based on a set of topic category models;
generating a multimedia description of the multimedia event based on the identified target speaker, the semantically coherent text blocks, the identified topic, and the generated summary; and
extracting audio features from the audio component of the multimedia data stream, the audio features being at least one of frame-level and clip-level features, wherein clip-level features are classified as at least one of time-domain features and frequency-domain features. (Dependent claims 3 and 4 not shown.)
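Clip-level features are typically statistics of frame-level features accumulated over a clip. The sketch below illustrates one plausible split into the time-domain group (volume and zero-crossing statistics) and the frequency-domain group (spectral-centroid statistics); the specific statistics chosen (mean, standard deviation) are assumptions, not taken from the patent.

```python
# Sketch of clip-level features as aggregates of frame-level ones,
# grouped into the time-domain and frequency-domain classes the claim names.

import numpy as np

def clip_features(signal, sr, frame_len=256):
    """Aggregate frame-level features over non-overlapping frames of a clip."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    volumes = [np.sqrt(np.mean(f ** 2)) for f in frames]
    zcrs = [np.mean(np.abs(np.diff(np.sign(f)))) / 2 for f in frames]
    centroids = []
    for f in frames:
        power = np.abs(np.fft.rfft(f)) ** 2
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
        centroids.append((freqs * power).sum() / (power.sum() or 1.0))
    return {
        # time-domain clip features
        "mean_volume": float(np.mean(volumes)),
        "volume_std": float(np.std(volumes)),
        "mean_zcr": float(np.mean(zcrs)),
        # frequency-domain clip features
        "mean_centroid": float(np.mean(centroids)),
    }

sr = 8000
t = np.arange(sr) / sr                 # one second of audio
clip = np.sin(2 * np.pi * 500 * t)     # steady 500 Hz tone
print(clip_features(clip, sr))
```

A steady tone yields a near-zero volume standard deviation; speech or music would show much larger frame-to-frame variation, which is what makes such statistics useful for segmentation.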
5. A system that automatically indexes and retrieves a multimedia event, comprising:
a multimedia data stream separation unit that separates a multimedia data stream into audio, visual and text components;
a data stream component segmentation unit that segments the audio, visual and text components of the multimedia data stream based on semantic differences;
a target speaker detection unit that identifies at least one target speaker using the audio and visual components;
a content segmentation unit that identifies semantic boundaries of text, for at least one of the identified target speakers, to generate semantically coherent text blocks;
a summary generator that generates a summary of multimedia content based on the audio, visual and text components, the semantically coherent text blocks and the identified target speaker;
a topic categorization unit that derives a topic for each of the semantically coherent text blocks based on a set of topic category models;
a multimedia description generator that generates a multimedia description of the multimedia event based on the identified target speaker, the semantically coherent text blocks, the identified topic, and the generated summary; and
a feature extraction unit that extracts audio features from the audio component of the multimedia data stream, the audio features being at least one of frame-level and clip-level features, wherein the frame-level features in three subbands are at least one of volume, zero-crossing rate, pitch period, frequency centroid, frequency bandwidth, and energy ratios. (Dependent claims 6, 7, and 8 not shown.)
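The system claim mirrors the method claim as a composition of units. The toy sketch below shows how two of those units might be wired together as classes; only the separation and topic categorization units are stubbed, and all class, method, and keyword names are illustrative, not from the patent.

```python
# Illustrative composition of two of the claim's units as classes.
# Real units (speaker detection, summarization, etc.) are out of scope here.

class SeparationUnit:
    """Stand-in for the multimedia data stream separation unit."""
    def separate(self, stream):
        return stream["audio"], stream["visual"], stream["text"]

class TopicCategorizationUnit:
    """Stand-in for the topic categorization unit: keyword lookup."""
    def __init__(self, models):
        self.models = models
    def derive(self, block):
        for topic, keywords in self.models.items():
            if any(k in block.lower() for k in keywords):
                return topic
        return "unknown"

class IndexingSystem:
    """Wires the units together, as the system claim composes them."""
    def __init__(self, models):
        self.separator = SeparationUnit()
        self.topics = TopicCategorizationUnit(models)
    def describe(self, stream):
        _, _, text = self.separator.separate(stream)
        blocks = [b for b in text.split("\n\n") if b.strip()]
        return [self.topics.derive(b) for b in blocks]

system = IndexingSystem({"weather": ["rain", "forecast"]})
print(system.describe({"audio": b"", "visual": b"",
                       "text": "The forecast calls for rain.\n\nOther news."}))
```

Structuring the pipeline as separately replaceable units, as the claim does, lets each stage (speaker detection, summarization, topic models) be upgraded independently.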
Specification