Multimedia search apparatus and method for searching multimedia content using speaker detection by audio data

US 6,405,166 B1
Filed: 10/15/2001
Issued: 06/11/2002
Est. Priority Date: 08/13/1998
Status: Expired due to Term

Abstract

A multimedia search apparatus and method for searching multimedia content using speaker detection to segment the multimedia content. The multimedia search apparatus receives a search request from a user device. The search request identifies the target speaker for which the search is to be conducted. Based on the search request, the multimedia search apparatus retrieves multimedia content from a multimedia database. The multimedia search apparatus retrieves models, such as Gaussian Mixture Models (GMMs), from a model storage device, corresponding to the target speaker and background data. Based on the retrieved models, the multimedia search device searches the multimedia data of the multimedia content and segments the multimedia data. The segments are identified by calculating an average normalized score for a block of frames of the multimedia data and determining if the average normalized score for the block of frames exceeds one or more predetermined thresholds.
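
The block-level scoring described in the abstract can be illustrated with a minimal sketch. The abstract does not give a formula for the normalized score, so the version below assumes a common choice: the per-frame log-likelihood under the target speaker GMM minus the log-likelihood under a single background GMM, averaged over the block. The function names, the diagonal-covariance model layout, and the default threshold of 0.0 are all hypothetical.

```python
import math


def gmm_log_likelihood(frame, gmm):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM.

    gmm is a list of (weight, means, variances) tuples, one per mixture
    component. This layout is an assumption made for the sketch.
    """
    terms = []
    for weight, means, variances in gmm:
        log_p = math.log(weight)
        for x, mu, var in zip(frame, means, variances):
            log_p += -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
        terms.append(log_p)
    # log-sum-exp over components for numerical stability
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def average_normalized_score(block, target_gmm, background_gmm):
    """Mean over the block of log p(x | target) - log p(x | background).

    The log-ratio normalization is assumed, not taken from the patent text.
    """
    scores = [
        gmm_log_likelihood(frame, target_gmm) - gmm_log_likelihood(frame, background_gmm)
        for frame in block
    ]
    return sum(scores) / len(scores)


def block_exceeds_threshold(block, target_gmm, background_gmm, threshold=0.0):
    """True if the block's average normalized score is above the threshold."""
    return average_normalized_score(block, target_gmm, background_gmm) > threshold
```

With several background models, one plausible variant would normalize against the best-scoring background model per frame; the source does not say which combination is used.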

34 Claims

1. A method of segmenting multimedia data using audio information, comprising:
- receiving a search request identifying at least one target speaker;
- retrieving at least one model for the at least one target speaker; and
- segmenting the multimedia data into one or more target speaker segments and background segments based on feature vectors of the multimedia data and the at least one model for the at least one target speaker, wherein the step of segmenting comprises:
  - reading a first block of frames of the multimedia data;
  - determining a score for the first block of frames based on the at least one model for the at least one target speaker; and
  - determining if the score for the first block of frames is above or below a first threshold.
2. The method of claim 1, further comprising:
3. The method of claim 1, further comprising:
- identifying a tentative start point of a target speaker segment if the score for the first block of frames is above the first threshold; and
- identifying a tentative end point of a target speaker segment if the score for the first block of frames is below the first threshold.
4. The method of claim 3, further comprising:
- reading a second block of frames of the audio data;
- determining a score for the second block of frames based on the model for the target speaker;
- verifying the tentative start point of the target speaker segment if the score for the second block of frames is above a second threshold; and
- verifying the tentative end point of the target speaker segment if the score for the second block of frames is below a third threshold.
5. The method of claim 1, wherein the score is a normalized score.
6. The method of claim 5, wherein the normalized score is calculated based on the model for the target speaker and one or more background data models.
7. The method of claim 1, wherein the score is an averaged normalized score for the first block of frames.
8. The method of claim 1, further comprising:
- sending at least one of (a) at least a portion of the target speaker segments and (b) at least a portion of the background segments to a user device from which the search request was received to enable the user device to reproduce a multimedia presentation incorporating the at least one of (a) the at least a portion of target speaker segments and (b) the at least a portion of the background segments.
9. The method of claim 8, wherein the user device is one of a computer, a wired telephone, a wireless telephone, a Web TV™ terminal, and a Personal Digital Assistant.
10. The method of claim 1, wherein the at least one model for the at least one target speaker is a Gaussian Mixture Model.
11. The method of claim 1, wherein the at least one model for the at least one target speaker is a vector quantization codebook model.
12. The method of claim 1, wherein the at least one model for the at least one target speaker is a hidden Markov model.
13. The method of claim 1, further comprising retrieving at least one model for background, wherein the step of segmenting includes segmenting the multimedia data into the one or more target speaker segments and the background segments based on the at least one model for the background.
14. The method of claim 13, wherein the at least one model for the background is a Gaussian Mixture Model.
15. The method of claim 13, wherein the at least one model for the background is a vector quantization codebook model.
16. The method of claim 13, wherein the at least one model for the background is a hidden Markov model.
17. A user device that receives at least one of (a) at least a portion of the target speaker segments and (b) at least a portion of the background segments that are segmented by the method of claim 1 and reproduces a multimedia presentation incorporating the at least one of (a) the at least a portion of the target speaker segments and (b) the at least a portion of the background segments.
18. The user device of claim 17, wherein the user device is one of a computer, a wired telephone, a wireless telephone, a WebTV™ terminal, and a Personal Digital Assistant.

19. An apparatus that identifies segments of multimedia data for retrieval, comprising:
- a controller;
- a network interface; and
- a memory, wherein the controller receives a search request via the network interface identifying at least one target speaker, retrieves at least one model for the at least one target speaker from the memory, and segments the multimedia data into one or more target speaker segments and background segments based on feature vectors of the multimedia data and the at least one model for the at least one target speaker;
- wherein the controller segments the multimedia data by reading a first block of frames of the multimedia data, determining a score for the first block of frames based on the at least one model for the at least one target speaker, and determining if the score is above or below a first threshold.
  - 20. The apparatus of claim 19, wherein the controller identifies the first block of frames as part of a target speaker segment if the score is above the predetermined threshold and identifies the first block of frames as part of a background segment if the score is below the predetermined threshold.
  - 21. The apparatus of claim 19, wherein the controller identifies a tentative start point of a target speaker segment if the score is above the first threshold and identifies a tentative end point of a target speaker segment if the score is below the first threshold.
  - 22. The apparatus of claim 21, wherein the controller reads a second block of frames of the audio data, determines a score for the second block of frames based on the model for the target speaker, verifies the tentative start point of the target speaker segment if the score for the second block of frames is above a second threshold, and verifies the tentative end point of the target speaker segment if the score for the second block of frames is below a third threshold.
  - 23. The apparatus of claim 19, wherein the score is a normalized score.
  - 24. The apparatus of claim 23, wherein the normalized score is calculated based on the model for the target speaker and one or more background data models.
  - 25. The apparatus of claim 19, wherein the score is an averaged normalized score for the first block of frames.
  - 26. The apparatus of claim 19, wherein the controller sends at least one of (a) at least a portion of the target speaker segments and (b) at least a portion of the background segments to a user device from which the search request was received to enable the user device to reproduce a multimedia presentation incorporating the at least one of (a) the at least a portion of target speaker segments and (b) the at least a portion of background segments.
  - 27. The apparatus of claim 26, wherein the user device is one of a computer, a wired telephone, a wireless telephone, a Web TV™ terminal, and a Personal Digital Assistant.
  - 28. The apparatus of claim 19, wherein the at least one model for the at least one target speaker is a Gaussian Mixture Model.
  - 29. The apparatus of claim 19, wherein the at least one model for the at least one target speaker is a vector quantization codebook model.
  - 30. The apparatus of claim 19, wherein the at least one model for the at least one target speaker is a hidden Markov model.
  - 31. The apparatus of claim 19, wherein the controller retrieves at least one model for background and segments the multimedia data into the one or more target speaker segments and the background segments based on the at least one model for the background.
  - 32. The apparatus of claim 31, wherein the at least one model for the background is a Gaussian Mixture Model.
  - 33. The apparatus of claim 31, wherein the at least one model for the background is a vector quantization codebook model.
  - 34. The apparatus of claim 31, wherein the at least one model for the background is a hidden Markov model.
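
The tentative start and end point logic of claims 3, 4, 21, and 22 can be sketched as a small state machine over per-block scores. This is only an illustration under stated assumptions: the "second block" is taken to be the block immediately following the first, an unverified tentative point is simply discarded, and reaching the end of the data closes an open segment. The function name `segment_blocks` and the threshold values in the usage note are hypothetical.

```python
def segment_blocks(scores, first, second, third):
    """Return (start, end) block-index pairs of target speaker segments.

    scores: one score per block of frames, in time order.
    first:  threshold that proposes a tentative start or end point.
    second: threshold the next block must exceed to verify a start.
    third:  threshold the next block must fall below to verify an end.
    The end index is exclusive. Blocks outside the returned segments are
    treated as background.
    """
    segments = []
    in_target = False
    start = None
    n = len(scores)
    for i in range(n):
        following = scores[i + 1] if i + 1 < n else None
        if not in_target and scores[i] > first:
            # Tentative start at block i; verify with the following block.
            if following is not None and following > second:
                in_target, start = True, i
        elif in_target and scores[i] < first:
            # Tentative end at block i; verify with the following block.
            # Assumption: running out of data also confirms the end.
            if following is None or following < third:
                segments.append((start, i))
                in_target = False
    if in_target:
        segments.append((start, n))
    return segments
```

For example, `segment_blocks([-1, -1, 2, 3, 2, -0.5, 2, 2, -2, -2, -1], first=0.0, second=0.5, third=-0.5)` yields a single segment covering blocks 2 through 7: the dip at block 5 proposes an end point, but the following block does not fall below the third threshold, so the segment continues.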

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: AT&T Corporation (AT&T, Inc.)
Inventors: Parthasarathy, Sarangarajan; Magrin-Chagnolleau, Ivan; Huang, Qian; Rosenberg, Aaron Edward
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Nolan, Daniel A.
Application Number: US09/976,023
Publication Number: US 20020029144A1
Time in Patent Office: 239 Days
Field of Search: 704/231, 704/236, 704/239, 704/243-247
US Class Current: 704/246
CPC Class Codes: G10L 17/00 Speaker identification or v...
