Multimedia search apparatus and method for searching multimedia content using speaker detection by audio data
Abstract
A multimedia search apparatus and method for searching multimedia content using speaker detection to segment the multimedia content. The multimedia search apparatus receives a search request from a user device. The search request identifies the target speaker for which the search is to be conducted. Based on the search request, the multimedia search apparatus retrieves multimedia content from a multimedia database. The multimedia search apparatus retrieves models, such as Gaussian Mixture Models (GMMs), from a model storage device, corresponding to the target speaker and background data. Based on the retrieved models, the multimedia search device searches the multimedia data of the multimedia content and segments the multimedia data. The segments are identified by calculating an average normalized score for a block of frames of the multimedia data and determining if the average normalized score for the block of frames exceeds one or more predetermined thresholds.
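As an illustration of the block scoring described in the abstract, the following Python sketch computes an average normalized score for one block of frames. It assumes diagonal-covariance GMMs and takes the normalized score to be the log-likelihood ratio between the target speaker model and a single background model. Neither choice is stated in the text above, and the names gmm_log_likelihood and average_normalized_score are hypothetical.

```python
import math


def gmm_log_likelihood(frame, gmm):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM.

    gmm is a list of (weight, means, variances) tuples, one per mixture
    component. This representation is an assumption made for the sketch.
    """
    component_logs = []
    for weight, means, variances in gmm:
        log_density = 0.0
        for x, mu, var in zip(frame, means, variances):
            log_density += -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
        component_logs.append(math.log(weight) + log_density)
    # Log-sum-exp over the components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def average_normalized_score(block, target_gmm, background_gmm):
    """Mean over the block of log p(x | target) - log p(x | background).

    Using a log-likelihood ratio as the normalized score is an assumption.
    """
    ratios = [
        gmm_log_likelihood(frame, target_gmm) - gmm_log_likelihood(frame, background_gmm)
        for frame in block
    ]
    return sum(ratios) / len(ratios)


# Toy single-component models with two-dimensional feature vectors.
target = [(1.0, [0.0, 0.0], [1.0, 1.0])]
background = [(1.0, [3.0, 3.0], [1.0, 1.0])]
block = [[0.0, 0.0], [0.0, 0.0]]
score = average_normalized_score(block, target, background)
```

Under these assumptions a positive score means the block fits the target speaker model better than the background model; the predetermined thresholds would then be applied to this value.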
34 Claims
1. A method of segmenting multimedia data using audio information, comprising:
receiving a search request identifying at least one target speaker;
retrieving at least one model for the at least one target speaker; and
segmenting the multimedia data into one or more target speaker segments and background segments based on feature vectors of the multimedia data and the at least one model for the at least one target speaker, wherein the step of segmenting comprises:
reading a first block of frames of the multimedia data;
determining a score for the first block of frames based on the at least one model for the at least one target speaker; and
determining if the score for the first block of frames is above or below a first threshold.
2. The method of claim 1, further comprising:
identifying the first block of frames as part of a target speaker segment if the score for the block of frames is above the predetermined threshold; and
identifying the first block of frames as part of a background segment if the score for the block of frames is below the predetermined threshold.
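The block labeling recited above can be sketched in Python as follows. This is a minimal illustration, assuming one score per block has already been computed. The function names are hypothetical, and treating a score exactly equal to the threshold as background is an assumption, since the text only covers scores above or below it.

```python
def label_blocks(block_scores, threshold):
    """Label each block as target or background by comparing its score to the threshold."""
    # Assumption: a score equal to the threshold is treated as background.
    return ["target" if score > threshold else "background" for score in block_scores]


def merge_into_segments(labels):
    """Merge runs of identically labeled blocks into (label, first_block, last_block)."""
    segments = []
    for index, label in enumerate(labels):
        if segments and segments[-1][0] == label:
            # Extend the current segment to include this block.
            segments[-1] = (label, segments[-1][1], index)
        else:
            segments.append((label, index, index))
    return segments


labels = label_blocks([0.2, 1.5, 1.7, -0.3], threshold=1.0)
segments = merge_into_segments(labels)
```

Merging consecutive blocks with the same label is one simple way to turn per-block decisions into target speaker segments and background segments.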
3. The method of claim 1, further comprising:
identifying a tentative start point of a target speaker segment if the score for the first block of frames is above the first threshold; and
identifying a tentative end point of a target speaker segment if the score for the first block of frames is below the first threshold.
4. The method of claim 3, further comprising:
reading a second block of frames of the audio data;
determining a score for the second block of frames based on the model for the target speaker;
verifying the tentative start point of the target speaker segment if the score for the second block of frames is above a second threshold; and
verifying the tentative end point of the target speaker segment if the score for the second block of frames is below a third threshold.
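The tentative and verified boundary logic of claims 3 and 4 can be sketched in Python as follows. The sketch assumes that the second block is simply the block immediately after the first, and that segments are reported as half-open block ranges; both are assumptions, and the function name find_target_segments is hypothetical.

```python
def find_target_segments(scores, first, second, third):
    """Return verified target speaker segments as (start_block, end_block) pairs.

    scores holds one score per block. A tentative start (score above `first`)
    is verified when the following block scores above `second`; a tentative
    end (score below `first`) is verified when the following block scores
    below `third`. Ranges are half-open: end_block is not part of the segment.
    """
    segments = []
    start = None  # verified start block, or None while in background
    # The last block has no following block, so it is never a candidate here.
    for i in range(len(scores) - 1):
        current, following = scores[i], scores[i + 1]
        if start is None:
            if current > first and following > second:
                start = i  # tentative start verified
        elif current < first and following < third:
            segments.append((start, i))  # tentative end verified
            start = None
    if start is not None:
        # The data ended while still inside a target speaker segment.
        segments.append((start, len(scores)))
    return segments


scores = [0.1, 1.2, 0.2, 1.4, 1.6, 1.5, 0.8, 1.3, 0.4, 0.1, 0.0]
found = find_target_segments(scores, first=1.0, second=1.2, third=0.5)
```

In this toy input the isolated high score at block 1 and the isolated dip at block 6 are both rejected by the verification step, so a single segment is reported.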
5. The method of claim 1, wherein the score is a normalized score.
6. The method of claim 5, wherein the normalized score is calculated based on the model for the target speaker and one or more background data models.
7. The method of claim 1, wherein the score is an averaged normalized score for the first block of frames.
8. The method of claim 1, further comprising:
sending at least one of (a) at least a portion of the target speaker segments and (b) at least a portion of the background segments to a user device from which the search request was received to enable the user device to reproduce a multimedia presentation incorporating the at least one of (a) the at least a portion of target speaker segments and (b) the at least a portion of the background segments.
9. The method of claim 8, wherein the user device is one of a computer, a wired telephone, a wireless telephone, a Web TV™ terminal, and a Personal Digital Assistant.
10. The method of claim 1, wherein the at least one model for the at least one target speaker is a Gaussian Mixture Model.
11. The method of claim 1, wherein the at least one model for the at least one target speaker is a vector quantization codebook model.
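For the vector quantization codebook variant, a block score could be derived from quantization distortion. The Python sketch below is one possible illustration and is not taken from the text: it assumes squared Euclidean distortion and normalizes by subtracting the target distortion from the distortion under a background codebook. The function names are hypothetical.

```python
def min_distortion(frame, codebook):
    """Squared Euclidean distance from a feature vector to its nearest codeword."""
    return min(
        sum((x - c) ** 2 for x, c in zip(frame, codeword))
        for codeword in codebook
    )


def vq_block_score(block, target_codebook, background_codebook):
    """Average over the block of (background distortion - target distortion).

    Assumption: a higher value means the block is closer to the target
    speaker codebook than to the background codebook.
    """
    total = 0.0
    for frame in block:
        total += min_distortion(frame, background_codebook) - min_distortion(frame, target_codebook)
    return total / len(block)


target_codebook = [[0.0, 0.0], [1.0, 1.0]]
background_codebook = [[5.0, 5.0]]
vq_score = vq_block_score([[0.0, 0.0], [1.0, 1.0]], target_codebook, background_codebook)
```

The resulting value can be compared against a threshold in the same way as a score derived from any other model type.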
12. The method of claim 1, wherein the at least one model for the at least one target speaker is a hidden Markov model.
13. The method of claim 1, further comprising retrieving at least one model for background, wherein the step of segmenting includes segmenting the multimedia data into the one or more target speaker segments and the background segments based on the at least one model for the background.
14. The method of claim 13, wherein the at least one model for the background is a Gaussian Mixture Model.
15. The method of claim 13, wherein the at least one model for the background is a vector quantization codebook model.
16. The method of claim 13, wherein the at least one model for the background is a hidden Markov model.
17. A user device that receives at least one of (a) at least a portion of the target speaker segments and (b) at least a portion of the background segments that are segmented by the method of claim 1 and reproduces a multimedia presentation incorporating the at least one of (a) the at least a portion of the target speaker segments and (b) the at least a portion of the background segments.
18. The user device of claim 17, wherein the user device is one of a computer, a wired telephone, a wireless telephone, a WebTV™ terminal, and a Personal Digital Assistant.
19. An apparatus that identifies segments of multimedia data for retrieval, comprising:
a controller;
a network interface; and
a memory, wherein the controller receives a search request via the network interface identifying at least one target speaker, retrieves at least one model for the at least one target speaker from the memory, and segments the multimedia data into one or more target speaker segments and background segments based on feature vectors of the multimedia data and the at least one model for the at least one target speaker;
wherein the controller segments the multimedia data by reading a first block of frames of the multimedia data, determining a score for the first block of frames based on the at least one model for the at least one target speaker, and determining if the score is above or below a first threshold.
Specification