Method for detecting scene boundaries in genre independent videos
First Claim
1. A computer implemented method for detecting scene boundaries in videos, comprising the steps of:
- extracting audio features from an audio signal of the videos, wherein the audio features are Mel-frequency cepstral coefficients (MFCCs), and the audio signals are classified into semantic classes, and wherein each feature vector includes variables x1, x2, x3 indicating a number of the audio class labels within a time window of duration [t−WL, t], where WL is about fourteen seconds, and variables x4, x5, x6 indicating a number of the audio classes in a window of duration, and variables x7, x8, x9 indicating a number of audio classes within a window [t, t+WL], and variables x10, x11 are a Bhattacharyya shape and a Mahalanobis distance between the MFCC coefficients for the window [t−WL, t] and the window [t, t+WL], respectively, and variable x12 is twice an average number of shot boundaries present in the video within a window [t−WL, t+WL];
- extracting visual features from frames of the videos;
- combining the audio and visual features into the feature vectors;
- extracting the feature vectors from videos of different genres;
- classifying the feature vectors as scene boundaries using a support vector machine, in which the support vector machine is trained to be independent of the different genres of the videos; and
- segmenting the videos according to the scene boundaries, wherein the steps are performed in a computer.
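The distance features x10 and x11 above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the standard Gaussian Mahalanobis and Bhattacharyya-shape formulas with the diagonal covariances the claim mentions, and the function name and `eps` regularizer are the author's own.

```python
import numpy as np

def window_distances(mfcc_left, mfcc_right, eps=1e-8):
    """Distances between MFCC frames in [t-WL, t] and [t, t+WL].

    Each input is an (n_frames, n_coeffs) array. Diagonal covariances
    are used, as in the claim. The scaling constants follow the usual
    Gaussian-distance definitions and are an assumption here.
    """
    mu_i, mu_j = mfcc_left.mean(axis=0), mfcc_right.mean(axis=0)
    ci = mfcc_left.var(axis=0) + eps   # diagonal of C_i
    cj = mfcc_right.var(axis=0) + eps  # diagonal of C_j
    avg = (ci + cj) / 2.0
    # Mahalanobis distance between the two window means (feature x11)
    diff = mu_i - mu_j
    mahalanobis = float(diff @ (diff / avg))
    # Bhattacharyya "shape" term, comparing only covariances (feature x10)
    shape = 0.5 * float(np.sum(np.log(avg) - 0.5 * (np.log(ci) + np.log(cj))))
    return shape, mahalanobis
```

Identical windows yield both distances near zero; shifting one window's mean moves only the Mahalanobis term, while changing its spread moves the shape term.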
Abstract
A computer implemented method detects scene boundaries in videos by first extracting feature vectors from videos of different genres. The feature vectors are then classified as scene boundaries using a support vector machine. The support vector machine is trained to be independent of the different genres of the videos.
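The classification stage can be illustrated with a small self-contained sketch. The patent specifies a support vector machine trained on feature vectors pooled across genres; the stand-in below uses a plain linear SVM trained by Pegasos-style sub-gradient descent (no ML library), which is an assumption for illustration only, as are the function name and hyperparameters.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Toy linear SVM trained by Pegasos sub-gradient descent.

    X: (n_samples, n_features) feature vectors pooled from videos of
    several genres, so the model depends on no single genre.
    y: labels in {-1, +1} (+1 = scene boundary).
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)        # decaying step size
            margin = y[i] * (w @ X[i])
            w *= (1 - eta * lam)         # shrink (regularization)
            if margin < 1:               # hinge-loss sub-gradient step
                w += eta * y[i] * X[i]
    return w
```

A new candidate time t is then labeled a scene boundary when `sign(w @ x)` is positive for its feature vector `x`.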
16 Citations
10 Claims
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
and the Mahalanobis distance is where covariance matrices Ci and Cj, and means μi and μj, represent the diagonal covariances and means of the MFCC vectors before and after the time t.
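The equation the claim refers to did not survive extraction. For orientation only, the standard Mahalanobis and Bhattacharyya-shape distances between two Gaussians with means μi, μj and covariances Ci, Cj are usually written as below; these are the textbook forms, assumed here rather than quoted from the patent:

```latex
D_{\mathrm{Mahal}} = (\mu_i - \mu_j)^\top \left(\tfrac{C_i + C_j}{2}\right)^{-1} (\mu_i - \mu_j),
\qquad
D_{\mathrm{shape}} = \tfrac{1}{2}\,\ln \frac{\left|\tfrac{C_i + C_j}{2}\right|}{\sqrt{|C_i|\,|C_j|}}
```

With the diagonal covariances named in the claim, both determinants and the matrix inverse reduce to elementwise products and divisions.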
-
9. The method of claim 1, further comprising:
transforming the feature vectors to a higher dimensional feature space using a kernel function.
-
10. The method of claim 9, in which the kernel function is a radial basis kernel.
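Claims 9 and 10 describe the kernel trick: the SVM compares feature vectors via inner products in a higher-dimensional space without constructing that space. A minimal sketch of the radial basis kernel follows; the function name and the `gamma` value are illustrative assumptions, since the claims do not fix them.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.1):
    """Radial basis function kernel k(x, z) = exp(-gamma * ||x - z||^2).

    Equals 1 when x == z and decays toward 0 as the feature vectors
    move apart; gamma is a free hyperparameter.
    """
    d = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * (d @ d)))
```

In a kernelized SVM, this function replaces every inner product between feature vectors during both training and classification.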
Specification