Method for automatic analysis of audio including music and speech
Abstract
A method for determining points of change or novelty in an audio signal measures the self-similarity of components of the audio signal. For each time window in the audio signal, a formula is used to determine a vector parameterization value. The self-similarity as well as the cross-similarity between each of the parameterization values is then determined for all past and future window regions. A significant point of novelty or change will have high self-similarity in the past and future, and low cross-similarity. The extent of the time difference between “past” and “future” can be varied to change the scale of the system so that, for example, individual musical notes can be found using a short time extent, while longer events, such as musical themes or changes of speaker, can be identified by considering windows further into the past or future. The result is a measure of the degree of change, or how novel the source audio is at any time. The method can be used in a wide variety of applications, including segmenting or indexing for classification and retrieval, beat tracking, and summarization of speech or music.
351 Citations
36 Claims
-
1. A method for identifying novelty points in a source audio signal comprising the steps of:
-
sampling the audio signal and dividing the audio signal into windowed portions with a plurality of samples taken from within each of the windowed portions;
parameterizing the windowed portions of the audio signal by applying a first function to each windowed portion to form a vector parameter for each window; and
embedding the parameters by applying a second function which provides a measurement of similarity between the parameters.
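The three steps of claim 1 can be sketched as follows. This is only an illustration under stated assumptions: the claim does not fix the first or second function, so the sketch assumes Hann-tapered DFT magnitudes as the vector parameter and a normalized dot product as the similarity measure. The names frame_signal, parameterize, cosine and similarity_matrix, and the toy two-tone signal, are hypothetical.

```python
import math

def frame_signal(samples, width, hop):
    """Divide the sampled signal into windows of `width` samples, `hop` apart."""
    return [samples[s:s + width] for s in range(0, len(samples) - width + 1, hop)]

def parameterize(window):
    """Assumed first function: magnitudes of a Hann-tapered DFT of one window."""
    n = len(window)
    tapered = [x * 0.5 * (1 - math.cos(2 * math.pi * k / n))
               for k, x in enumerate(window)]
    mags = []
    for b in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * b * k / n) for k, x in enumerate(tapered))
        im = -sum(x * math.sin(2 * math.pi * b * k / n) for k, x in enumerate(tapered))
        mags.append(math.hypot(re, im))
    return mags

def cosine(u, v):
    """Assumed second function: normalized dot product of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(vectors):
    """Embed the parameters: S[i][j] is the similarity of windows i and j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Toy input (hypothetical): a 440 Hz tone followed by an 880 Hz tone at 8 kHz.
rate = 8000
signal = ([math.sin(2 * math.pi * 440 * t / rate) for t in range(1024)]
          + [math.sin(2 * math.pi * 880 * t / rate) for t in range(1024)])
frames = frame_signal(signal, 256, 256)
vectors = [parameterize(f) for f in frames]
S = similarity_matrix(vectors)
```

In this toy case, windows taken from the same tone score close to 1 against each other, while windows from different tones score near 0, which is the structure the later claims exploit.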
-
-
6. The method of claim 1, wherein the second function used in the embedding step comprises a dot product:
-
7. The method of claim 1, wherein the second function used in the embedding step comprises a normalized dot product:
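The formulas referred to in claims 6 and 7 are not reproduced in this text. As a minimal sketch, assuming the standard definitions of a dot product and of a dot product divided by the product of the two vector magnitudes, the two candidate second functions might look like this (the function names are hypothetical):

```python
import math

def dot_similarity(u, v):
    """Plain dot product: grows with both vector alignment and vector energy."""
    return sum(a * b for a, b in zip(u, v))

def normalized_dot_similarity(u, v):
    """Dot product divided by both magnitudes (the cosine of the angle).

    Returns 0.0 when either vector is all zeros, which is an assumption
    made here to avoid division by zero.
    """
    norm = math.sqrt(dot_similarity(u, u)) * math.sqrt(dot_similarity(v, v))
    return dot_similarity(u, v) / norm if norm else 0.0
```

The normalized form is insensitive to overall loudness, so a quiet and a loud window with the same spectral shape still compare as similar.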
-
8. The method of claim 1, wherein the embedded parameters are provided in a form of a matrix S(i,j), wherein i identifies rows of the matrix and j identifies columns of the matrix, the method further comprising the step of:
identifying a slanted domain matrix L(i,l) from the matrix S(i,j), wherein l is a lag value l = i − j.
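One way to read claim 8 is that each entry of the slanted matrix is looked up as L(i,l) = S(i, i − l), since l = i − j implies j = i − l. The sketch below assumes that reading, keeps only non-negative lags, and fills positions that fall outside S with a hypothetical fill value.

```python
def to_lag_domain(S, fill=0.0):
    """Re-index a square matrix S(i, j) as L(i, l) with lag l = i - j.

    Only lags 0 .. n-1 are kept (an assumption; for a symmetric S the
    negative lags carry the same information). Entries whose column
    i - l would be negative are set to `fill`.
    """
    n = len(S)
    return [[S[i][i - l] if i - l >= 0 else fill for l in range(n)]
            for i in range(n)]
```

In this layout, column 0 of L holds the main diagonal of S, and each further column holds one sub-diagonal, so structure that runs parallel to the diagonal in S becomes vertical in L.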
-
9. The method of claim 1, wherein the embedded parameters are provided in a form of a matrix S, the method further comprising the step of:
correlating the matrix S with a matrix kernel C to determine a novelty score.
-
10. The method of claim 9, wherein the matrix kernel C comprises a 2×2 checkerboard kernel defined as follows:
-
11. The method of claim 9, wherein the matrix kernel C comprises a coherence kernel defined as follows:
-
12. The method of claim 9, wherein the matrix kernel C comprises an anti-coherence kernel defined as follows:
-
13. The method of claim 9, wherein the matrix kernel C comprises a checkerboard kernel including four quadrants with ones (1s) in two opposing quadrants and negative ones (−1s) in two opposing quadrants.
-
14. The method of claim 13, wherein the matrix kernel C comprises the checkerboard kernel smoothed using a function which tapers toward zero at the edges of the matrix kernel C.
-
15. The method of claim 14, wherein the function comprises a radially-symmetrical Gaussian taper.
-
16. The method of claim 9, wherein the matrix kernel C comprises a checkerboard kernel including four quadrants with ones (1s) in two opposing quadrants and zeros (0s) in two opposing quadrants.
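The kernel definitions in claims 10 through 12 are not reproduced in this text, so the following sketch builds the kernels from the verbal descriptions in claims 13 through 16. It assumes an even kernel width, assumes that the ones of the checkerboard and coherence kernels sit in the two quadrants on the main diagonal, and uses a hypothetical sigma parameter for the Gaussian taper.

```python
import math

def _same_quadrant_side(m, n, half):
    """True for the upper-left and lower-right quadrants."""
    return (m < half) == (n < half)

def checkerboard(width):
    """+1 in the two diagonal quadrants, -1 in the other two (width assumed even)."""
    half = width // 2
    return [[1.0 if _same_quadrant_side(m, n, half) else -1.0
             for n in range(width)] for m in range(width)]

def coherence(width):
    """1 in the two diagonal quadrants, 0 elsewhere (assumed orientation)."""
    half = width // 2
    return [[1.0 if _same_quadrant_side(m, n, half) else 0.0
             for n in range(width)] for m in range(width)]

def anti_coherence(width):
    """1 in the two off-diagonal quadrants, 0 elsewhere."""
    half = width // 2
    return [[0.0 if _same_quadrant_side(m, n, half) else 1.0
             for n in range(width)] for m in range(width)]

def gaussian_taper(kernel, sigma):
    """Scale each entry by a radially symmetric Gaussian centered on the kernel."""
    width = len(kernel)
    center = (width - 1) / 2.0
    return [[kernel[m][n]
             * math.exp(-((m - center) ** 2 + (n - center) ** 2) / (2 * sigma ** 2))
             for n in range(width)] for m in range(width)]
```

With width 2 these reduce to 2×2 kernels, and entry by entry the checkerboard equals the coherence kernel minus the anti-coherence kernel.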
-
17. The method of claim 9, wherein the novelty score D(i), where i is a frame number, is determined as follows:
-
wherein the matrix kernel C has a width of L and is centered at m=0,n=0.
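The expression for D(i) is not reproduced in this text. A plausible reading, assumed here, is that the kernel is slid along the main diagonal of S and D(i) is the sum of C(m,n)·S(i+m, i+n) over the kernel, with m and n measured from the kernel center. The sketch treats positions outside S as zero, which is also an assumption, and the function name is hypothetical.

```python
def novelty_score(S, C):
    """Correlate kernel C with S along the main diagonal.

    Kernel index a maps to offset m = a - L // 2, so for an even width L
    the offsets run from -L/2 to L/2 - 1. Out-of-range positions of S
    contribute nothing (zero padding, assumed).
    """
    n, width = len(S), len(C)
    half = width // 2
    scores = []
    for i in range(n):
        total = 0.0
        for a in range(width):
            for b in range(width):
                r, c = i + a - half, i + b - half
                if 0 <= r < n and 0 <= c < n:
                    total += C[a][b] * S[r][c]
        scores.append(total)
    return scores

# Toy matrix (hypothetical): frames 0-2 resemble each other, as do frames 3-5.
S = [[1.0 if (i < 3) == (j < 3) else 0.0 for j in range(6)] for i in range(6)]
D = novelty_score(S, [[1.0, -1.0], [-1.0, 1.0]])
```

For this toy matrix the score peaks at frame 3, the first frame after the change. The nonzero value at frame 0 is an edge effect of the zero padding.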
-
-
18. The method of claim 9, further comprising the step of:
thresholding the novelty score by determining points in the novelty score above a predetermined threshold.
-
19. The method of claim 18 further comprising the step of forming a binary tree structure from the points in the novelty score above the predetermined threshold by performing the steps of:
-
identifying one of the points above the threshold which is highest above the threshold as a root of the binary tree and dividing remaining points from the novelty score above the predetermined threshold into first left and first right points relative to the root point;
identifying a first left subsequent tree point which is highest above the threshold in the first left points and dividing remaining points from the first left points into second left and second right points relative to the left subsequent tree point; and
identifying a first right subsequent tree point which is highest above the threshold in the first right points and dividing remaining points from the first right points into third left and third right points relative to the right subsequent tree point.
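The thresholding of claim 18 and the tree construction of claim 19 can be sketched together. Claim 19 spells out only the root and the first left and right subsequent points; the sketch below assumes the same rule is applied recursively to every remaining group. The dictionary layout and function names are hypothetical.

```python
def threshold_points(scores, threshold):
    """Keep (frame index, score) pairs whose score exceeds the threshold."""
    return [(i, s) for i, s in enumerate(scores) if s > threshold]

def build_tree(points):
    """Build a binary tree from points ordered by frame index.

    The highest-scoring point becomes the node; points before it form
    the left subtree and points after it form the right subtree.
    """
    if not points:
        return None
    k = max(range(len(points)), key=lambda j: points[j][1])
    index, score = points[k]
    return {
        "index": index,
        "score": score,
        "left": build_tree(points[:k]),
        "right": build_tree(points[k + 1:]),
    }

scores = [0.1, 0.6, 0.2, 0.9, 0.3, 0.7, 0.1]
tree = build_tree(threshold_points(scores, 0.25))
```

Cutting the tree at a given depth then yields a coarser or finer set of boundaries, with the most prominent changes nearest the root.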
-
-
20. The method of claim 18 further comprising the steps of:
-
defining segments in the audio signal as groups of points in the novelty score above the threshold with each point in the group adjacent to a point above the threshold; and
audio gisting by averaging points in each of the segments and identifying a number of segments with points in the novelty score which are most similar.
-
-
21. The method of claim 18 further comprising the steps of:
-
defining segments in the audio signal as groups of points in the novelty score above the threshold with each point in the group adjacent to a point above the threshold; and
indexing the audio signal by defining numbers to identify locations of individual ones of the segments.
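The segment definition shared by claims 20 through 25, and the indexing step of claim 21, might be sketched as below. The sketch takes the wording literally: a segment is a run of adjacent frames whose novelty score is above the threshold. The function names and the choice of consecutive integers as index numbers are assumptions.

```python
def find_segments(scores, threshold):
    """Group adjacent above-threshold frames into (first, last) index pairs."""
    segments, start = [], None
    for i, s in enumerate(scores):
        if s > threshold:
            if start is None:
                start = i          # a new run begins here
        elif start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:          # run extends to the final frame
        segments.append((start, len(scores) - 1))
    return segments

def index_segments(segments):
    """Assign a number to each segment so its location can be looked up."""
    return {number: seg for number, seg in enumerate(segments)}
```

For example, find_segments([0, .5, .6, 0, 0, .7, .8, .9], 0.4) gives the two runs (1, 2) and (5, 7), and the first frame of each run is where browsing as in claim 22 would begin playback.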
-
-
22. The method of claim 21 further comprising the step of:
browsing the audio signal by playing a portion of the audio signal at the beginning of each segment.
-
23. The method of claim 18 further comprising the steps of:
-
defining segments in the audio signal as groups of points in the novelty score above the threshold with each point in the group adjacent to a point above the threshold; and
editing the audio signal by cutting portions of the audio signal which are not part of the segments.
-
-
24. The method of claim 18 further comprising the steps of:
-
defining segments in the audio signal as groups of points in the novelty score above the threshold with each point in the group adjacent to a point above the threshold; and
warping the audio signal so that segment boundaries occur at predetermined times in the audio signal.
-
-
25. The method of claim 18 further comprising the steps of:
-
defining segments in the audio signal as groups of points in the novelty score above the threshold with each point in the group adjacent to a point above the threshold; and
aligning portions of a video signal to the audio signal based on locations of the segments.
-
-
26. The method of claim 9, further comprising the steps of:
-
defining a beat spectrum by summing points forming a diagonal line in the matrix S;
correlating the novelty score with the beat spectrum; and
identifying peaks in the correlated novelty score and beat spectrum to determine a tempo for music in the audio signal source.
-
-
27. The method of claim 1, wherein the embedded parameters are provided in a form of a matrix S, the method further comprising the steps of:
-
correlating the matrix S with a coherence kernel to determine a first novelty score;
correlating the matrix S with an anti-coherence kernel to determine a second novelty score; and
determining a difference between the first novelty score and the second novelty score, wherein the coherence kernel is defined as follows:
and wherein the anti-coherence kernel is defined as follows:
-
-
28. The method of claim 1, wherein the embedded parameters are provided in a form of a matrix S, the method further comprising the steps of:
-
correlating the matrix S with a coherence kernel to determine a first novelty score;
correlating the matrix S with an anti-coherence kernel to determine a second novelty score; and
determining a difference between the first novelty score and the second novelty score, wherein the coherence and the anti-coherence kernels each comprise four quadrants with ones (1s) in two opposing quadrants and zeros (0s) in two opposing quadrants, wherein the ones (1s) in the coherence kernel are in opposing quadrants from the ones (1s) in the anti-coherence kernel.
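The two-kernel procedure of claims 27 and 28 can be sketched with 2×2 kernels. The orientation is an assumption: the coherence kernel is taken to have its ones on the main diagonal and the anti-coherence kernel on the off-diagonal. The correlation helper assumes zero padding at the edges, and all names are hypothetical.

```python
def correlate_along_diagonal(S, C):
    """Slide kernel C along the main diagonal of S (zero padding assumed)."""
    n, width = len(S), len(C)
    half = width // 2
    out = []
    for i in range(n):
        total = 0.0
        for a in range(width):
            for b in range(width):
                r, c = i + a - half, i + b - half
                if 0 <= r < n and 0 <= c < n:
                    total += C[a][b] * S[r][c]
        out.append(total)
    return out

COHERENCE = [[1.0, 0.0], [0.0, 1.0]]        # assumed: ones on the diagonal
ANTI_COHERENCE = [[0.0, 1.0], [1.0, 0.0]]   # assumed: ones off the diagonal

def novelty_difference(S):
    """First score (coherence) minus second score (anti-coherence)."""
    first = correlate_along_diagonal(S, COHERENCE)
    second = correlate_along_diagonal(S, ANTI_COHERENCE)
    return [a - b for a, b in zip(first, second)]

# Toy matrix (hypothetical): two internally similar blocks of three frames.
S = [[1.0 if (i < 3) == (j < 3) else 0.0 for j in range(6)] for i in range(6)]
diff = novelty_difference(S)
```

Because correlation is linear in the kernel, this difference equals the result of correlating S once with a checkerboard kernel of ones and negative ones in the same quadrants.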
-
-
29. The method of claim 1, wherein the embedded parameters are provided in a form of a matrix S, the method further comprising the step of:
defining a beat spectrum by summing points forming a diagonal line in the matrix.
-
30. The method of claim 29, wherein the diagonal line is a main diagonal of the matrix S.
-
31. The method of claim 29, wherein the diagonal line is a sub-diagonal of the matrix S, wherein the sub-diagonal is parallel to a main diagonal.
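A minimal sketch of the diagonal sums in claims 29 through 31 follows. It assumes that the beat spectrum value at lag l is the sum of S(i, i+l) along the diagonal offset by l from the main diagonal, with lag 0 being the main diagonal itself. The function names and the periodic toy matrix are hypothetical.

```python
def diagonal_sum(S, lag):
    """Sum the entries S[i][i + lag]; lag 0 is the main diagonal."""
    n = len(S)
    return sum(S[i][i + lag] for i in range(n - lag))

def beat_spectrum(S):
    """One diagonal sum per non-negative lag."""
    return [diagonal_sum(S, lag) for lag in range(len(S))]

# Toy matrix (hypothetical): frames repeat every 4 frames.
S = [[1.0 if (i - j) % 4 == 0 else 0.0 for j in range(8)] for i in range(8)]
B = beat_spectrum(S)
```

For this toy matrix the sums are nonzero only at lags 0 and 4, reflecting the repetition period. Note that longer lags sum fewer entries, so raw sums shrink with lag unless they are normalized.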
-
32. The method of claim 29 further comprising the step of:
identifying peaks in the beat spectrum to determine a tempo for music in the audio signal source.
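Claim 32 does not say how peaks are located or converted to a tempo. One simple possibility, assumed here, is to take local maxima of the beat spectrum at nonzero lag, pick the strongest, and convert its lag to beats per minute using the frame rate. The frames_per_second parameter and the function names are hypothetical.

```python
def find_peaks(B):
    """Lags (excluding 0 and the last lag) that are local maxima of B."""
    return [lag for lag in range(1, len(B) - 1)
            if B[lag] > B[lag - 1] and B[lag] >= B[lag + 1]]

def tempo_bpm(B, frames_per_second):
    """Tempo implied by the strongest peak, or None if there is no peak."""
    peaks = find_peaks(B)
    if not peaks:
        return None
    lag = max(peaks, key=lambda p: B[p])
    seconds_per_beat = lag / frames_per_second
    return 60.0 / seconds_per_beat
```

For example, a beat spectrum whose only peak is at lag 4, computed at 10 frames per second, implies a period of 0.4 seconds, or 150 beats per minute. The strongest peak can also fall on a multiple or subdivision of the true beat, so a practical system would likely need further checks.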
-
33. The method of claim 32 further comprising the steps of:
warping the audio signal so that the tempo of the audio signal matches a tempo of a second audio signal.
-
34. The method of claim 1, wherein the embedded parameters are provided in a form of a matrix S(i,j), wherein i identifies rows of the matrix and j identifies columns of the matrix, the method further comprising the step of:
auto-correlating the matrix S(i,j) to define a beat spectrum B(k,l), wherein k and l are predetermined integers, as follows:
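The autocorrelation formula of claim 34 is not reproduced in this text. The sketch below assumes the usual two-dimensional autocorrelation, B(k,l) = Σ S(i,j)·S(i+k, j+l), summed over all i and j for which both entries exist, and handles only non-negative k and l. The function name and toy matrix are hypothetical.

```python
def beat_spectrum_2d(S, k, l):
    """Autocorrelation of S at row shift k and column shift l (k, l >= 0)."""
    n = len(S)
    return sum(S[i][j] * S[i + k][j + l]
               for i in range(n - k)
               for j in range(n - l))

# Toy matrix (hypothetical): frames repeat every 4 frames.
S = [[1.0 if (i - j) % 4 == 0 else 0.0 for j in range(8)] for i in range(8)]
```

On this periodic toy matrix the value is large when the shift matches the 4-frame period, for example at (k, l) = (4, 0), and zero at a shift of (1, 0), which is the kind of peak structure claim 35 uses to estimate tempo.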
-
35. The method of claim 34 further comprising the step of:
identifying peaks in the beat spectrum to determine a tempo for music in the audio signal source.
-
36. The method of claim 1, wherein in the step of embedding the parameters, the second function is applied to determine self-similarity between one of the parameters and itself, as well as cross similarity between two different ones of the parameters.
Specification