System and method for automatic audio content analysis for word spotting, indexing, classification and retrieval
Abstract
A system and method for indexing an audio stream for subsequent information retrieval and for skimming, gisting, and summarizing the audio stream includes using special audio prefiltering such that only relevant speech segments that are generated by a speech recognition engine are indexed. Specific indexing features are disclosed that improve the precision and recall of an information retrieval system used after indexing for word spotting. The invention includes rendering the audio stream into intervals, with each interval including one or more segments. For each segment of an interval it is determined whether the segment exhibits one or more predetermined audio features such as a particular range of zero crossing rates, a particular range of energy, and a particular range of spectral energy concentration. The audio features are heuristically determined to represent respective audio events including silence, music, speech, and speech on music. Also, it is determined whether a group of intervals matches a heuristically predefined meta pattern such as continuous uninterrupted speech, concluding ideas, hesitations and emphasis in speech, and so on, and the audio stream is then indexed based on the interval classification and meta pattern matching, with only relevant features being indexed to improve subsequent precision of information retrieval. Also, alternatives for longer terms generated by the speech recognition engine are indexed along with respective weights, to improve subsequent recall.
36 Claims
-
1. A computer-implemented method for analyzing an audio signal, comprising:
-
detecting audio events in one or more intervals of the audio signal, each interval including a temporal sequence of one or more segments;
indexing the audio signal based on the audio events; and
skimming, gisting, or summarizing the audio signal using the indexing thereof.
-
2. The method of claim 1, further comprising:
-
processing only relevant portions of the audio signal using a speech recognition engine to render words from the signal;
receiving, from the engine, alternatives to at least some of the words;
receiving, from the engine, confidence levels for at least some of the words and alternatives; and
indexing the words and alternatives based at least in part on the confidence levels.
-
-
3. The method of claim 2, wherein alternatives are received only for words longer than “N” characters and having a confidence of greater than “x” percent.
-
4. The method of claim 3, wherein the words and alternatives are indexed based on respective weights.
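Illustrative note (not part of the claims): the indexing of words and alternatives recited in claims 2 through 4 might be sketched as follows. The record layout `Recognized`, the function `build_index`, and the choice of confidence divided by 100 as the weight are all assumptions made for this sketch; the claims do not specify how weights are computed or how the index is stored.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Recognized:
    """Hypothetical output record of a speech recognition engine."""
    word: str
    confidence: float  # percent, 0 to 100
    offset: float      # position in the audio signal, in seconds
    alternatives: List[Tuple[str, float]] = field(default_factory=list)


def build_index(
    results: List[Recognized], min_chars: int, min_confidence: float
) -> Dict[str, List[Tuple[float, float]]]:
    """Map each term to (offset, weight) postings.

    Alternatives are indexed only for words longer than min_chars ("N")
    with confidence greater than min_confidence ("x"), as in claim 3.
    The weight is assumed to be confidence / 100.
    """
    index: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for r in results:
        index[r.word].append((r.offset, r.confidence / 100))
        if len(r.word) > min_chars and r.confidence > min_confidence:
            for alt_word, alt_confidence in r.alternatives:
                index[alt_word].append((r.offset, alt_confidence / 100))
    return dict(index)


results = [
    Recognized("indexing", 80.0, 1.5, [("indenting", 40.0)]),
    Recognized("to", 90.0, 2.0, [("two", 50.0)]),
]
index = build_index(results, min_chars=3, min_confidence=50.0)
print(sorted(index))  # ['indenting', 'indexing', 'to']
```

In this sketch the short word "to" contributes no alternatives, so "two" never enters the index, while the alternative "indenting" is stored with a lower weight than the recognized word "indexing".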
-
5. The method of claim 1, further comprising heuristically defining the audio events.
-
6. The method of claim 1, wherein the detecting step comprises:
-
determining whether the segments of an interval exhibit one or more predetermined audio features, each audio feature being representative of at least one respective audio event, the audio events including at least music and speech;
classifying the intervals by associating the intervals with respective audio events in response to the means for determining;
determining whether at least one group of intervals matches a meta pattern in a predefined set of meta patterns; and
associating the group of intervals with a meta pattern classification when it is determined that the group of intervals matches a meta pattern, wherein the indexing of the audio signal is undertaken based on the interval classification and the meta pattern classification.
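Illustrative note (not part of the claims): one possible reading of the meta pattern matching in claim 6 is a search over the sequence of interval classifications. The sketch below assumes that a meta pattern such as continuous uninterrupted speech can be modeled as a run of at least `min_intervals` consecutive intervals with the same classification; the function name `find_runs` and that modeling choice are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def find_runs(
    interval_labels: List[str], event: str, min_intervals: int
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for every run of
    at least min_intervals consecutive intervals classified as event."""
    matches = []
    position = 0
    for label, group in groupby(interval_labels):
        length = len(list(group))
        if label == event and length >= min_intervals:
            matches.append((position, position + length))
        position += length
    return matches


labels = ["music", "speech", "speech", "speech", "silence", "speech"]
print(find_runs(labels, "speech", 3))  # [(1, 4)]
```

Each matched group could then be stored in the index alongside the per-interval classifications.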
-
-
7. The method of claim 6, wherein each predetermined audio feature is based on one or more of:
- zero crossing rate of at least a portion of the audio signal, energy of at least a portion of the audio signal, spectral energy concentration of at least a portion of the audio signal, and frequency of at least a portion of the audio signal.
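Illustrative note (not part of the claims): two of the features named in claim 7 can be computed directly from the samples of a segment. The sketch below uses common textbook definitions of zero crossing rate and mean energy, which are assumptions here because the claims do not define the features; spectral energy concentration and frequency are omitted because they require a spectral transform.

```python
from typing import Sequence


def zero_crossing_rate(samples: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def mean_energy(samples: Sequence[float]) -> float:
    """Mean squared amplitude of the segment."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


segment = [0.5, -0.5, 0.5, -0.5]
print(zero_crossing_rate(segment))  # 1.0
print(mean_energy(segment))         # 0.25
```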
-
8. The method of claim 6, wherein the predefined set of audio events further comprises silence, speech on music, emphasis in speech, hesitation in speech, and concluding ideas in speech.
-
9. The method of claim 6, further comprising:
normalizing the segments, prior to the classifying step.
-
10. The method of claim 6, wherein the step of determining whether the segments of an interval exhibit one or more predetermined audio features further includes:
-
determining, for each segment in an interval, whether one or more audio features associated with the segment equals a respective threshold;
incrementing respective one or more counters associated with the one or more audio features when the respective features equal respective thresholds; and
comparing the one or more counters to the number of segments in the interval, the logic means for classifying the intervals undertaking the classifying of intervals based on the comparing step.
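Illustrative note (not part of the claims): the counter-based classification of claim 10 might look like the sketch below. The feature tests, the numeric ranges, the `min_fraction` cutoff, and the "unknown" fallback are hypothetical values chosen for illustration. The claim speaks of a feature equaling a threshold; the sketch assumes range tests, in line with the ranges mentioned in the abstract.

```python
from typing import Callable, Dict, List

Segment = Dict[str, float]  # hypothetical layout: feature name -> value
FeatureTest = Callable[[Segment], bool]


def classify_interval(
    segments: List[Segment],
    tests: Dict[str, FeatureTest],
    min_fraction: float = 0.5,
) -> str:
    """Count, per audio event, the segments that pass its feature test,
    then compare the largest counter to the number of segments."""
    counters = {event: 0 for event in tests}
    for segment in segments:
        for event, test in tests.items():
            if test(segment):
                counters[event] += 1
    if not segments or not counters:
        return "unknown"
    best = max(counters, key=lambda event: counters[event])
    if counters[best] / len(segments) < min_fraction:
        return "unknown"
    return best


# Hypothetical ranges; real values would be determined heuristically.
TESTS: Dict[str, FeatureTest] = {
    "silence": lambda s: s["energy"] < 0.01,
    "speech": lambda s: s["energy"] >= 0.01 and 0.05 <= s["zcr"] <= 0.5,
}
interval = [
    {"energy": 0.001, "zcr": 0.0},
    {"energy": 0.002, "zcr": 0.1},
    {"energy": 0.2, "zcr": 0.2},
    {"energy": 0.003, "zcr": 0.0},
]
print(classify_interval(interval, TESTS))  # silence
```

Here three of the four segments pass the silence test, so the silence counter reaches 75% of the segment count and the interval is classified as silence.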
-
-
11. The method of claim 10, further comprising:
-
determining one or more dominant frequencies in at least one interval classified as speech during the step of classifying the intervals;
associating one or more segments with emphasis in speech when the one or more segments includes a top N% of the dominant frequencies, wherein N is a number; and
associating one or more segments with concluding ideas in speech when the one or more segments includes a bottom N% of the dominant frequencies, wherein N is a number.
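Illustrative note (not part of the claims): claim 11 does not say how the "top N%" and "bottom N%" of the dominant frequencies are selected. The sketch below assumes one dominant frequency per segment and ranks the segments by that value, taking N% of the segment count at each end; the function name, the label strings, and the rounding rule are hypothetical.

```python
import math
from typing import List


def label_by_dominant_frequency(
    frequencies: List[float], n_percent: float
) -> List[str]:
    """Label segments of a speech interval by rank of dominant frequency.

    The highest n_percent of segments are labeled "emphasis" and the
    lowest n_percent "concluding"; the rest stay "speech".
    """
    count = len(frequencies)
    # Cap k so the top and bottom groups never overlap.
    k = min(math.ceil(count * n_percent / 100), count // 2)
    order = sorted(range(count), key=lambda i: frequencies[i])
    labels = ["speech"] * count
    for i in order[:k]:
        labels[i] = "concluding"
    for i in order[count - k:]:
        labels[i] = "emphasis"
    return labels


freqs = [120.0, 180.0, 150.0, 240.0, 100.0, 200.0, 160.0, 130.0, 170.0, 140.0]
labels = label_by_dominant_frequency(freqs, 20)
print(labels[3], labels[5])  # emphasis emphasis
print(labels[4], labels[0])  # concluding concluding
```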
-
-
12. The method of claim 11, further comprising determining whether temporally sequential segments, all associated with emphasis in speech, define a period greater than a predetermined period, and if so, defining and indexing the temporally sequential segments as an important idea in speech.
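Illustrative note (not part of the claims): the duration test of claim 12 can be sketched as a scan for runs of consecutive segments labeled as emphasis. The sketch assumes segments of equal length `segment_seconds` and a label string "emphasis"; both are hypothetical, as is the function name `important_ideas`.

```python
from typing import List, Optional, Tuple


def important_ideas(
    labels: List[str], segment_seconds: float, min_seconds: float
) -> List[Tuple[int, int]]:
    """Return (start, end) segment index pairs, end exclusive, for runs
    of "emphasis" segments lasting longer than min_seconds."""
    runs = []
    start: Optional[int] = None
    # The trailing None is a sentinel that closes a run at the end.
    for i, label in enumerate(list(labels) + [None]):
        if label == "emphasis":
            if start is None:
                start = i
        elif start is not None:
            if (i - start) * segment_seconds > min_seconds:
                runs.append((start, i))
            start = None
    return runs


segment_labels = [
    "speech", "emphasis", "emphasis", "emphasis", "speech", "emphasis",
]
print(important_ideas(segment_labels, 1.0, 2.0))  # [(1, 4)]
```

The single emphasized segment at the end lasts only one second, so it is not indexed as an important idea under the assumed two-second minimum.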
-
13. A computer-implemented method for analyzing an audio signal, comprising:
-
detecting audio events in one or more intervals of the audio signal, each interval including a temporal sequence of one or more segments;
analyzing the audio events to identify speech boundaries with associated speech confidence levels;
indexing the audio signal based on the speech boundaries and confidence levels using heuristically determined rules to improve precision;
indexing alternatives to at least one recognized word in the audio signal along with an associated weight to improve recall; and
undertaking one or more of;
word spotting, summarizing, and skimming, the audio signal using the indexing thereof.
-
-
14. A computer including a data storage device including a computer usable medium having computer usable code means for classifying and indexing at least one audio signal representing an audio event, the computer usable code means having:
-
logic means for rendering the audio signal into intervals, each interval including one or more segments;
logic means for determining whether the segments of an interval exhibit one or more predetermined audio features, each audio feature being representative of at least one respective audio event;
logic means for classifying the intervals by associating the intervals with respective audio events in response to the means for determining;
logic means for determining whether at least one group of intervals matches a meta pattern in a predefined set of meta patterns;
logic means for associating the group of intervals with a meta pattern classification when it is determined that the group of intervals matches a meta pattern; and
logic means for indexing the audio signal based on interval classifications and meta pattern classifications.
-
15. The computer of claim 14, further comprising:
-
logic means for processing only relevant portions of the audio signal using a speech recognition engine to render words from the signal;
logic means for receiving, from the engine, alternatives to at least some of the words;
logic means for receiving, from the engine, confidence levels for at least some of the words and alternatives; and
logic means for indexing the words and alternatives based at least in part on the confidence levels.
-
-
16. The computer of claim 15, wherein alternatives are received only for words longer than “N” characters and having a confidence of greater than “x” percent.
-
17. The computer of claim 16, wherein the words and alternatives are indexed based on respective weights.
-
18. The computer of claim 14, wherein each predetermined audio feature is based on one or more of:
zero crossing rate of at least a portion of the audio signal, energy of at least a portion of the audio signal, spectral energy concentration of at least a portion of the audio signal; and
frequency.
-
19. The computer of claim 14, wherein the predefined set of audio events comprises music, speech, silence, and speech on music.
-
20. The computer of claim 14, further comprising:
logic means for normalizing the segments, prior to classifying the intervals.
-
21. The computer of claim 19, wherein the predefined set of patterns includes continuous uninterrupted speech, and music combined with speech, the predefined set of patterns being heuristically defined.
-
22. The computer of claim 19, further comprising logic means for presenting at least portions of the intervals and meta pattern classifications for skimming, gisting, and summarizing the audio signal, using the indexing of the signal.
-
23. The computer of claim 14, wherein the logic means for determining whether the segments of an interval exhibit one or more predetermined audio features includes:
-
means for determining, for each segment in an interval, whether one or more audio features associated with the segment equals a respective threshold;
means for incrementing respective one or more counters associated with the one or more audio features when the respective features equal respective thresholds; and
means for comparing the one or more counters to the number of segments in the interval, the logic means for classifying the intervals undertaking the classifying of intervals based on the means for comparing.
-
-
24. The computer of claim 14, wherein the predefined set of audio event meta patterns further includes emphasis in speech, hesitation in speech, and concluding ideas in speech, such that the logic means for indexing can index the audio signal based thereon.
-
25. The computer of claim 24, further comprising:
-
means for determining one or more dominant frequencies in at least one interval classified as speech by the logic means for classifying the intervals;
means for associating one or more segments with emphasis in speech when the one or more segments includes a top N% of the dominant frequencies, wherein N is a number; and
means for associating one or more segments with concluding ideas in speech when the one or more segments includes a bottom N% of the dominant frequencies, wherein N is a number.
-
-
26. The computer of claim 25, further comprising means for determining whether temporally sequential segments, all associated with emphasis in speech, define a period greater than a predetermined period, and if so, indexing the temporally sequential segments as an important idea in speech.
-
27. A computer program product comprising:
-
a computer program storage device readable by a digital processing apparatus; and
a program means on the program storage device and including program code elements embodying instructions executable by the digital processing apparatus for performing method steps for indexing at least one audio signal, the method steps comprising;
rendering the audio signal into intervals, each interval including one or more segments;
determining whether the segments of an interval exhibit one or more predetermined audio features selected from a set of features including zero crossing rate of at least a portion of the audio signal, energy of at least a portion of the audio signal, frequency of at least a portion of the audio signal, and spectral energy concentration of at least a portion of the audio signal, each audio feature being representative of at least one respective audio event including at least music and speech;
classifying the intervals by associating the intervals with respective audio events in response to the determining step; and
indexing the audio signal based at least in part on the interval classification.
-
28. The computer program product of claim 27, wherein the method steps further comprise:
-
processing only relevant portions of the audio signal using a speech recognition engine to render words from the signal;
receiving, from the engine, alternatives to at least some of the words;
receiving, from the engine, confidence levels for at least some of the words and alternatives; and
indexing the words and alternatives based at least in part on the confidence levels.
-
-
29. The computer program product of claim 28, wherein alternatives are received only for words longer than “N” characters and having a confidence of greater than “x” percent.
-
30. The computer program product of claim 29, wherein the words and alternatives are indexed based on respective weights.
-
31. The computer program product of claim 27, wherein the method steps further comprise:
-
determining whether at least one group of intervals matches a meta pattern in a predefined set of meta patterns; and
associating the group of intervals with a meta pattern classification when it is determined that the group of intervals matches a meta pattern, the indexing of the audio signal being based at least in part on the meta pattern matching.
-
-
32. The computer program product of claim 31, wherein the predefined set of audio events further comprises silence, speech on music, emphasis in speech, hesitation in speech, and concluding ideas in speech.
-
33. The computer program product of claim 31, wherein the method steps further comprise:
normalizing the segments, prior to the classifying step.
-
34. The computer program product of claim 31, wherein the method steps further include:
-
determining, for each segment in an interval, whether one or more audio features associated with the segment equals a respective threshold;
incrementing respective one or more counters associated with the one or more audio features when the respective features equal respective thresholds; and
comparing the one or more counters to the number of segments in the interval, the logic means for classifying the intervals undertaking the classifying of intervals based on the means for comparing.
-
-
35. The computer program product of claim 34, wherein the method steps further comprise:
-
determining one or more dominant frequencies in at least one interval classified as speech during the step of classifying the intervals;
associating one or more segments with emphasis in speech when the one or more segments includes a top N% of the dominant frequencies, wherein N is a number; and
associating one or more segments with concluding ideas in speech when the one or more segments includes a bottom N% of the dominant frequencies, wherein N is a number.
-
-
36. The computer program product of claim 35, wherein the method steps further comprise determining whether temporally sequential segments, all associated with emphasis in speech, define a period greater than a predetermined period, and if so, defining and indexing the temporally sequential segments as an important idea in speech.
Specification