Neural network classifier for separating audio sources from a monophonic audio signal

US 20070083365A1
Filed: 10/06/2005
Published: 04/12/2007
Est. Priority Date: 10/06/2005
Status: Abandoned Application

Abstract

A neural network classifier provides the ability to separate and categorize multiple arbitrary and previously unknown audio sources down-mixed to a single monophonic audio signal. This is accomplished by breaking the monophonic audio signal into baseline frames (possibly overlapping), windowing the frames, extracting a number of descriptive features in each frame, and employing a pre-trained nonlinear neural network as a classifier. Each neural network output manifests the presence of a pre-determined type of audio source in each baseline frame of the monophonic audio signal. The neural network classifier is well suited to address widely changing parameters of the signal and sources, time and frequency domain overlapping of the sources, and reverberation and occlusions in real-life signals. The classifier outputs can be used as a front-end to create multiple audio channels for a source separation algorithm (e.g., ICA) or as parameters in a post-processing algorithm (e.g., to categorize music, track sources, or generate audio indexes for the purposes of navigation, re-mixing, security and surveillance, telephone and wireless communications, and teleconferencing).
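
The framing and windowing steps described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`frame_signal`, `hann_window`, `windowed_frames`), the Hann window, and the 50% overlap are all hypothetical choices, since the text above names no particular window or overlap.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into frames of frame_len; hop < frame_len gives overlapping frames."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def hann_window(n):
    """Symmetric Hann window of length n (an assumed window; the source names none)."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def windowed_frames(samples, frame_len, hop):
    """Frame the signal, then multiply every frame by the window."""
    window = hann_window(frame_len)
    return [[x * w for x, w in zip(frame, window)]
            for frame in frame_signal(samples, frame_len, hop)]


# Toy input: 64 samples of a sinusoid, 16-sample frames with 50% overlap.
signal = [math.sin(2.0 * math.pi * 5.0 * t / 64.0) for t in range(64)]
frames = windowed_frames(signal, frame_len=16, hop=8)
print(len(frames), len(frames[0]))  # prints: 7 16
```

Each windowed frame would then be passed to the feature extractor; trailing samples that do not fill a whole frame are simply dropped in this sketch.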

27 Claims

1. A method for separating audio sources from a monophonic audio signal, comprising:
- (a) providing a monophonic audio signal comprising a down-mix of a plurality of unknown audio sources;
  
  (b) separating the audio signal into a sequence of baseline frames;
  
  (c) windowing each frame;
  
  (d) extracting a plurality of audio features from each baseline frame that tend to distinguish the audio sources; and
  
  (e) applying the audio features to a neural network (NN) classifier trained on a representative set of audio sources with said audio features, said neural network classifier outputting at least one measure of an audio source included in each said baseline frame of the monophonic audio signal.
- - 2. The method of claim 1, wherein the plurality of unknown audio sources are selected from a set of musical sources comprising at least voice, string and percussive.
  - 3. The method of claim 1, further comprising:
    - repeating steps (b) through (d) for a different frame size to extract features at multiple resolutions; and
      
      scaling the extracted audio features at the different resolutions to the baseline frame.
  - 4. The method of claim 3, further comprising applying the scaled features at each resolution to the NN classifier.
  - 5. The method of claim 3, further comprising fusing the scaled features at each resolution into a single feature that is applied to the NN classifier.
  - 6. The method of claim 1, further comprising filtering the frames into a plurality of frequency sub-bands and extracting said audio features from said sub-bands.
  - 7. The method of claim 1, further comprising low-pass filtering the classifier outputs.
  - 8. The method of claim 1, wherein one or more audio features are selected from a set comprising tonal components, tone-to-noise ratio (TNR) and Cepstrum peak.
  - 9. The method of claim 8, wherein the tonal components are extracted by:
    - (f) applying a frequency transform to the windowed signal for each frame;
      
      (g) computing the magnitude of spectral lines in the frequency transform;
      
      (h) estimating a noise-floor;
      
      (i) identifying as tonal components the spectral components that exceed the noise floor by a threshold amount; and
      
      (j) outputting the number of tonal components as the tonal component feature.
  - 10. The method of claim 9, wherein the length of the frequency transform equals the number of audio samples in the frame for a certain time-frequency resolution.
  - 11. The method of claim 10, further comprising:
    - repeating the steps (f) through (i) for different frame and transform lengths; and
      
      outputting a cumulative number of tonal components at each time-frequency resolution.
  - 12. The method of claim 8, wherein the TNR feature is extracted by:
    - (k) applying a frequency transform to the windowed signal for each frame;
      
      (l) computing the magnitude of spectral lines in the frequency transform;
      
      (m) estimating a noise-floor;
      
      (n) determining a ratio of the energy of identified tonal components to the noise floor; and
      
      (o) outputting the ratio as the TNR feature.
  - 13. The method of claim 12, wherein the length of the frequency transform equals the number of audio samples in the frame for a certain time-frequency resolution.
  - 14. The method of claim 13, further comprising:
    - repeating the steps (k) through (n) for different frame and transform lengths; and
      
      averaging the ratios from the different resolutions over a time period equal to the baseline frame.
  - 15. The method of claim 12, wherein the noise floor is estimated by:
    - (p) applying a low-pass filter over magnitudes of spectral lines;
      
      (q) marking components sufficiently above the filter output;
      
      (r) replacing the marked components with the low-pass filter output;
      
      (s) repeating steps (p) through (r) a number of times; and
      
      (t) outputting the resulting components as the noise floor estimation.
  - 16. The method of claim 1, wherein the Neural Network classifier includes a plurality of output neurons that each indicate the presence of a certain audio source in the monophonic audio signal.
  - 17. The method of claim 16, wherein the value of each output neuron indicates a confidence that the baseline frame includes the certain audio source.
  - 18. The method of claim 1, further comprising using the measure to remix the monophonic audio signal into a plurality of audio channels for the respective audio sources in the representative set.
  - 19. The method of claim 18, wherein the monophonic audio signal is remixed by switching it to the audio channel identified as the most prominent.
  - 20. The method of claim 18, wherein the Neural Network classifier outputs a measure for each of the audio sources in the representative set that indicates a confidence that the frame includes the corresponding audio source, said monophonic audio signal being attenuated by each of said measures and directed to the respective audio channels.
  - 21. The method of claim 18, further comprising processing said plurality of audio channels using a source separation algorithm that requires at least as many input audio channels as audio sources to separate said plurality of audio channels into an equal or lesser plurality of said audio sources.
  - 22. The method of claim 21, wherein said source separation algorithm is based on blind source separation (BSS).
  - 23. The method of claim 1, further comprising passing the monophonic audio signal and the sequence of said measures to a post-processor that uses said measures to augment the post-processing of the monophonic audio signal.
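
The tonal-component count (claim 9), the TNR feature (claim 12), and the iterative noise-floor estimate (claim 15) can be illustrated together. The sketch below is hedged throughout: the moving-average low-pass filter, the factor-of-two marking rule, the 12 dB tonal threshold, the iteration count, and all function names are assumptions made for illustration, and the naive DFT stands in for whatever frequency transform an implementation would actually use.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency DFT bins (naive O(n^2) DFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def moving_average(values, width):
    """Low-pass filter across spectral lines (assumed filter type)."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def estimate_noise_floor(mags, width=5, ratio=2.0, iterations=3):
    """Smooth, mark lines well above the smoothed curve, replace them, repeat."""
    floor = list(mags)
    for _ in range(iterations):
        smooth = moving_average(floor, width)
        floor = [s if f > ratio * s else f for f, s in zip(floor, smooth)]
    return floor


def tonal_bins(mags, floor, threshold_db=12.0):
    """Indices of spectral lines exceeding the noise floor by threshold_db."""
    eps = 1e-12  # guards against log of zero
    return [k for k, (m, f) in enumerate(zip(mags, floor))
            if 20.0 * math.log10((m + eps) / (f + eps)) > threshold_db]


def tone_to_noise_ratio_db(mags, floor, bins):
    """Energy of the tonal lines relative to the energy of the noise floor, in dB."""
    eps = 1e-12
    tonal_energy = sum(mags[k] ** 2 for k in bins)
    noise_energy = sum(f ** 2 for f in floor)
    return 10.0 * math.log10((tonal_energy + eps) / (noise_energy + eps))


# Toy frame: one sinusoid centered on bin 8 of a 64-point transform.
frame = [math.sin(2.0 * math.pi * 8.0 * t / 64.0) for t in range(64)]
frame[0] += 0.05  # a small impulse adds a flat 0.05 floor to every bin
mags = magnitude_spectrum(frame)
floor = estimate_noise_floor(mags)
bins = tonal_bins(mags, floor)
tnr_db = tone_to_noise_ratio_db(mags, floor, bins)
print(len(bins), bins, round(tnr_db, 1))
```

On this toy frame only bin 8 is flagged as tonal, so the tonal-component feature is 1. Treating the noise-floor energy as the sum over all bins is one possible reading of "ratio of the energy of identified tonal components to the noise floor"; other normalizations would be equally consistent with the claim language.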

24. A method for separating audio sources from a monophonic audio signal, comprising:
- (a) providing a monophonic audio signal comprising a down-mix of a plurality of unknown audio sources;
  
  (b) separating the audio signal into a sequence of baseline frames;
  
  (c) windowing each frame;
  
  (d) extracting a plurality of audio features from each baseline frame that tend to distinguish the audio sources;
  
  (e) repeating steps (b) through (d) with a different frame size to extract features at multiple resolutions;
  
  (f) scaling the extracted audio features at the different resolutions to the baseline frame; and
  
  (g) applying the audio features to a neural network (NN) classifier trained on a representative set of audio sources with said audio features, said neural network classifier having a plurality of output neurons that each signal the presence of a certain audio source in the monophonic audio signal for each baseline frame.
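
Steps (e) and (f) of claim 24 extract the same feature at several frame sizes and then scale the results to the baseline frame. One way to picture the scaling step is shown below. The sketch assumes non-overlapping frames whose lengths divide evenly into (or are multiples of) the baseline length; the helper name `scale_to_baseline`, the averaging default, and the toy feature values are hypothetical.

```python
def scale_to_baseline(values, frame_len, baseline_len, combine=None):
    """Map one feature value per frame of frame_len onto baseline frames.

    Shorter frames are combined (averaged by default) per baseline frame;
    longer frames are repeated for every baseline frame they cover.
    """
    if frame_len <= 0 or baseline_len <= 0:
        raise ValueError("frame lengths must be positive")
    if frame_len <= baseline_len:
        if baseline_len % frame_len != 0:
            raise ValueError("frame_len must divide baseline_len")
        per = baseline_len // frame_len
        combine = combine or (lambda chunk: sum(chunk) / len(chunk))
        return [combine(values[i:i + per])
                for i in range(0, len(values) - per + 1, per)]
    if frame_len % baseline_len != 0:
        raise ValueError("baseline_len must divide frame_len")
    repeat = frame_len // baseline_len
    return [v for v in values for _ in range(repeat)]


# Toy feature values covering 1024 samples at three resolutions.
baseline = [10.0, 20.0]           # one value per 512-sample baseline frame
short = [1.0, 3.0, 2.0, 4.0]      # one value per 256-sample frame
long_ = [5.0]                     # one value per 1024-sample frame

averaged = scale_to_baseline(short, 256, 512)                  # as in claim 14
cumulative = scale_to_baseline(short, 256, 512, combine=sum)   # as in claim 11
held = scale_to_baseline(long_, 1024, 512)

# One feature vector per baseline frame, ready for the classifier input.
vectors = [list(row) for row in zip(baseline, averaged, held)]
print(vectors)  # prints: [[10.0, 2.0, 5.0], [20.0, 3.0, 5.0]]
```

Averaging suits ratio-like features such as TNR, while summation suits counts such as the number of tonal components; which rule applies to which feature is a design choice the sketch leaves to the `combine` argument.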

25. An audio source classifier, comprising:
- A framer for separating a monophonic audio signal comprising a down-mix of a plurality of unknown audio sources into a sequence of windowed baseline frames;
  
  A feature extractor for extracting a plurality of audio features from each baseline frame that tend to distinguish the audio sources; and
  
  A neural network (NN) classifier trained on a representative set of audio sources with said audio features, said neural network classifier receiving the extracted audio features and outputting at least one measure of an audio source included in each said baseline frame of the monophonic audio signal.
- - 26. The audio source classifier of claim 25, wherein the feature extractor extracts one or more of the audio features at multi time-frequency resolutions.
  - 27. The audio source classifier of claim 25, wherein the NN classifier has a plurality of output neurons that each signal the presence of a certain audio source in the monophonic audio signal for each baseline frame.
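
The per-frame output-neuron values can drive the remixing described in claims 18 through 20, optionally after the low-pass filtering of claim 7. The following sketch does not model the neural network itself; it starts from confidence values that a trained classifier is assumed to have produced. The function names, the one-pole smoothing filter, and the use of non-overlapping frames are illustrative assumptions.

```python
def smooth_outputs(confidences, alpha=0.5):
    """One-pole low-pass filter over time, applied to each output neuron."""
    smoothed = []
    state = None
    for conf in confidences:
        if state is None:
            state = list(conf)
        else:
            state = [alpha * c + (1.0 - alpha) * s for c, s in zip(conf, state)]
        smoothed.append(state)
    return smoothed


def remix_soft(frames, confidences):
    """Claim 20 style: every channel receives the frame scaled by its confidence."""
    n_sources = len(confidences[0])
    return [[[conf[k] * x for x in frame]
             for frame, conf in zip(frames, confidences)]
            for k in range(n_sources)]


def remix_hard(frames, confidences):
    """Claim 19 style: each frame is switched to the most prominent channel."""
    n_sources = len(confidences[0])
    channels = [[] for _ in range(n_sources)]
    for frame, conf in zip(frames, confidences):
        winner = max(range(n_sources), key=lambda k: conf[k])
        for k in range(n_sources):
            channels[k].append(list(frame) if k == winner else [0.0] * len(frame))
    return channels


# Two baseline frames, two hypothetical sources (e.g. voice and percussive).
frames = [[1.0, -1.0], [0.5, 0.5]]
confidences = [[0.9, 0.1], [0.2, 0.8]]  # assumed classifier outputs in [0, 1]

soft = remix_soft(frames, confidences)
hard = remix_hard(frames, confidences)
smoothed = smooth_outputs(confidences)
print(hard)  # prints: [[[1.0, -1.0], [0.0, 0.0]], [[0.0, 0.0], [0.5, 0.5]]]
```

The resulting channels could then feed a source separation algorithm that needs at least as many input channels as sources, as in claim 21. With overlapping frames, the per-channel frames would additionally need to be recombined by overlap-add, which this sketch omits.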

Current Assignee: DTS, Inc. (Adeia, Inc.)
Original Assignee: DTS, Inc. (Adeia, Inc.)
Inventors: Shmunk, Dmitri
Application Number: US11/244,554
Publication Number: US 20070083365A1

Field of Search
US Class Current: 704/232
CPC Class Codes:
G10L 21/0272 Voice signal separating
G10L 25/30 using neural networks
