Method and apparatus for automatically recognizing audio data
First Claim
1. A method of identifying, among a plurality of music audio files in digital format generated by machine, a first one of the music audio files, the method employing a segment of audio data which is derived from the first music audio file and comprising the steps of:
- (a) inputting the segment of audio data generated by the machine into three different extraction processes, the three different extraction processes including (1) an IMFCC1 (first improved mel frequency cepstrum coefficients) extraction process, the IMFCC1 extraction process performing a conventional MFCC (mel frequency cepstrum coefficients) algorithm but not performing a logarithmic step of the conventional MFCC algorithm, wherein IMFCC1 audio features are output, (2) an IMFCC2 (second improved mel frequency cepstrum coefficients) extraction process, the IMFCC2 extraction process performing the conventional MFCC algorithm but not performing both the logarithmic step and a discrete cosine transform step of the conventional MFCC algorithm and instead performing an ICA (independent component analysis) process, wherein IMFCC2 audio features are output, and (3) an ICA1 (improved independent component analysis) extraction process performing a conventional ICA (independent component analysis) process but subjecting the segment of audio data to pre-emphasis preprocessing and windowing preprocessing, wherein ICA1 features are output;
(b) creating an observation vector containing the IMFCC1 audio features, the IMFCC2 audio features and the ICA1 audio features; and
(c) recognizing the machine generated first music audio file using the observation vector and a database trained using only observation vectors containing IMFCC1, IMFCC2 and ICA1 audio features for each respective target music audio file;
wherein the audio features comprise features obtained by analyzing the audio data, or a transformed version of the audio data, to derive a transform based on its audio features, and applying the transform to the audio data, or the transformed version of the audio data respectively, to obtain amplitudes of the audio features.
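The IMFCC1 chain in element (a)(1) is the conventional MFCC pipeline (pre-emphasis, windowing, power spectrum, mel filter bank, DCT) with the logarithmic step left out. A minimal numpy sketch, for illustration only — the sample rate, filter count, and coefficient count are assumptions, not values taken from the claim:

```python
import numpy as np
from scipy.fft import dct

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale (standard construction)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_points = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)  # rising slope
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)  # falling slope
    return fbank

def imfcc1(frame, sr=16000, n_filters=26, n_coeffs=13):
    """IMFCC1 sketch: the conventional MFCC chain with the log step omitted."""
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])  # pre-emphasis
    frame = frame * np.hamming(len(frame))                      # windowing
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n_fft             # power spectrum
    energies = mel_filterbank(n_filters, n_fft, sr) @ power     # mel filter bank
    # Conventional MFCC would take np.log(energies) here; per claim element
    # (a)(1), IMFCC1 skips the logarithmic step and goes straight to the DCT.
    return dct(energies, type=2, norm='ortho')[:n_coeffs]
```

Applied per frame, this yields one IMFCC1 feature vector per windowed segment of the input audio.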
Abstract
A method and apparatus are proposed for automatically recognizing observed audio data. An observation vector is created of audio features extracted from the observed audio data, and the observed audio data is recognized from the observation vector. The audio features are selected from a group of three types of features obtained from the observed audio data: (i) ICA features obtained by processing the observed audio data, (ii) first MFCC features obtained by removing a logarithm step from the conventional MFCC process, or (iii) second MFCC features obtained by applying the ICA process to results of a mel scale filter bank.
32 Citations
18 Claims
1. A method of identifying, among a plurality of music audio files in digital format generated by machine, a first one of the music audio files, the method employing a segment of audio data which is derived from the first music audio file and comprising the steps of:
(a) inputting the segment of audio data generated by the machine into three different extraction processes, the three different extraction processes including (1) an IMFCC1 (first improved mel frequency cepstrum coefficients) extraction process, the IMFCC1 extraction process performing a conventional MFCC (mel frequency cepstrum coefficients) algorithm but not performing a logarithmic step of the conventional MFCC algorithm, wherein IMFCC1 audio features are output, (2) an IMFCC2 (second improved mel frequency cepstrum coefficients) extraction process, the IMFCC2 extraction process performing the conventional MFCC algorithm but not performing both the logarithmic step and a discrete cosine transform step of the conventional MFCC algorithm and instead performing an ICA (independent component analysis) process, wherein IMFCC2 audio features are output, and (3) an ICA1 (improved independent component analysis) extraction process performing a conventional ICA (independent component analysis) process but subjecting the segment of audio data to pre-emphasis preprocessing and windowing preprocessing, wherein ICA1 features are output; (b) creating an observation vector containing the IMFCC1 audio features, the IMFCC2 audio features and the ICA1 audio features; and (c) recognizing the machine generated first music audio file using the observation vector and a database trained using only observation vectors containing IMFCC1, IMFCC2 and ICA1 audio features for each respective target music audio file;
wherein the audio features comprise features obtained by analyzing the audio data, or a transformed version of the audio data, to derive a transform based on its audio features, and applying the transform to the audio data, or the transformed version of the audio data respectively, to obtain amplitudes of the audio features. - View Dependent Claims (2, 3, 4)
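Step (b) amounts to concatenating the three feature sets into a single observation vector. A one-function sketch (the function name and feature dimensions are hypothetical, not specified by the claim):

```python
import numpy as np

def observation_vector(imfcc1_feats, imfcc2_feats, ica1_feats):
    """Step (b): one observation vector holding the IMFCC1, IMFCC2 and
    ICA1 feature sets, flattened and joined end to end."""
    return np.concatenate([np.ravel(imfcc1_feats),
                           np.ravel(imfcc2_feats),
                           np.ravel(ica1_feats)])
```

The same concatenated layout would be used both when training the database and when recognizing, so that observation vectors are comparable.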
5. A method of identifying, among a plurality of music audio files in digital format generated by machine, a first one of the music audio files, the method employing a segment of audio data which is derived from the first music audio file and comprising the steps of:
(a) inputting the segment of audio data generated by the machine into three different extraction processes, the three different extraction processes including (1) an IMFCC1 (first improved mel frequency cepstrum coefficients) extraction process, the IMFCC1 extraction process performing a conventional MFCC (mel frequency cepstrum coefficients) algorithm but not performing a logarithmic step of the conventional MFCC algorithm, wherein IMFCC1 audio features are output, (2) an IMFCC2 (second improved mel frequency cepstrum coefficients) extraction process, the IMFCC2 extraction process performing the conventional MFCC algorithm but not performing both the logarithmic step and a discrete cosine transform step of the conventional MFCC algorithm and instead performing an ICA (independent component analysis) process, wherein IMFCC2 audio features are output, and (3) an ICA1 (improved independent component analysis) extraction process performing a conventional ICA (independent component analysis) process but subjecting the segment of audio data to pre-emphasis preprocessing and windowing preprocessing, wherein ICA1 audio features are output; (b) creating an observation vector containing the IMFCC1 audio features, the IMFCC2 audio features and the ICA1 audio features; and (c) recognizing the machine generated first music audio file using the observation vector and a database trained using only observation vectors containing IMFCC1, IMFCC2 and ICA1 audio features for each respective target music audio file. - View Dependent Claims (6, 7)
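Elements (a)(2) and (a)(3) both rely on an ICA process: IMFCC2 replaces the logarithm and DCT steps with ICA applied to the mel filter-bank output, and ICA1 applies ICA after pre-emphasis and windowing. The claims do not name a particular ICA algorithm; the sketch below uses symmetric FastICA with a tanh nonlinearity, a common choice, to learn an unmixing matrix that could then be applied to filter-bank energies or windowed frames:

```python
import numpy as np

def fastica_unmixing(X, n_components, n_iter=200, seed=0):
    """Minimal symmetric FastICA (tanh nonlinearity), numpy only.
    X: (n_features, n_samples) data matrix. Returns an
    (n_components, n_features) unmixing matrix U such that the rows of
    U @ (X - mean) are approximately statistically independent."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=1, keepdims=True)            # center
    d, E = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])    # sample covariance
    idx = np.argsort(d)[::-1][:n_components]
    K = (E[:, idx] / np.sqrt(d[idx])).T               # whitening matrix
    Z = K @ Xc                                        # whitened data
    W = rng.standard_normal((n_components, n_components))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)                            # nonlinearity g
        # FastICA fixed-point update: E[g(WZ)Z^T] - diag(E[g'(WZ)]) W
        W = G @ Z.T / Z.shape[1] - np.diag((1.0 - G ** 2).mean(axis=1)) @ W
        u, _, vt = np.linalg.svd(W)                   # symmetric decorrelation:
        W = u @ vt                                    # W <- (W W^T)^(-1/2) W
    return W @ K                                      # combined unmixing matrix
```

In practice a library implementation (e.g. scikit-learn's FastICA) would be used; this self-contained version only illustrates the transform the claim refers to.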
8. A method of identifying, among a plurality of music audio files in digital format generated by machine, a first one of the music audio files, the method employing a segment of audio data which is derived from the first music audio file and comprising the steps of:
(a) inputting the segment of audio data generated by the machine into three different extraction processes, the three different extraction processes including (1) an IMFCC1 (first improved mel frequency cepstrum coefficients) extraction process, the IMFCC1 extraction process performing a conventional MFCC (mel frequency cepstrum coefficients) algorithm but not performing a logarithmic step of the conventional MFCC algorithm, wherein IMFCC1 audio features are output, (2) an IMFCC2 (second improved mel frequency cepstrum coefficients) extraction process, the IMFCC2 extraction process performing the conventional MFCC algorithm but not performing both the logarithmic step and a discrete cosine transform step of the conventional MFCC algorithm and instead performing an ICA (independent component analysis) process, wherein IMFCC2 audio features are output, and (3) an ICA1 (improved independent component analysis) extraction process performing a conventional ICA (independent component analysis) process but subjecting the segment of audio data to pre-emphasis preprocessing and windowing preprocessing, wherein ICA1 audio features are output; (b) creating an observation vector containing the IMFCC1, IMFCC2 and ICA1 audio features; (c) recognizing the machine generated first music audio file using the observation vector;
wherein step (c) is performed by determining, within a database containing HMM models trained using only observation vectors containing IMFCC1, IMFCC2 and ICA1 audio features for each respective target music audio file, the HMM model for which probability of the observation vector being obtained given the target music audio file is maximal. - View Dependent Claims (9)
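The recognition step of claim 8 selects, among per-song HMMs, the model for which the probability of the observation sequence is maximal. For discrete observations this is the forward algorithm followed by an argmax; a log-domain sketch (the toy models and names below are hypothetical, not taken from the patent):

```python
import numpy as np
from scipy.special import logsumexp

def log_likelihood(obs, pi, A, B):
    """Forward algorithm in the log domain: log P(obs | HMM(pi, A, B)).
    obs: sequence of discrete symbol indices; pi: initial state
    distribution; A: state transition matrix; B: emission matrix."""
    alpha = np.log(pi) + np.log(B[:, obs[0]])
    for o in obs[1:]:
        alpha = logsumexp(alpha[:, None] + np.log(A), axis=0) + np.log(B[:, o])
    return logsumexp(alpha)

def recognize(obs, models):
    """Step (c): return the name of the HMM whose likelihood of the
    observation sequence is maximal."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))
```

In the claimed system the observations would be (quantized) observation vectors of IMFCC1, IMFCC2 and ICA1 features, and `models` the trained per-song database.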
10. An apparatus for identifying, among a plurality of music audio files in digital format generated by machine, a first one of the music audio files, based on a segment of audio data which is derived from the first music audio file, the apparatus comprising:
(a) an input unit inputting the segment of audio data generated by the machine into three different extraction processes, the three different extraction processes including (1) an IMFCC1 (first improved mel frequency cepstrum coefficients) extraction process, the IMFCC1 extraction process performing a conventional MFCC (mel frequency cepstrum coefficients) algorithm but not performing a logarithmic step of the conventional MFCC algorithm, wherein IMFCC1 audio features are output, (2) an IMFCC2 (second improved mel frequency cepstrum coefficients) extraction process, the IMFCC2 extraction process performing the conventional MFCC algorithm but not performing both the logarithmic step and a discrete cosine transform step of the conventional MFCC algorithm and instead performing an ICA (independent component analysis) process, wherein IMFCC2 audio features are output, and (3) an ICA1 (improved independent component analysis) extraction process performing a conventional ICA (independent component analysis) process but subjecting the segment of audio data to pre-emphasis preprocessing and windowing preprocessing, wherein ICA1 audio features are output; (b) a creation unit creating an observation vector containing the IMFCC1, IMFCC2 and ICA1 audio features output by the three different extraction processes respectively; and (c) a recognition unit recognizing the machine generated first music audio file using the observation vector and a database trained using only observation vectors containing IMFCC1, IMFCC2 and ICA1 audio features for each respective target music audio file; wherein the audio features comprise features obtained by analyzing the audio data, or a transformed version of the audio data, to derive a transform based on its audio features, and applying the transform to the audio data, or the transformed version of the audio data respectively, to obtain amplitudes of the audio features. - View Dependent Claims (11, 12)
13. An apparatus for identifying, among a plurality of music audio files in digital format generated by machine, a first one of the music audio files, based on a segment of audio data which is derived from the first music audio file, the apparatus comprising:
(a) an input unit inputting the segment of audio data generated by the machine into three different extraction processes, the three different extraction processes including (1) an IMFCC1 (first improved mel frequency cepstrum coefficients) extraction process, the IMFCC1 extraction process performing a conventional MFCC (mel frequency cepstrum coefficients) algorithm but not performing a logarithmic step of the conventional MFCC algorithm, wherein IMFCC1 audio features are output, (2) an IMFCC2 (second improved mel frequency cepstrum coefficients) extraction process, the IMFCC2 extraction process performing the conventional MFCC algorithm but not performing both the logarithmic step and a discrete cosine transform step of the conventional MFCC algorithm and instead performing an ICA (independent component analysis) process, wherein IMFCC2 audio features are output, and (3) an ICA1 (improved independent component analysis) extraction process performing a conventional ICA (independent component analysis) process but subjecting the segment of audio data to pre-emphasis preprocessing and windowing preprocessing, wherein ICA1 audio features are output; (b) a creation unit creating an observation vector containing the IMFCC1, IMFCC2 and ICA1 audio features output by the three different extraction processes respectively; and (c) a recognition unit recognizing the machine generated first music audio file using the observation vector and a database trained using only observation vectors containing IMFCC1, IMFCC2 and ICA1 audio features for each respective target music audio file. - View Dependent Claims (14)
15. An apparatus for identifying, among a plurality of music audio files in digital format generated by machine, a first one of the music audio files, based on a segment of audio data which is derived from the first music audio file, the apparatus comprising:
(a) an input unit inputting the segment of audio data generated by the machine into three different extraction processes, the three different extraction processes including (1) an IMFCC1 (first improved mel frequency cepstrum coefficients) extraction process, the IMFCC1 extraction process performing a conventional MFCC (mel frequency cepstrum coefficients) algorithm but not performing a logarithmic step of the conventional MFCC algorithm, wherein IMFCC1 audio features are output, (2) an IMFCC2 (second improved mel frequency cepstrum coefficients) extraction process, the IMFCC2 extraction process performing the conventional MFCC algorithm but not performing both the logarithmic step and a discrete cosine transform step of the conventional MFCC algorithm and instead performing an ICA (independent component analysis) process, wherein IMFCC2 audio features are output, and (3) an ICA1 (improved independent component analysis) extraction process performing a conventional ICA (independent component analysis) process but subjecting the segment of audio data to pre-emphasis preprocessing and windowing preprocessing, wherein ICA1 audio features are output; (b) a creation unit creating an observation vector containing the IMFCC1, IMFCC2 and ICA1 audio features output by the three different extraction processes respectively; (c) a database containing HMM models trained using only observation vectors containing IMFCC1, IMFCC2 and ICA1 audio features for each respective target machine generated music audio file; and (d) a determination unit determining the HMM model in the database for which probability of the observation vector being obtained given the target music audio file is maximal. - View Dependent Claims (16)
17. A method of identifying, among a plurality of audio files in digital format generated by machine, a first one of the audio files, the method employing a segment of audio data which is derived from the first audio file and comprising the steps of:
(a) inputting the segment of audio data generated by the machine into different extraction processes, at least one of the different extraction processes including an IMFCC (improved mel frequency cepstrum coefficients) extraction process, the IMFCC extraction process performing a conventional MFCC (mel frequency cepstrum coefficients) algorithm but not performing both a logarithmic step of the conventional MFCC algorithm and a discrete cosine transform step of the conventional MFCC algorithm and instead performing an ICA (independent component analysis) process, wherein IMFCC audio features are output; (b) creating an observation vector containing at least the IMFCC audio features extracted from the segment of audio data; and (c) recognizing the machine generated first audio file using the observation vector;
wherein the audio features comprise features obtained by analyzing the audio data, or a transformed version of the audio data, to derive a transform based on its audio features, and applying the transform to the audio data, or the transformed version of the audio data respectively, to obtain amplitudes of the audio features.
18. An apparatus for identifying, among a plurality of audio files in digital format generated by machine, a first one of the audio files, based on a segment of audio data which is derived from the first audio file, the apparatus comprising:
(a) an input unit inputting the segment of audio data generated by the machine into different extraction processes, at least one of the different extraction processes including an IMFCC (improved mel frequency cepstrum coefficients) extraction process, the IMFCC extraction process performing a conventional MFCC (mel frequency cepstrum coefficients) algorithm but not performing both a logarithmic step of the conventional MFCC algorithm and a discrete cosine transform step of the conventional MFCC algorithm and instead performing an ICA (independent component analysis) process, wherein IMFCC audio features are output; (b) a creation unit creating an observation vector containing at least the IMFCC audio features extracted from the segment of audio data; and (c) a recognition unit recognizing the machine generated first audio file using the observation vector; wherein the audio features comprise features obtained by analyzing the audio data, or a transformed version of the audio data, to derive a transform based on its audio features, and applying the transform to the audio data, or the transformed version of the audio data respectively, to obtain amplitudes of the audio features.
Specification