AUTOMATIC SYSTEM FOR TEMPORAL ALIGNMENT OF MUSIC AUDIO SIGNAL WITH LYRICS

US 20080097754A1
Filed: 08/07/2007
Published: 04/24/2008
Est. Priority Date: 10/24/2006
Status: Active Grant



Abstract

An automatic system for temporal alignment between a music audio signal and lyrics is provided. The system prevents the accuracy of temporal alignment from being lowered by the influence of non-vocal sections. The alignment means of the system is provided with a phone model for singing voice that estimates phonemes corresponding to temporal-alignment features, that is, features available for temporal alignment. The alignment means receives the temporal-alignment features outputted from the temporal-alignment feature extraction means, information on the vocal and non-vocal sections outputted from the vocal section estimation means, and a phoneme network, and performs an alignment operation on condition that no phoneme exists at least in the non-vocal sections.
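
The phoneme network that the alignment means receives is described in the claims as a plurality of phonemes and short pauses derived from the lyrics. As a rough illustration only, the Python sketch below builds a linear phoneme sequence with a short pause between words. The lexicon entries, the `sp` label, and the function name `build_phoneme_network` are hypothetical and are not taken from the patent; a real system would use a full pronunciation lexicon and may use a richer network.

```python
# Hypothetical toy pronunciation table (an assumption for illustration only).
TOY_LEXICON = {
    "hey": ["h", "e", "y"],
    "la": ["l", "a"],
    "na": ["n", "a"],
}

SHORT_PAUSE = "sp"  # assumed label for the short-pause unit


def build_phoneme_network(lyrics, lexicon=TOY_LEXICON):
    """Return a linear phoneme sequence with one short pause between words."""
    network = []
    for word in lyrics.lower().split():
        if word not in lexicon:
            raise KeyError(f"no pronunciation for {word!r}")
        if network:
            # Separate consecutive words with a short pause.
            network.append(SHORT_PAUSE)
        network.extend(lexicon[word])
    return network


print(build_phoneme_network("hey la na"))
# -> ['h', 'e', 'y', 'sp', 'l', 'a', 'sp', 'n', 'a']
```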


13 Claims

1. An automatic system for temporal alignment between music audio signal and lyrics, comprising:
- dominant sound audio signal extraction means for extracting, from a music audio signal of music including vocals and accompaniment sounds, a dominant sound audio signal of the most dominant sound including the vocal at each time,
- vocal-section feature extraction means for extracting a vocal-section feature available to estimate a vocal section which includes the vocal and a non-vocal section which does not include the vocal, from the dominant sound audio signal at each time,
- vocal section estimation means for estimating the vocal section and the non-vocal section, based on a plurality of the vocal-section features and outputting information on the vocal section and the non-vocal section,
- temporal-alignment feature extraction means for extracting a temporal-alignment feature suitable to make temporal alignment between lyrics of the vocal and the music audio signal, from the dominant sound audio signal at each time,
- phoneme network storage means for storing a phoneme network constituted from a plurality of phonemes and short pauses in respect of lyrics in music corresponding to the music audio signal, and
- alignment means for performing an alignment operation that makes temporal alignment between the plurality of phonemes in the phoneme network and the dominant sound audio signals, the alignment means being provided with a phone model for singing voice that estimates a phoneme corresponding to the temporal-alignment feature, based on the temporal-alignment feature, wherein
- the alignment means receives the temporal-alignment feature outputted from the temporal-alignment feature extraction means, the information on the vocal section and the non-vocal section, and the phoneme network, and performs the alignment operation on condition that no phoneme exists at least in the non-vocal section.
  - 2. The automatic system for temporal alignment between music audio signal and lyrics according to claim 1, wherein
    - the vocal section estimation means includes Gaussian model storage means for storing a plurality of Gaussian mixture models for vocals and non-vocals obtained in advance by training based on a plurality of training musical datasets, and
    - the vocal section estimation means estimates the vocal section and the non-vocal section, based on the plurality of vocal-section features and the plurality of Gaussian mixture models.
  - 3. The automatic system for temporal alignment between music audio signal and lyrics according to claim 2, wherein the vocal section estimation means includes:
    - log likelihood calculation means for calculating a vocal log likelihood and a non-vocal log likelihood at the each time, based on the vocal-section feature and the Gaussian mixture model at the each time,
    - log likelihood difference calculation means for calculating a log likelihood difference between the vocal log likelihood and the non-vocal log likelihood at the each time,
    - histogram creation means for creating a histogram relating to a plurality of log likelihood differences obtained over the whole period of the music audio signal,
    - bias correction value determination means for defining a threshold to maximize between-class variance, and determining the threshold as a music-dependent bias correction value when the histogram is divided into two music-dependent classes, the music-dependent log likelihood differences in the vocal sections and those in the non-vocal sections,
    - estimation parameter determination means for determining an estimation parameter used in estimating a vocal section by adding a task-dependent value to the bias correction value in order to correct the bias correction value,
    - weighting means for weighting the vocal log likelihood and the non-vocal log likelihood at the each time using the estimation parameter, and
    - most likely route calculation means for defining the weighted vocal log likelihoods and the weighted non-vocal log likelihoods which are obtained over the whole period of the music audio signal as an output probability of a vocal state ($s_V$) and an output probability of a non-vocal state ($s_N$) in a Hidden Markov Model, respectively, calculating the most likely routes for the vocal state and the non-vocal state over the whole period of the music audio signal, and determining, based on the most likely routes, information on the vocal and non-vocal sections over the whole period of the music audio signal.
  - 4. The automatic system for temporal alignment between music audio signal and lyrics according to claim 3, wherein
    - the weighting means approximates an output probability of $\log p(x | s_V)$ for the vocal state ($s_V$) and an output probability of $\log p(x | s_N)$ for the non-vocal state ($s_N$) with the following equations:

      $\log p(x | s_V) = \log N_{GMM}(x; \theta_V) - \frac{1}{2} \eta$

      $\log p(x | s_N) = \log N_{GMM}(x; \theta_N) + \frac{1}{2} \eta$

      where $N_{GMM}(x; \theta_V)$ stands for the probability density function of Gaussian mixture model (GMM) for vocals, $N_{GMM}(x; \theta_N)$ for the probability density function of Gaussian mixture model (GMM) for non-vocals, $\theta_V$ and $\theta_N$ are parameters determined in advance by training based on the plurality of training musical datasets, and $\eta$ is the estimation parameter, and
    - the most likely route calculation means calculates the most likely route with the following equation:

      $\hat{S} = \underset{S}{\arg\max} \sum_{t} \left\{ \log p(x | s_t) + \log p(s_{t+1} | s_t) \right\}$

      where $p(x | s_t)$ stands for an output probability for a state $s_t$ and $p(s_{t+1} | s_t)$ for a transition probability from a state $s_t$ to a state $s_{t+1}$.
  - 5. The automatic system for temporal alignment between music audio signal and lyrics according to claim 1, wherein the alignment means performs an alignment operation using Viterbi alignment, and the alignment operation is performed on condition that no phoneme exists in the non-vocal section when Viterbi alignment is performed, at least the non-vocal section is defined as a short pause, and likelihoods for other phonemes in the short pause are set to zero.
  - 6. The automatic system for temporal alignment between music audio signal and lyrics according to claim 1, wherein the phone model for singing voice is a phone model that is obtained by re-estimating parameters of a phone model for speaking voice so as to recognize phonemes of the vocals in the music including vocals and accompaniment sounds.
  - 7. The automatic system for temporal alignment between music audio signal and lyrics according to claim 6, wherein the phone model for singing voice is a phone model for vocals without accompaniments that is obtained by re-estimating parameters of the phone model for speaking voice, using a music audio signal for adaptation to vocals without accompaniments and phoneme labels for adaptation corresponding to the music audio signal for adaptation, so as to recognize phonemes of the vocals from the music audio signal for adaptation.
  - 8. The automatic system for temporal alignment between music audio signal and lyrics according to claim 6, wherein the phone model is a phone model for segregated vocals that is obtained
    - by preparing a phone model for vocals without accompaniments obtained by re-estimating parameters of the phone model for speaking voice, using a music audio signal for adaptation to vocals without accompaniments and phoneme labels for adaptation corresponding to the music audio signal for adaptation, so as to recognize phonemes of the vocals from the music audio signal for adaptation, and
    - by re-estimating parameters of the phone model for vocals without accompaniments, using dominant sound music audio signals of the most dominant sounds including the vocals extracted from the music audio signal for adaptation including vocals as well as accompaniment sounds, and phoneme labels for adaptation corresponding to the dominant sound music audio signals, so as to recognize phonemes of the vocals from the dominant sound music audio signals.
  - 9. The automatic system for temporal alignment between music audio signal and lyrics according to claim 6, wherein the phone model is a phone model for a particular singer that is obtained
    - by preparing a phone model for vocals without accompaniments obtained by re-estimating parameters of the phone model for speaking voice, using a music audio signal for adaptation to vocals without accompaniments and phoneme labels for adaptation corresponding to the music audio signal for adaptation, so as to recognize phonemes of the vocals from the music audio signal for adaptation,
    - by re-estimating parameters of the phone model for vocals without accompaniments, using dominant sound music audio signals of the most dominant sounds including the vocals extracted from the music audio signal for adaptation including vocals as well as accompaniment sounds, and phoneme labels for adaptation corresponding to the dominant sound music audio signals, so as to recognize phonemes of the vocals from the dominant sound music audio signals, and
    - by estimating parameters of the phone model for segregated vocals, using the temporal-alignment features stored in the temporal-alignment feature extraction means and the phoneme network stored in the phoneme network storage means, so as to recognize phonemes of the vocals of a particular singer singing the music of the music audio signal inputted into the music audio signal extraction means.
  - 10. A music audio signal reproducing apparatus which reproduces a music audio signal while displaying on a screen lyrics temporally aligned with the music audio signal to be reproduced, using the system of claim 1 to display on the screen the lyrics temporally aligned with the music audio signal.
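
Claims 3 and 4 describe a chain that runs from per-frame log likelihoods to vocal and non-vocal sections: take the log likelihood differences, pick the histogram threshold that maximizes between-class variance, add a task-dependent value to obtain the estimation parameter η, weight the two log likelihoods by ∓η/2, and decode a two-state Hidden Markov Model. The Python sketch below illustrates that chain under stated assumptions and is not the patented implementation. The function names, the bin count, the self-transition probability `p_stay`, the uniform initial state, and the toy likelihood values are all hypothetical; in the claims, the log likelihoods come from trained Gaussian mixture models.

```python
import math


def otsu_threshold(values, n_bins=32):
    """Histogram threshold that maximizes between-class variance."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    total = len(values)
    total_sum = sum(c * m for c, m in zip(counts, centers))
    best_score, best_thr = -1.0, lo
    w0, sum0 = 0, 0.0
    for i in range(n_bins - 1):
        w0 += counts[i]
        sum0 += counts[i] * centers[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        # Proportional to the between-class variance of the two classes.
        score = w0 * w1 * (sum0 / w0 - (total_sum - sum0) / w1) ** 2
        if score > best_score:
            best_score, best_thr = score, lo + (i + 1) * width
    return best_thr


def estimate_sections(ll_vocal, ll_nonvocal, task_value=0.0, p_stay=0.9):
    """Label each frame 'V' or 'N' with a two-state Viterbi decode."""
    diffs = [lv - ln for lv, ln in zip(ll_vocal, ll_nonvocal)]
    eta = otsu_threshold(diffs) + task_value  # estimation parameter
    # Weighted output log probabilities: index 0 = vocal, 1 = non-vocal.
    emit = [(lv - 0.5 * eta, ln + 0.5 * eta)
            for lv, ln in zip(ll_vocal, ll_nonvocal)]
    log_stay, log_switch = math.log(p_stay), math.log(1.0 - p_stay)
    score = list(emit[0])  # uniform initial state (assumption)
    back = []
    for e in emit[1:]:
        new_score, ptr = [], []
        for s in (0, 1):
            cands = [score[r] + (log_stay if r == s else log_switch)
                     for r in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            ptr.append(best)
            new_score.append(cands[best] + e[s])
        score = new_score
        back.append(ptr)
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return ["V" if s == 0 else "N" for s in path]


# Toy values: every difference is >= 0, so a fixed zero threshold would be biased.
ll_vocal = [-10.0, -9.0, -10.0, -9.0, -4.0, -3.0, -4.0, -3.0]
ll_nonvocal = [-10.0] * 8
print(estimate_sections(ll_vocal, ll_nonvocal, task_value=1.5))
# -> ['N', 'N', 'N', 'N', 'V', 'V', 'V', 'V']
```

In this toy input the threshold lands just above the lower cluster of differences, and the hypothetical task-dependent value of 1.5 shifts η further toward the gap between the clusters.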

11. A method of automatically making temporal alignment between music audio signal and lyrics, comprising the steps of:
- extracting a dominant sound audio signal of the most dominant sound from a music audio signal of music at each time with dominant sound audio signal extraction means, wherein the most dominant sound includes a vocal from the music, the music including vocals and accompaniment sounds,
- extracting a vocal-section feature available to estimate a vocal section and a non-vocal section from the dominant sound audio signal at each time with vocal-section feature extraction means, wherein the vocal section includes the vocal and the non-vocal section does not include the vocal,
- estimating the vocal section and the non-vocal section and outputting information on the vocal section and the non-vocal section with vocal section estimation means, wherein the vocal and non-vocal sections are estimated based on a plurality of the vocal-section features,
- extracting a temporal-alignment feature suitable to make temporal alignment between lyrics of the vocal and the music audio signal, from the dominant sound audio signal at each time, with temporal-alignment feature extraction means,
- storing a phoneme network with phoneme network storage means, the phoneme network being constituted from a plurality of phonemes and short pauses in respect of lyrics in music corresponding to the music audio signal, and
- performing an alignment operation, which makes the temporal alignment between the plurality of phonemes in the phoneme network and the dominant sound audio signals, with alignment means, wherein the alignment means is provided with a phone model for singing voice that estimates a phoneme corresponding to the temporal-alignment feature, based on the temporal-alignment feature, and the alignment means receives the temporal-alignment feature obtained in the step of extracting the temporal-alignment feature, the information on the vocal section and the non-vocal section, and the phoneme network, and then performs the alignment operation on condition that no phoneme exists at least in the non-vocal section.
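
The final step, alignment on condition that no phoneme exists in the non-vocal section, is spelled out in claim 5 as Viterbi alignment in which non-vocal frames are treated as short pauses and the likelihoods of other phonemes there are set to zero. The Python sketch below illustrates that idea under simplifying assumptions: the phoneme network is a strictly linear sequence in which every unit must be visited, transitions carry no cost, and the per-frame log likelihoods are made-up toy numbers rather than outputs of a phone model for singing voice. The function name `align` and the `sp` label are hypothetical.

```python
NEG_INF = float("-inf")


def align(network, frame_loglik, is_vocal, pause="sp"):
    """Viterbi forced alignment of a linear phoneme sequence to frames.

    network      : list of labels, e.g. ["a", "sp", "i"]
    frame_loglik : per-frame dict mapping label -> log likelihood
    is_vocal     : per-frame bool from the vocal section estimation
    """
    n_states = len(network)

    def emit(t, j):
        label = network[j]
        if not is_vocal[t] and label != pause:
            # Likelihood zero (log = -inf) for phonemes in non-vocal frames.
            return NEG_INF
        return frame_loglik[t][label]

    score = [NEG_INF] * n_states
    score[0] = emit(0, 0)  # the path must start at the first unit
    back = []
    for t in range(1, len(frame_loglik)):
        new_score, ptr = [NEG_INF] * n_states, [0] * n_states
        for j in range(n_states):
            stay = score[j]
            advance = score[j - 1] if j > 0 else NEG_INF
            if advance > stay:
                best, ptr[j] = advance, j - 1
            else:
                best, ptr[j] = stay, j
            new_score[j] = best + emit(t, j)
        score = new_score
        back.append(ptr)
    if score[-1] == NEG_INF:
        raise ValueError("no alignment satisfies the constraints")
    j = n_states - 1  # the path must end at the last unit
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [network[j] for j in path]


network = ["a", "sp", "i"]
# Toy scores: frames 2 and 3 are accompaniment that happens to resemble "i".
loglik = [
    {"a": -1, "sp": -5, "i": -4},
    {"a": -1, "sp": -5, "i": -4},
    {"a": -4, "sp": -3, "i": -1},
    {"a": -4, "sp": -3, "i": -1},
    {"a": -4, "sp": -5, "i": -1},
    {"a": -4, "sp": -5, "i": -1},
]
vocal = [True, True, False, False, True, True]

print(align(network, loglik, vocal))       # ['a', 'a', 'sp', 'sp', 'i', 'i']
print(align(network, loglik, [True] * 6))  # ['a', 'a', 'sp', 'i', 'i', 'i']
```

With the constraint, the non-vocal frames are forced onto the short pause; without it, the toy accompaniment frame at index 3 is absorbed into "i", which is the kind of error the condition is meant to prevent.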

12. A computer program for temporal alignment between music audio signal and lyrics, causing a computer to implement:
- dominant sound audio signal extraction means for extracting, from a music audio signal of music including vocals and accompaniment sounds, a dominant sound audio signal of the most dominant sound including the vocal at each time,
- vocal-section feature extraction means for extracting a vocal-section feature available to estimate a vocal section which includes the vocal and a non-vocal section which does not include the vocal, from the dominant sound audio signal at each time,
- vocal section estimation means for estimating the vocal section and the non-vocal section, based on a plurality of the vocal-section features and outputting information on the vocal section and the non-vocal section,
- temporal-alignment feature extraction means for extracting a temporal-alignment feature suitable to make temporal alignment between lyrics of the vocal and the music audio signal from the dominant sound audio signal at each time,
- phoneme network storage means for storing a phoneme network constituted from a plurality of phonemes and short pauses in respect of lyrics in music corresponding to the music audio signal, and
- alignment means for performing an alignment operation that makes the temporal alignment between the plurality of phonemes in the phoneme network and the dominant sound audio signals, the alignment means being provided with a phone model for singing voice that estimates a phoneme corresponding to the temporal-alignment feature, based on the temporal-alignment feature, wherein
- the alignment means receives the temporal-alignment feature outputted from the temporal-alignment feature extraction means, the information on the vocal section and the non-vocal section, and the phoneme network, and performs the alignment operation on condition that no phoneme exists at least in the non-vocal section.
  - 13. A computer-readable recording medium recorded with the computer program of claim 12.


Current Assignee: National Institute of Advanced Industrial Science and Technology (Government of Japan)
Original Assignee: National Institute of Advanced Industrial Science and Technology (Government of Japan)
Inventors: Fujihara, Hiromasa; Goto, Masataka; Okuno, Hiroshi
Granted Patent: US 8,005,666 B2
US Class Current: 704/214
CPC Class Codes:
G10L 15/187 Phonemic context, e.g. pron...
G10L 15/26 Speech to text systems G10L...
