AUTOMATIC SYSTEM FOR TEMPORAL ALIGNMENT OF MUSIC AUDIO SIGNAL WITH LYRICS
First Claim
1. An automatic system for temporal alignment between music audio signal and lyrics, comprising:
- dominant sound audio signal extraction means for extracting, from a music audio signal of music including vocals and accompaniment sounds, a dominant sound audio signal of the most dominant sound including the vocal at each time,
- vocal-section feature extraction means for extracting a vocal-section feature available to estimate a vocal section which includes the vocal and a non-vocal section which does not include the vocal, from the dominant sound audio signal at each time,
- vocal section estimation means for estimating the vocal section and the non-vocal section, based on a plurality of the vocal-section features and outputting information on the vocal section and the non-vocal section,
- temporal-alignment feature extraction means for extracting a temporal-alignment feature suitable to make temporal alignment between lyrics of the vocal and the music audio signal, from the dominant sound audio signal at each time,
- phoneme network storage means for storing a phoneme network constituted from a plurality of phonemes and short pauses in respect of lyrics in music corresponding to the music audio signal, and
- alignment means for performing an alignment operation that makes temporal alignment between the plurality of phonemes in the phoneme network and the dominant sound audio signals, the alignment means being provided with a phone model for singing voice that estimates a phoneme corresponding to the temporal-alignment feature, based on the temporal-alignment feature,
wherein the alignment means receives the temporal-alignment feature outputted from the temporal-alignment feature extraction means, the information on the vocal section and the non-vocal section, and the phoneme network, and performs the alignment operation on condition that no phoneme exists at least in the non-vocal section.
Abstract
An automatic system for temporal alignment between a music audio signal and lyrics is provided. The system can prevent the accuracy of temporal alignment from being degraded by non-vocal sections. The alignment means of the system is provided with a phone model for singing voice that estimates phonemes corresponding to temporal-alignment features, that is, features available for temporal alignment. The alignment means receives the temporal-alignment features output by the temporal-alignment feature extraction means, the information on the vocal and non-vocal sections output by the vocal section estimation means, and a phoneme network, and performs the alignment operation on condition that no phoneme exists at least in the non-vocal sections.
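
The condition that no phoneme exists in non-vocal sections can be illustrated with a small forced-alignment sketch. This is not the patented implementation: it assumes a strictly linear phoneme network, a hypothetical label `sp` for the short pause, per-frame log-scores already produced by some phone model, and a path that starts at the first unit and ends at the last. The only point it shows is how the vocal/non-vocal information can forbid phonemes in non-vocal frames during a Viterbi search.

```python
NEG_INF = float("-inf")
SHORT_PAUSE = "sp"  # hypothetical label for the short-pause unit


def align(units, log_scores, is_vocal):
    """Left-to-right Viterbi alignment of network units to frames.

    units:      labels of a linear phoneme network, e.g. ["sp", "a", "sp"]
    log_scores: log_scores[t][j] is the (assumed) phone-model log-score
                of units[j] at frame t
    is_vocal:   is_vocal[t] is True when frame t lies in a vocal section
    Returns one unit index per frame.
    """
    n_frames, n_units = len(log_scores), len(units)

    def emit(t, j):
        # The constraint: a non-vocal frame may only hold a short pause.
        if not is_vocal[t] and units[j] != SHORT_PAUSE:
            return NEG_INF
        return log_scores[t][j]

    best = [[NEG_INF] * n_units for _ in range(n_frames)]
    back = [[0] * n_units for _ in range(n_frames)]
    best[0][0] = emit(0, 0)  # assumption: the path starts at the first unit
    for t in range(1, n_frames):
        for j in range(n_units):
            stay = best[t - 1][j]
            move = best[t - 1][j - 1] if j > 0 else NEG_INF
            prev, src = (stay, j) if stay >= move else (move, j - 1)
            best[t][j] = prev + emit(t, j)
            back[t][j] = src
    if best[-1][-1] == NEG_INF:
        raise ValueError("no alignment satisfies the vocal-section constraint")

    # Trace back from the last unit at the last frame.
    path = [n_units - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


units = ["sp", "a", "sp"]
# A phone model that (wrongly) prefers the vowel in every frame.
scores = [[-2.0, -1.0, -2.0] for _ in range(5)]
print(align(units, scores, [True] * 5))                        # [0, 1, 1, 1, 2]
print(align(units, scores, [False, True, True, False, False]))  # [0, 1, 1, 2, 2]
```

In the second call the fourth frame is marked non-vocal, so the vowel can no longer extend into it even though the phone model scores it highest there.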
60 Citations
13 Claims
1. An automatic system for temporal alignment between music audio signal and lyrics, comprising:
- dominant sound audio signal extraction means for extracting, from a music audio signal of music including vocals and accompaniment sounds, a dominant sound audio signal of the most dominant sound including the vocal at each time,
- vocal-section feature extraction means for extracting a vocal-section feature available to estimate a vocal section which includes the vocal and a non-vocal section which does not include the vocal, from the dominant sound audio signal at each time,
- vocal section estimation means for estimating the vocal section and the non-vocal section, based on a plurality of the vocal-section features and outputting information on the vocal section and the non-vocal section,
- temporal-alignment feature extraction means for extracting a temporal-alignment feature suitable to make temporal alignment between lyrics of the vocal and the music audio signal, from the dominant sound audio signal at each time,
- phoneme network storage means for storing a phoneme network constituted from a plurality of phonemes and short pauses in respect of lyrics in music corresponding to the music audio signal, and
- alignment means for performing an alignment operation that makes temporal alignment between the plurality of phonemes in the phoneme network and the dominant sound audio signals, the alignment means being provided with a phone model for singing voice that estimates a phoneme corresponding to the temporal-alignment feature, based on the temporal-alignment feature,
wherein the alignment means receives the temporal-alignment feature outputted from the temporal-alignment feature extraction means, the information on the vocal section and the non-vocal section, and the phoneme network, and performs the alignment operation on condition that no phoneme exists at least in the non-vocal section.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10)
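
As a rough illustration of the phoneme network named in claim 1, the sketch below builds a linear sequence of phonemes and short pauses from a line of lyrics. The claim only states that the network is constituted from phonemes and short pauses; placing a pause at the start, at the end, and at every word boundary is an assumption made here, as are the label `sp`, the function name, and the toy pronunciation dictionary.

```python
SP = "sp"  # hypothetical short-pause label

# Toy pronunciation dictionary (hypothetical). A real system would need a full
# lexicon or grapheme-to-phoneme conversion for the language of the lyrics.
PRONUNCIATIONS = {
    "la": ["l", "a"],
    "di": ["d", "i"],
}


def build_phoneme_network(lyrics, pronunciations):
    """Return a linear network: pause, word phonemes, pause, word phonemes, ..."""
    network = [SP]
    for word in lyrics.lower().split():
        network.extend(pronunciations[word])  # KeyError for unknown words
        network.append(SP)                    # assumed pause at each boundary
    return network


print(build_phoneme_network("La di la", PRONUNCIATIONS))
# ['sp', 'l', 'a', 'sp', 'd', 'i', 'sp', 'l', 'a', 'sp']
```

A linear list is the simplest possible network; it is enough to feed a left-to-right alignment, but it cannot represent alternative pronunciations or optional pauses.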

11. A method of automatically making temporal alignment between music audio signal and lyrics, comprising the steps of:
- extracting a dominant sound audio signal of the most dominant sound from a music audio signal of music at each time with dominant sound audio signal extraction means, wherein the most dominant sound includes a vocal from the music, the music including vocals and accompaniment sounds,
- extracting a vocal-section feature available to estimate a vocal section and a non-vocal section from the dominant sound audio signal at each time with vocal-section feature extraction means, wherein the vocal section includes the vocal and the non-vocal section does not include the vocal,
- estimating the vocal section and the non-vocal section and outputting information on the vocal section and the non-vocal section with vocal section estimation means, wherein the vocal and non-vocal sections are estimated based on a plurality of the vocal-section features,
- extracting a temporal-alignment feature suitable to make temporal alignment between lyrics of the vocal and the music audio signal, from the dominant sound audio signal at each time, with temporal-alignment feature extraction means,
- storing a phoneme network with phoneme network storage means, the phoneme network being constituted from a plurality of phonemes and short pauses in respect of lyrics in music corresponding to the music audio signal, and
- performing an alignment operation, which makes the temporal alignment between the plurality of phonemes in the phoneme network and the dominant sound audio signals, with alignment means,
wherein the alignment means is provided with a phone model for singing voice that estimates a phoneme corresponding to the temporal-alignment feature, based on the temporal-alignment feature, and the alignment means receives the temporal-alignment feature obtained in the step of extracting the temporal-alignment feature, the information on the vocal section and the non-vocal section, and the phoneme network, and then performs the alignment operation on condition that no phoneme exists at least in the non-vocal section.
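
The estimating step of claim 11 does not say how the vocal and non-vocal sections are derived from the vocal-section features. One plausible approach, sketched below purely as an assumption, is a two-state Viterbi decode over per-frame log-likelihoods from a vocal model and a non-vocal model, with a penalty for switching state so that isolated frames do not fragment a section. The function name, its inputs, and `switch_penalty` are all hypothetical.

```python
def estimate_vocal_sections(ll_vocal, ll_nonvocal, switch_penalty=2.0):
    """Return one boolean per frame (True = vocal).

    ll_vocal / ll_nonvocal: assumed per-frame log-likelihoods of the
    vocal-section feature under a vocal and a non-vocal model.
    switch_penalty: cost subtracted whenever the state changes.
    """
    n = len(ll_vocal)
    if n == 0:
        return []
    emis = list(zip(ll_nonvocal, ll_vocal))  # state 0 = non-vocal, 1 = vocal
    score = [emis[0][0], emis[0][1]]
    back = []  # back[t - 1][s] = best predecessor of state s at frame t
    for t in range(1, n):
        new_score, ptr = [0.0, 0.0], [0, 0]
        for s in (0, 1):
            stay = score[s]
            switch = score[1 - s] - switch_penalty
            if stay >= switch:
                new_score[s], ptr[s] = stay + emis[t][s], s
            else:
                new_score[s], ptr[s] = switch + emis[t][s], 1 - s
        score = new_score
        back.append(ptr)

    # Trace back from the better final state.
    state = 1 if score[1] > score[0] else 0
    states = [state]
    for ptr in reversed(back):
        state = ptr[state]
        states.append(state)
    return [s == 1 for s in reversed(states)]


ll_vocal = [-5, -1, -1, -3, -1, -5, -5]
ll_nonvocal = [-1, -5, -5, -2, -5, -1, -1]
print(estimate_vocal_sections(ll_vocal, ll_nonvocal))
# [False, True, True, True, True, False, False]
```

The fourth frame slightly favours the non-vocal model, but leaving and re-entering the vocal state would cost more than it gains, so the frame stays inside the vocal section. Setting `switch_penalty=0.0` reduces the decode to a per-frame comparison.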

12. A computer program for temporal alignment between music audio signal and lyrics, causing a computer to implement:
- dominant sound audio signal extraction means for extracting, from a music audio signal of music including vocals and accompaniment sounds, a dominant sound audio signal of the most dominant sound including the vocal at each time,
- vocal-section feature extraction means for extracting a vocal-section feature available to estimate a vocal section which includes the vocal and a non-vocal section which does not include the vocal, from the dominant sound audio signal at each time,
- vocal section estimation means for estimating the vocal section and the non-vocal section, based on a plurality of the vocal-section features and outputting information on the vocal section and the non-vocal section,
- temporal-alignment feature extraction means for extracting a temporal-alignment feature suitable to make temporal alignment between lyrics of the vocal and the music audio signal from the dominant sound audio signal at each time,
- phoneme network storage means for storing a phoneme network constituted from a plurality of phonemes and short pauses in respect of lyrics in music corresponding to the music audio signal, and
- alignment means for performing an alignment operation that makes the temporal alignment between the plurality of phonemes in the phoneme network and the dominant sound audio signals, the alignment means being provided with a phone model for singing voice that estimates a phoneme corresponding to the temporal-alignment feature, based on the temporal-alignment feature,
wherein the alignment means receives the temporal-alignment feature outputted from the temporal-alignment feature extraction means, the information on the vocal section and the non-vocal section, and the phoneme network, and performs the alignment operation on condition that no phoneme exists at least in the non-vocal section.
(Dependent claim: 13)
Specification