MODELING OF THE LATENT EMBEDDING OF MUSIC USING DEEP NEURAL NETWORK
First Claim
1. A method of estimating song features, the method comprising:
an audio receiver receiving a first training audio file;
generating, with one or more processors, a first waveform associated with the first training audio file;
generating, with the one or more processors, one or more frequency transformations from the first waveform;
generating, with the one or more processors, a hyper-image from the one or more frequency transformations;
processing, with a convolutional neural network, the hyper-image;
estimating, with the one or more processors, an error in an output of the convolutional neural network;
optimizing, with the one or more processors, one or more weights associated with the convolutional neural network based on the estimated error; and
using the convolutional neural network to estimate a feature of a testing audio file.
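The claimed steps (waveform → frequency transformations → hyper-image → CNN → error estimate → weight update) can be sketched end to end. This is a minimal illustration, not the patent's implementation: the multi-timescale spectrogram stack standing in for the "hyper-image", the pooled linear head standing in for a full convolutional network, and the training loop are all simplified, NumPy-only stand-ins.

```python
import numpy as np

def stft_mag(wave, win, hop):
    """Magnitude spectrogram from framed, Hann-windowed real FFTs."""
    frames = [wave[i:i + win] * np.hanning(win)
              for i in range(0, len(wave) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def hyper_image(wave, windows=(256, 512, 1024), frames=32, bins=64):
    """Stack spectrograms at several timescales into one multi-channel image."""
    chans = []
    for win in windows:
        s = stft_mag(wave, win, win // 2)
        t = np.linspace(0, s.shape[0] - 1, frames).astype(int)  # time grid
        f = np.linspace(0, s.shape[1] - 1, bins).astype(int)    # freq grid
        chans.append(np.log1p(s[np.ix_(t, f)]))                 # log-compress
    return np.stack(chans)                                      # (3, 32, 64)

rng = np.random.default_rng(0)
wave = rng.standard_normal(22050)      # stand-in for a decoded training file
x = hyper_image(wave)

# Toy "network": per-channel global pooling feeding a linear head, trained by
# gradient descent on the squared error against a single supervision tag.
feat = x.mean(axis=(1, 2))
w = rng.standard_normal(feat.shape[0]) * 0.1
target = 1.0                           # e.g. a binary genre tag
for _ in range(200):
    err = feat @ w - target            # estimate the error in the output
    w -= 0.01 * err * feat             # optimize weights from that error
```

After training, `hyper_image` plus the fitted head would be applied unchanged to a testing audio file to estimate the same feature, which is the final step of the claim.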
2 Assignments
0 Petitions
Abstract
Methods and systems are provided for detecting and cataloging qualities in music. As both the volume and heterogeneity of digital music content grow, it has become increasingly important to build recommendation and search systems that surface this content to users and consumers. Embodiments use a deep convolutional neural network to imitate how the human brain processes hierarchical structures in auditory signals, such as music and speech, at various timescales. This approach can be used to discover latent factor models of music based on acoustic hyper-images extracted from the raw audio waveforms. The resulting latent embeddings can be used as features for subsequent models such as collaborative filtering, to build similarity metrics between songs, or to classify music against training labels such as genre, mood, and sentiment.
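The abstract's last use case, a similarity metric between songs, reduces to comparing the songs' latent embedding vectors. A minimal sketch with cosine similarity, assuming 128-dimensional embeddings (the dimensionality and the random vectors here are illustrative, not from the patent):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two latent song embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
song_a = rng.standard_normal(128)                  # hypothetical embedding
song_b = song_a + 0.1 * rng.standard_normal(128)   # acoustically close song
song_c = rng.standard_normal(128)                  # unrelated song

# A close song scores near 1; an unrelated one scores near 0.
print(cosine_similarity(song_a, song_b) > cosine_similarity(song_a, song_c))
```

The same embeddings could instead be fed as item features into a collaborative-filtering model, the other downstream use the abstract names.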
47 Citations
20 Claims
1. A method of estimating song features, the method comprising:
an audio receiver receiving a first training audio file;
generating, with one or more processors, a first waveform associated with the first training audio file;
generating, with the one or more processors, one or more frequency transformations from the first waveform;
generating, with the one or more processors, a hyper-image from the one or more frequency transformations;
processing, with a convolutional neural network, the hyper-image;
estimating, with the one or more processors, an error in an output of the convolutional neural network;
optimizing, with the one or more processors, one or more weights associated with the convolutional neural network based on the estimated error; and
using the convolutional neural network to estimate a feature of a testing audio file.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
8. A system, comprising:
a processor; and
a computer-readable storage medium storing computer-readable instructions, which when executed by the processor, cause the processor to perform:
generating a first waveform associated with a first training audio file;
generating one or more frequency transformations from the first waveform;
generating a hyper-image from the one or more frequency transformations;
processing, with a convolutional neural network, the hyper-image;
estimating an error in an output of the convolutional neural network;
optimizing one or more weights associated with the convolutional neural network based on the estimated error; and
using the convolutional neural network to estimate a feature of a testing audio file.
- View Dependent Claims (9, 13, 14)
10. The system, wherein the one or more weights associated with the convolutional neural network are further optimized with a second training audio file.
11. The system, wherein the error is estimated based on one or more tags associated with the first training audio file.
15. A computer program product, comprising:
a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising:
computer readable program code configured, when executed by a processor, to:
generate a first waveform associated with a first training audio file;
generate one or more frequency transformations from the first waveform;
generate a hyper-image from the one or more frequency transformations;
process, with a convolutional neural network, the hyper-image;
estimate an error in an output of the convolutional neural network;
optimize one or more weights associated with the convolutional neural network based on the estimated error; and
use the convolutional neural network to estimate a feature of a testing audio file.
- View Dependent Claims (16, 17, 18, 19, 20)
Specification