What is the difference between the way Essentia and Librosa generate MFCCs?
I have been working on a music genre classification project for some time now, and from the literature I gathered that MFCCs are the best features to start with. Although various libraries implement this feature extraction, my focus has been on librosa and essentia.
Disclaimer:
This piece does not aim to answer the question, but merely to shed more light on why it is being asked and to invite responses.
MFCC stands for Mel-Frequency Cepstral Coefficients, a fundamental audio feature. The MFCC computation uses the mel scale to divide the frequency band into sub-bands and then extracts the cepstral coefficients by applying the Discrete Cosine Transform (DCT) to the logarithm of the sub-band energies. The mel scale is based on the way humans distinguish between frequencies, which makes it very convenient for processing sounds.
It is a scale of pitches judged by listeners to be equally spaced from one another. Because of how humans perceive sound, the mel scale is non-linear: the distance in hertz between equally spaced pitches increases with frequency.
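To make the non-linearity concrete, here is a minimal sketch of a mel conversion. It assumes the HTK-style formula, which is only one of several conventions in use; the function names are hypothetical and are not taken from either library.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert hertz to mels with the HTK-style formula (an assumed convention)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_bands: int) -> list:
    """Return n_bands + 2 edge frequencies (Hz) equally spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_bands + 2)]


edges = mel_band_edges(0.0, 8000.0, 10)
# Width in Hz of each interval between consecutive edges.
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

The intervals in `widths` are equal in mels but grow in hertz as frequency rises, which is the behaviour described above.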
librosa is a Python library for audio feature extraction and processing. librosa.feature.mfcc is a function that simplifies the process of obtaining MFCCs by providing arguments that control the number of frames, hop length, number of MFCCs and so on. Based on the arguments that are set, a 2D array of shape (number of MFCCs, number of frames) is returned.
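A call along these lines might look like the sketch below. The parameter values (13 coefficients, a 2048-sample FFT, a hop of 512, a Hann window) are assumptions for illustration, not the settings used in this comparison, and `librosa_mfcc` and `expected_frame_count` are hypothetical helper names.

```python
def expected_frame_count(n_samples: int, hop_length: int) -> int:
    """Frames produced when the signal is centre-padded (librosa's default, center=True)."""
    return 1 + n_samples // hop_length


def librosa_mfcc(path, sr=22050, n_mfcc=13, n_fft=2048, hop_length=512):
    """Load a track and return its MFCC matrix of shape (n_mfcc, n_frames)."""
    # Imported lazily so the helper above can be used without librosa installed.
    import librosa

    y, sr = librosa.load(path, sr=sr, mono=True)
    # n_fft, hop_length and window are forwarded to the mel spectrogram step.
    return librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=n_fft, hop_length=hop_length, window="hann",
    )
```

With these assumed values, a 30-second clip at 22050 Hz should give `expected_frame_count(22050 * 30, 512)` columns in the returned array.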
essentia is a full-featured workflow environment for high- and low-level features, facilitating audio input, preprocessing and statistical analysis of the output. It is written in C++ with Python bindings and exports data in YAML or JSON format.
The essentia.standard.MFCC function has a parameter to fix the number of coefficients, but when applied to the whole file it processes it in one go and returns a 1D array of coefficients. The library, however, also has a FrameGenerator method that takes other parameters, which can make it yield results similar to librosa's.
I used FrameGenerator to set other parameters, such as the hop length, number of frames and number of MFCCs, to match those used with librosa. I also set the sample rate and window type to be the same for both libraries.
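A frame-by-frame pipeline of this kind could be sketched as follows. This is not the exact code used for the comparison: the frame size, hop size, window type and the helper names `essentia_mfcc` and `spectrum_size` are assumptions, and every MFCC parameter not listed is left at its Essentia default.

```python
def spectrum_size(frame_size: int) -> int:
    """Number of bins in the one-sided spectrum of a frame of frame_size samples."""
    return frame_size // 2 + 1


def essentia_mfcc(path, sr=22050, n_mfcc=13, frame_size=2048, hop_size=512):
    """Return an MFCC matrix of shape (n_mfcc, n_frames) computed frame by frame."""
    # Imported lazily so spectrum_size() can be used without essentia installed.
    import numpy as np
    import essentia.standard as es

    audio = es.MonoLoader(filename=path, sampleRate=sr)()
    window = es.Windowing(type="hann")
    spectrum = es.Spectrum(size=frame_size)
    mfcc = es.MFCC(
        numberCoefficients=n_mfcc,
        sampleRate=sr,
        inputSize=spectrum_size(frame_size),
    )

    frames = []
    for frame in es.FrameGenerator(audio, frameSize=frame_size, hopSize=hop_size):
        # MFCC returns the mel band energies and the coefficients for one frame.
        _bands, coeffs = mfcc(spectrum(window(frame)))
        frames.append(coeffs)

    # Stack to (n_frames, n_mfcc), then transpose to match librosa's layout.
    return np.array(frames).T
```

Transposing at the end gives the same (coefficients, frames) layout that librosa returns, which makes the two outputs directly comparable once the frame counts agree.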
I then used both functions to generate MFCCs of the same shape for 20 tracks. Two of these are visualized below.
My observation was that even with this modification, essentia was still about 2 times faster than librosa (speed was the primary metric I wanted to compare). However, I also noticed something else: the MFCCs did not look the same.
Upon seeing the visual difference between them, I computed the cosine similarity between the two MFCCs in order to quantify it. For the two tracks displayed, the similarities were:
- Africa Yako: 0.9019551277160645
- So To Where: 0.9127510786056519
Generally, the similarities ranged between 0.90 and 0.94.
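For reference, one way to compute such a score is sketched below. It assumes each MFCC matrix is flattened into a single vector before comparison; a per-frame similarity averaged over frames would be an equally plausible choice and may give different numbers. The function name is hypothetical.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equally shaped 2D MFCC matrices.

    Accepts nested lists or 2D arrays; both are flattened row by row.
    """
    flat_a = [float(x) for row in a for x in row]
    flat_b = [float(x) for row in b for x in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("MFCC matrices must have the same number of elements")

    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(y * y for y in flat_b))
    return dot / (norm_a * norm_b)
```

A value of 1.0 means the two matrices point in the same direction (they may still differ by a constant scale factor), so scores around 0.90 indicate a difference in shape rather than only in scaling.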
If you know the reason for this difference between the MFCCs or perhaps can identify a parameter that I am not considering, please do not hesitate to drop a comment. Thanks.