What is the difference between the way Essentia and Librosa generate MFCCs?

I have been working on a music genre classification project for some time now and from the literature, I figured that MFCCs are the best features to start with. Though there are various libraries that implement the feature extraction, my focus has been on librosa and essentia.

Disclaimer:
This piece does not aim to answer the question; it aims to shed more light on why the question is being asked and to invite responses.

MFCC

MFCC stands for Mel-Frequency Cepstral Coefficient, a fundamental audio feature. An MFCC extractor uses the mel scale to divide the frequency band into sub-bands, takes the logarithm of the energy in each sub-band, and then extracts the cepstral coefficients using the Discrete Cosine Transform (DCT). The mel scale is based on the way humans distinguish between frequencies, which makes it convenient for processing sounds.
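The last two steps of that pipeline (log compression followed by a DCT) can be sketched in a few lines of plain Python. This is only an illustration: the band energies are made-up values, the function names are hypothetical, and the DCT here is an unnormalized DCT-II, whereas real libraries apply their own scaling and log conventions.

```python
import math

def dct_ii(values):
    """Unnormalized DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    n_values = len(values)
    return [
        sum(v * math.cos(math.pi / n_values * (n + 0.5) * k)
            for n, v in enumerate(values))
        for k in range(n_values)
    ]

def cepstral_coefficients(band_energies, n_coeffs, floor=1e-10):
    """Log-compress mel band energies, then keep the first n_coeffs DCT terms."""
    # The floor avoids log(0) for silent bands; its value is an assumption.
    log_energies = [math.log(max(e, floor)) for e in band_energies]
    return dct_ii(log_energies)[:n_coeffs]

# Hypothetical energies for one frame of a 6-band mel filterbank.
frame_bands = [0.9, 0.5, 0.3, 0.2, 0.1, 0.05]
coeffs = cepstral_coefficients(frame_bands, n_coeffs=4)
```

With this unnormalized DCT, the first coefficient is simply the sum of the log energies, so it tracks overall loudness rather than spectral shape.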

It is a scale of pitches judged by listeners to be equally spaced from one another. Because of how humans perceive sound, the mel scale is non-linear: the spacing in hertz between equally spaced mel pitches increases with frequency.
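To make the non-linearity concrete, here is a small sketch using one commonly quoted mel formula, m = 2595 * log10(1 + f / 700). It is only one of several variants in use, so a given library may compute slightly different values; the function names and the 0 to 8000 Hz range are chosen for illustration.

```python
import math

def hz_to_mel(freq_hz):
    """Convert hertz to mels with the 2595 * log10(1 + f / 700) formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_spaced_frequencies(f_min, f_max, n_points):
    """Return n_points frequencies (Hz) that are equally spaced in mels."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_points - 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_points)]

# Ten points equally spaced on the mel scale between 0 Hz and 8 kHz.
centers = mel_spaced_frequencies(0.0, 8000.0, 10)

# Gap in hertz between neighbouring points; it grows with frequency.
gaps_hz = [b - a for a, b in zip(centers, centers[1:])]
```

Because the points are equidistant in mels, every entry of `gaps_hz` is larger than the one before it, which is the widening spacing described above.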

LIBROSA

librosa is a Python library for audio analysis and feature extraction. librosa.feature.mfcc simplifies the process of obtaining MFCCs by providing arguments to set the frame length, hop length, number of MFCCs and so on. Based on the arguments that are set, it returns a 2D array with one row per coefficient and one column per frame.
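The shape of that 2D array can be estimated before running anything. The sketch below assumes centred framing with an even frame length, in which case the number of frames is 1 + n_samples // hop_length; the helper name and the 30-second, 22050 Hz example are illustrative, not taken from the experiment described here.

```python
def expected_mfcc_shape(n_samples, hop_length, n_mfcc):
    """Estimate (n_mfcc, n_frames) for centred framing.

    Assumes the signal is padded so that frames are centred on multiples
    of hop_length, which gives 1 + n_samples // hop_length frames.
    """
    n_frames = 1 + n_samples // hop_length
    return (n_mfcc, n_frames)

# Hypothetical clip: 30 seconds at 22050 Hz, hop of 512 samples, 13 MFCCs.
shape = expected_mfcc_shape(n_samples=22050 * 30, hop_length=512, n_mfcc=13)
```

If two libraries pad or centre their frames differently, their frame counts can differ by one or two even with the same hop length, so the shapes are worth checking before comparing values.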

ESSENTIA

essentia is a full-featured workflow environment for extracting high- and low-level features, covering audio input, preprocessing and statistical analysis of the output. It is written in C++ with Python bindings and exports data in YAML or JSON format.

The essentia.standard.MFCC algorithm has a parameter to fix the number of coefficients, but it operates on whatever spectrum it is given, so calling it once on a whole file yields a single 1D array of coefficients. The library does, however, provide a FrameGenerator that takes a frame size and hop size, which makes it possible to produce frame-by-frame output similar to librosa's.
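The idea of turning a per-frame extractor into a 2D result can be sketched generically. The code below is not essentia's FrameGenerator (which has its own padding and centring rules); `frames`, `framewise` and `toy_features` are hypothetical helpers, and the toy extractor stands in for a real spectrum-plus-MFCC step.

```python
def frames(samples, frame_size, hop_size):
    """Yield fixed-length frames, zero-padding the frames that run past the end."""
    for start in range(0, len(samples), hop_size):
        frame = list(samples[start:start + frame_size])
        frame.extend([0.0] * (frame_size - len(frame)))
        yield frame

def framewise(samples, frame_size, hop_size, per_frame_fn):
    """Apply a 1D-returning extractor to every frame and stack the results."""
    return [per_frame_fn(f) for f in frames(samples, frame_size, hop_size)]

def toy_features(frame):
    """Stand-in extractor: returns [energy, peak] for one frame."""
    return [sum(x * x for x in frame), max(abs(x) for x in frame)]

signal = [0.0, 0.5, -0.5, 1.0, -1.0, 0.25, 0.0, 0.75, -0.25, 0.5]
matrix = framewise(signal, frame_size=4, hop_size=2, per_frame_fn=toy_features)
```

Here `matrix` has one row per frame and one column per feature; transposing it gives the coefficients-by-frames layout that librosa returns.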

Making Essentia's MFCCs like Librosa

I used FrameGenerator to set the remaining parameters, so that the hop length, number of frames and number of MFCCs matched those used with librosa. The sample rate and window type were also set to be the same for both libraries.
I then used both libraries to generate MFCCs of the same shape for 20 tracks. Two of these are visualized below.
[Figures: two rows of paired MFCC plots, each panel captioned "MFCC of Song"]

My observation was that even with this modification, essentia was still about 2 times faster than librosa (speed was the primary metric I wanted to compare). However, I also noticed something else: the MFCCs did not look the same.
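A speed comparison of this kind can be made repeatable with a small timing helper. This is a generic sketch, not the measurement code used for the figure above; `best_time` and `speedup` are hypothetical names, and the two lambdas are placeholder workloads standing in for the two extraction calls.

```python
import time

def best_time(fn, repeats=5):
    """Return the shortest wall-clock time (seconds) over several runs of fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

def speedup(slow_fn, fast_fn, repeats=5):
    """Ratio of best run times; values above 1 mean fast_fn was quicker."""
    # The tiny lower bound guards against a zero reading from the timer.
    return best_time(slow_fn, repeats) / max(best_time(fast_fn, repeats), 1e-12)

# Placeholder workloads; swap in the two MFCC extraction calls to compare them.
ratio = speedup(lambda: sum(range(200_000)), lambda: sum(range(1_000)))
```

Taking the minimum over several runs reduces the influence of one-off delays such as file caching or library warm-up.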

How different are the MFCCs from Librosa and Essentia?

Upon seeing the visual difference between them, I computed the cosine similarity between the two MFCCs of each track to quantify it. For the two tracks displayed, the similarities were:

  • Africa Yako: 0.9019551277160645
  • So To Where: 0.9127510786056519

Generally, the similarities ranged between 0.90 and 0.94.
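For reference, one way to compute such a score is to flatten both MFCC matrices and take the cosine of the angle between the resulting vectors. This is a minimal sketch under that assumption; the exact procedure behind the numbers above may differ (for example, averaging per-frame similarities), and the matrices here are tiny made-up examples.

```python
import math

def cosine_similarity(matrix_a, matrix_b):
    """Cosine similarity between two equally shaped 2D lists, flattened."""
    flat_a = [v for row in matrix_a for v in row]
    flat_b = [v for row in matrix_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("matrices must contain the same number of values")
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(y * y for y in flat_b))
    return dot / (norm_a * norm_b)

# Made-up 2x2 "MFCC" matrices: the second is the first scaled by 2.
mfcc_a = [[1.0, 2.0], [3.0, 4.0]]
mfcc_b = [[2.0, 4.0], [6.0, 8.0]]
same_direction = cosine_similarity(mfcc_a, mfcc_b)

# Matrices with no overlapping non-zero entries.
orthogonal = cosine_similarity([[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]])
```

Note that cosine similarity ignores overall scale, so a constant gain factor between the two outputs would not lower the score on its own; values around 0.9 therefore point to a difference in shape, not just in magnitude.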

If you know the reason for this difference between the MFCCs or perhaps can identify a parameter that I am not considering, please do not hesitate to drop a comment. Thanks.
