Question
Looking at the output of this code:
mfccs = librosa.feature.mfcc(y=librosa_audio, sr=librosa_sample_rate, n_mfcc=40)
print("MFCC Shape = ", mfccs.shape)
I get the output MFCC Shape = (40, 1876). What do these two numbers represent? I looked at the librosa website but still could not work out what these two values mean.
Any insights will be greatly appreciated!
Answer 1:
The first dimension (40) is the number of MFCC coefficients, and the second dimension (1876) is the number of time frames. The number of MFCCs is set by n_mfcc, and the number of time frames is roughly the length of the audio (in samples) divided by hop_length. With librosa's defaults (hop_length=512 and centered frames), the exact count is 1 + len(y) // hop_length.
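To make the frame-count arithmetic concrete, here is a minimal sketch in plain Python. The helper name expected_n_frames is hypothetical, not a librosa function, and the 960000-sample length is an assumption chosen to reproduce the 1876 frames in the question; the actual length of the asker's audio is not given.

```python
def expected_n_frames(n_samples, hop_length=512, center=True, n_fft=2048):
    """Hypothetical helper: number of STFT frames for a signal of n_samples.

    With center=True (librosa's default) the signal is padded by n_fft // 2
    on each side, so there is one frame per hop plus one.
    With center=False only frames that fit entirely in the signal are kept.
    """
    if center:
        return 1 + n_samples // hop_length
    return 1 + (n_samples - n_fft) // hop_length


# Assumed length: 960000 samples, about 43.5 s at a 22050 Hz sample rate.
n_samples = 960_000
print("frames (center=True): ", expected_n_frames(n_samples))
print("frames (center=False):", expected_n_frames(n_samples, center=False))
```

Passing a larger hop_length to librosa.feature.mfcc gives fewer frames (coarser time resolution), and a smaller one gives more frames.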
To understand the meaning of the MFCCs themselves, you should understand the steps it takes to compute them:
- The spectrogram, computed with the short-time Fourier transform (STFT)
- The mel spectrogram, from applying mel-scale filterbanks to the STFT spectrogram
- The mel-frequency cepstral coefficients, from applying the discrete cosine transform (DCT) to the log-scaled mel spectrogram.
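The three steps above can be sketched with NumPy alone. This is a simplified illustration, not librosa's implementation: it assumes the HTK-style mel formula, unnormalized triangular filters, and an unnormalized DCT-II, so the values will differ from librosa.feature.mfcc even though the output shape follows the same rule. The function names (mel_filterbank, mfcc_sketch) and the n_mels=128 setting are choices made for this example.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style mel formula (an assumption; librosa defaults to a different variant).
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are equally spaced on the mel scale.
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def mfcc_sketch(y, sr, n_mfcc=40, n_fft=2048, hop_length=512, n_mels=128):
    # Step 1: STFT power spectrogram with centered, Hann-windowed frames.
    y = np.pad(y, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(y) - n_fft) // hop_length
    window = np.hanning(n_fft)
    frames = np.stack(
        [y[t * hop_length : t * hop_length + n_fft] * window for t in range(n_frames)],
        axis=1,
    )
    power = np.abs(np.fft.rfft(frames, axis=0)) ** 2  # (n_fft // 2 + 1, n_frames)

    # Step 2: mel spectrogram, then log scaling (decibels).
    mel = mel_filterbank(sr, n_fft, n_mels) @ power  # (n_mels, n_frames)
    log_mel = 10.0 * np.log10(np.maximum(mel, 1e-10))

    # Step 3: DCT-II along the mel axis, keeping the first n_mfcc coefficients.
    n = np.arange(n_mels)
    k = np.arange(n_mfcc)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))  # (n_mfcc, n_mels)
    return dct_basis @ log_mel  # (n_mfcc, n_frames)


# One second of a 440 Hz tone at an assumed 22050 Hz sample rate.
sr = 22050
t = np.arange(sr) / sr
mfccs = mfcc_sketch(np.sin(2 * np.pi * 440.0 * t), sr)
print("MFCC shape:", mfccs.shape)
```

Each column of the result describes one time frame, and each row is one cepstral coefficient, which is the same (n_mfcc, n_frames) layout as the array in the question.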
A good written explainer is Haytham Fayek: Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What's In-Between and a good video explainer is The Sound of AI: Mel-Frequency Cepstral Coefficients Explained Easily.
Source: https://stackoverflow.com/questions/65206575/what-are-the-components-of-the-mel-mfcc