This is actually not a trivial task. I do not think any off-the-shelf library can do it. Here is a possible approach:
1. Decode the MP3 to PCM.
2. Make sure all PCM data has the same sample rate, which you choose beforehand (e.g. 16 kHz). You'll need to resample songs that have a different sample rate. A high sample rate is not required since you need a fuzzy comparison anyway, but too low a sample rate will lose too much detail.
3. Normalize the PCM data: find the sample with the largest absolute value and rescale all samples so that it uses the entire dynamic range of the data format. E.g. if the sample format is signed 16-bit, then after normalization the maximum-amplitude sample should have the value 32767 or -32767.
4. Split the audio data into frames of a fixed number of samples (e.g. 1000 samples per frame).
5. Convert each frame to the frequency domain (FFT).
6. Calculate the correlation between the sequences of frames representing the two songs. If the correlation is greater than a certain threshold, assume the songs are the same.
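The normalization, framing, FFT and correlation steps above could be sketched in NumPy roughly as follows. This is only a sketch under some assumptions: both inputs are already decoded, mono, and resampled to the same rate; the function names (`normalize`, `frame_spectra`, `spectral_correlation`, `same_song`) are made up for illustration; "correlation" is interpreted here as a Pearson correlation over the flattened magnitude spectra; and the 0.9 threshold is an arbitrary placeholder you would have to tune on real data.

```python
import numpy as np


def normalize(pcm):
    """Rescale so the largest-magnitude sample becomes 1.0.

    Floats are used here as a stand-in for "full scale" (32767 for int16).
    """
    pcm = np.asarray(pcm, dtype=np.float64)
    peak = np.max(np.abs(pcm)) if pcm.size else 0.0
    return pcm / peak if peak > 0 else pcm


def frame_spectra(pcm, frame_size=1000):
    """Split into non-overlapping frames; return each frame's magnitude spectrum."""
    n_frames = len(pcm) // frame_size
    # Drop the incomplete trailing frame, then reshape to (n_frames, frame_size).
    frames = pcm[:n_frames * frame_size].reshape(n_frames, frame_size)
    return np.abs(np.fft.rfft(frames, axis=1))


def spectral_correlation(pcm_a, pcm_b, frame_size=1000):
    """Pearson correlation between the two frame-by-frame magnitude spectra."""
    spec_a = frame_spectra(normalize(pcm_a), frame_size)
    spec_b = frame_spectra(normalize(pcm_b), frame_size)
    n = min(len(spec_a), len(spec_b))  # compare only the overlapping frames
    if n == 0:
        return 0.0
    a = spec_a[:n].ravel()
    b = spec_b[:n].ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def same_song(pcm_a, pcm_b, threshold=0.9):
    """Assumed decision rule: the threshold is a placeholder, not a tuned value."""
    return spectral_correlation(pcm_a, pcm_b) >= threshold
```

Using only the magnitude of the FFT discards phase, which makes the comparison less sensitive to small timing shifts inside a frame. It still assumes the two recordings start at roughly the same point.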
Python libraries:
- PyMedia (for step 1)
- NumPy (for data processing) -- also see this article for some introductory info
An additional complication: your songs may have different lengths of silence at the beginning. So to avoid false negatives, you may need an additional step:
3.1. Scan the PCM data from the beginning until the sound energy exceeds a predefined threshold (e.g. calculate the RMS with a sliding window of 10 samples and stop when it exceeds 1% of the dynamic range). Then discard all data before this point.
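A minimal sketch of this trimming step, assuming signed 16-bit sample values (full scale 32767) held in a NumPy array. The function name `trim_leading_silence` and its parameters are hypothetical; the defaults just mirror the 10-sample window and 1% threshold mentioned above.

```python
import numpy as np


def trim_leading_silence(pcm, window=10, threshold=0.01, full_scale=32767.0):
    """Drop everything before the first window whose RMS exceeds the threshold."""
    pcm = np.asarray(pcm, dtype=np.float64)
    if pcm.size < window:
        return pcm
    # Sliding-window mean of the squared samples; rms[i] covers pcm[i:i+window].
    mean_sq = np.convolve(pcm ** 2, np.ones(window) / window, mode="valid")
    rms = np.sqrt(mean_sq)
    loud = np.nonzero(rms > threshold * full_scale)[0]
    if loud.size == 0:
        return pcm[:0]  # the whole signal is below the threshold
    return pcm[loud[0]:]
```

Because the cut is made at the start of the first loud window, up to `window - 1` quiet samples may remain in front of the actual onset, which should not matter for a fuzzy comparison.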