Question
I'm trying to write a Python script that finds duplicate mp3/4 files by comparing the songs' audio data. I have many mp3/4 files with similar file names but different ID3 tags. At first I tried looping through the files and using md5 to find duplicates (ignoring file names). This, of course, didn't work when the ID3 tags didn't match.
As a result, I'm looking for a way to extract only the music data from an mp3/4 in order to run it through md5 and find any duplicates. What is the best way to go about this?
Answer 1:
Try using id3-py or mutagen to strip out all the tags (both ID3v1 and ID3v2; they can both be present in the same file), then compute the MD5 of the result.
Assuming iTunes didn't manipulate the files beyond their tags, the remaining audio data should be identical. Transcoding would obviously invalidate this approach, since re-encoded audio no longer matches byte for byte.
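To make the idea concrete, here is a rough sketch that uses only the standard library instead of id3-py or mutagen (so it is a swapped-in approach, not either library's API). It assumes plain MP3 files with an ID3v2 tag at the start and/or an ID3v1 tag at the end; it does not handle APE tags, appended ID3v2 tags, or MP4 containers. The names `strip_id3` and `audio_md5` are made up for this example.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag and a trailing ID3v1 tag."""
    start, end = 0, len(data)

    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4-byte synchsafe size.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)  # 7 significant bits per byte
        start = 10 + size
        if data[5] & 0x10:  # footer flag (ID3v2.4) adds another 10 bytes
            start += 10

    # ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128

    return data[start:end]


def audio_md5(path: str) -> str:
    """MD5 of the file contents with ID3v1/ID3v2 tags removed (assumed layout)."""
    with open(path, "rb") as f:
        return hashlib.md5(strip_id3(f.read())).hexdigest()
```

Files whose `audio_md5` values match can then be grouped as duplicates, for example in a dict keyed by the digest. This reads each file fully into memory, which is usually fine for individual songs.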
Answer 2:
Use an audio fingerprinting algorithm. You might know about MusicBrainz; they list several fingerprinting algorithms. They use AcoustId now, which is probably what you should use as well (it's good and it's free). The Chromaprint library can generate such a fingerprint.
I wrote a Python module, ffmpeg, which does the decoding via FFmpeg and provides a simple function to calculate the AcoustId fingerprint (using Chromaprint). There is also a small demo for it (which even queries MusicBrainz for the song).
It should be easy to build some tool using that to find all duplicates.
The fingerprint will be exactly the same if the audio data is exactly the same, and it will be similar if the audio data is similar. See the AcoustId homepage for further information on how to calculate the similarity if you don't just want to check for equality.
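As a rough illustration of what "similar" can mean here, the sketch below compares two raw fingerprints by counting differing bits. It assumes each fingerprint is a list of 32-bit integers (the raw form Chromaprint can produce) and that both start at the same point in the audio; it makes no attempt at offset alignment, so treat it as a simplification rather than the matching procedure AcoustId actually uses. The function name `fingerprint_similarity` is made up for this example.

```python
def fingerprint_similarity(fp_a, fp_b):
    """Fraction of matching bits over the overlapping part of two raw fingerprints.

    Assumes fp_a and fp_b are sequences of 32-bit integers. Returns a value
    between 0.0 (every bit differs) and 1.0 (identical over the overlap).
    """
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        return 0.0
    # XOR leaves a 1 wherever the two values disagree; mask to 32 bits so
    # signed inputs are handled too. zip() stops at the shorter sequence.
    differing = sum(
        bin((a ^ b) & 0xFFFFFFFF).count("1") for a, b in zip(fp_a, fp_b)
    )
    return 1.0 - differing / (32 * n)
```

Two files could then be treated as likely duplicates when the similarity exceeds some threshold chosen by experiment; the right cutoff depends on your collection.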
Answer 3:
That's actually pretty advanced, fuzzy logic-type stuff you're asking about.
This isn't an answer, but take a look at the discussion in this question: Detect duplicate MP3 files with different bitrates and/or different ID3 tags? (It might actually qualify as a dupe. It's even Python-specific.)
Source: https://stackoverflow.com/questions/3250696/access-mp3-music-data-using-python