Detect duplicate MP3 files with different bitrates and/or different ID3 tags?

99封情书 submitted on 2019-11-27 07:20:17

This is the exact same problem that people at the old AudioScrobbler, and currently at MusicBrainz, have been working on for a long time. For the time being, the Python project that can aid in your quest is Picard, which tags audio files (not only MPEG-1 Layer 3 files) with a GUID (actually, several of them). From then on, matching the tags is quite simple.

If you prefer to do it as a project of your own, libofa might be of help.

As the others said, simple checksums won't detect duplicates with different bitrates or ID3 tags. What you need is an audio fingerprint algorithm. The Python Audioprocessing Suite has such an algorithm, but I can't say anything about how reliable it is.

http://rudd-o.com/new-projects/python-audioprocessing

For tag issues, Picard may indeed be a very good bet. If, having identified two potentially duplicate files, what you want is to extract bitrate information from them, have a look at mp3guessenc.

I don't think simple checksums will ever work:

  1. ID3 tags will affect the md5
  2. Different encoders will encode the same song in different ways, so the checksums will be different
  3. Different bit-rates will produce different checksums
  4. Re-encoding an mp3 to a different bit-rate will probably sound terrible and will certainly be different to the original audio compressed in one step.
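
Point 1 can be worked around for the narrow case where two files contain the same encoded audio and differ only in their tags. The following is a minimal sketch, not a full ID3 parser: it assumes a single ID3v2 tag at the start of the file and an optional 128-byte ID3v1 tag at the end, and the helper names strip_id3 and audio_md5 are made up for illustration. It does nothing for points 2 to 4.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return the file contents without a leading ID3v2 tag or trailing ID3v1 tag."""
    start, end = 0, len(data)

    # ID3v2: 10-byte header "ID3", version (2 bytes), flags (1 byte),
    # then the tag size as a 4-byte syncsafe integer (7 bits per byte).
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)
        start = 10 + size
        if data[5] & 0x10:  # footer-present flag adds another 10 bytes
            start += 10

    # ID3v1: fixed 128-byte block at the end of the file, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128

    return data[start:end]


def audio_md5(data: bytes) -> str:
    """MD5 of the audio frames only, ignoring ID3 tags."""
    return hashlib.md5(strip_id3(data)).hexdigest()
```

Two files with identical audio_md5 values are almost certainly the same encode with different tags. Different values tell you nothing, since a different encoder or bitrate changes every byte.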

I think you'll have to compare ID3 tags, song length, and filenames.
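
For the filename part, a rough sketch using only the standard library might look like this. The normalization rules (dropping a leading track number, bracketed notes and punctuation) are assumptions about how files tend to be named, and normalize_name and name_similarity are hypothetical helpers, not part of any tool mentioned here.

```python
import re
from difflib import SequenceMatcher
from pathlib import PurePath


def normalize_name(filename: str) -> str:
    """Reduce a filename to a lowercase, punctuation-free title guess."""
    stem = PurePath(filename).stem.lower()
    # Drop a leading track number such as "01 - " or "3. ".
    # Caveat: this also eats titles that really start with a number.
    stem = re.sub(r"^\d+\s*[-_. ]\s*", "", stem)
    # Drop bracketed notes such as "(128kbps)" or "[remaster]".
    stem = re.sub(r"[\(\[][^)\]]*[\)\]]", " ", stem)
    # Collapse everything that is not a letter or digit into single spaces.
    stem = re.sub(r"[^a-z0-9]+", " ", stem)
    return stem.strip()


def name_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized filenames."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
```

A high ratio only marks a pair as a candidate. It should be combined with the tag and length comparison before anything is treated as a duplicate.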

Re-encoding at the same bit rate won't work. In fact, it may make things worse: transcoding (that is what re-encoding at a different bitrate is called) changes the nature of the compression, and recompressing an already compressed file is going to lead to a significantly different file.

This is a little out of my league, but I would approach the problem by looking at the wave pattern of the MP3, either by converting the MP3 to an uncompressed .wav or maybe by just running the analysis on the MP3 file itself. There should be a library out there for this. Just a word of warning: this is an expensive operation.
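
One way to picture the wave-pattern idea, once both files have been decoded to PCM samples by some other tool, is to compare coarse loudness envelopes instead of raw samples. This is only a sketch under strong assumptions: both inputs are plain sequences of numbers at the same sample rate, and they start at the same point in the song. Real encoders add different amounts of padding, so a practical version would also have to search over time offsets. The function names are made up for illustration.

```python
import math


def rms_envelope(samples, window=1024):
    """RMS loudness of each consecutive, non-overlapping window of samples."""
    envelope = []
    for i in range(0, len(samples) - window + 1, window):
        chunk = samples[i:i + window]
        envelope.append(math.sqrt(sum(s * s for s in chunk) / window))
    return envelope


def envelope_correlation(a, b, window=1024):
    """Pearson correlation of two RMS envelopes; near 1.0 means a similar shape."""
    env_a, env_b = rms_envelope(a, window), rms_envelope(b, window)
    n = min(len(env_a), len(env_b))
    if n < 2:
        return 0.0
    env_a, env_b = env_a[:n], env_b[:n]
    mean_a, mean_b = sum(env_a) / n, sum(env_b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(env_a, env_b))
    var_a = sum((x - mean_a) ** 2 for x in env_a)
    var_b = sum((y - mean_b) ** 2 for y in env_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat envelope carries no shape information
    return cov / math.sqrt(var_a * var_b)
```

Because the correlation is scale-invariant, a quieter copy of the same recording still scores close to 1.0, which sidesteps differences in overall volume.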

Another idea: use ReplayGain to scan the files. If they are the same song, they should be tagged with the same gain. This will only work on the exact same song from the exact same album. I know of several cases where reissues are remastered at a higher volume, thus changing the ReplayGain value.

EDIT:
You might want to check out http://www.speech.kth.se/snack/, which apparently can do spectrogram visualization. I imagine any library that can visualize spectrograms can help you compare them.

This link from the official python page may also be helpful.

The Dejavu project is written in Python and does exactly what you are looking for.

https://github.com/worldveil/dejavu

It also supports many common formats (.wav, .mp3, etc) as well as finding the time offset of the clip in the original audio track.

I'm looking for something similar and I found this:
http://www.lastfm.es/user/nova77LF/journal/2007/10/12/4kaf_fingerprint_(command_line)_client

Hope it helps.

splicer

I'd use length as my primary heuristic. That's what iTunes does when it's trying to identify a CD using the Gracenote database. Measure the lengths in milliseconds rather than seconds. Remember, this is only a heuristic: you should definitely listen to any detected duplicates before deleting them.
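
As a sketch of the length heuristic: assuming the durations have already been read in milliseconds by some tag or decoding library (not shown here), candidate duplicates can be found by sorting and chaining together tracks whose lengths are close. The 500 ms tolerance and the function name are arbitrary choices for illustration.

```python
def group_by_duration(tracks, tolerance_ms=500):
    """Group (path, duration_ms) pairs whose sorted durations are within tolerance.

    Returns a list of path lists; tracks with no close neighbour are omitted.
    """
    ordered = sorted(tracks, key=lambda t: t[1])
    groups, current = [], []
    for path, duration in ordered:
        # Start a new group when the gap to the previous track is too large.
        if current and duration - current[-1][1] > tolerance_ms:
            if len(current) > 1:
                groups.append([p for p, _ in current])
            current = []
        current.append((path, duration))
    if len(current) > 1:
        groups.append([p for p, _ in current])
    return groups


candidates = group_by_duration([
    ("a.mp3", 215_040),
    ("b.mp3", 215_300),
    ("c.mp3", 180_000),
    ("d.mp3", 301_000),
])
```

Here only a.mp3 and b.mp3 end up grouped together. Each group is a list of candidates to listen to, not a list of files to delete.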

You can use AcoustID, the successor to the PUID fingerprints that MusicBrainz used to rely on:

AcoustID is an open source project that aims to create a free database of audio fingerprints with mapping to the MusicBrainz metadata database and provide a web service for audio file identification using this database...

...fingerprints along with some metadata necessary to identify the songs to the AcoustID database...

You will find various client libraries and examples for the webservice at https://acoustid.org/
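
If you would rather compare two local files directly than query the web service, one common approach is to compare their raw fingerprints bit by bit. AcoustID fingerprints are produced by the Chromaprint library, whose fpcalc tool can print a raw fingerprint as a list of 32-bit integers. The sketch below assumes you already have two such lists and that they are aligned at the start. It is not the matching algorithm the AcoustID server uses, and any cutoff for "same song" would need to be tuned on your own collection.

```python
def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits between two raw fingerprints (lists of 32-bit ints).

    Only the overlapping prefix is compared; 0.0 means identical, and
    unrelated audio tends towards roughly 0.5.
    """
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        raise ValueError("both fingerprints must be non-empty")
    # XOR leaves a 1 wherever the two values disagree; mask to 32 bits so
    # that signed input values are handled too.
    differing = sum(
        bin((x ^ y) & 0xFFFFFFFF).count("1") for x, y in zip(fp_a, fp_b)
    )
    return differing / (32 * n)
```

A low error rate suggests the same recording even when the bitrate or encoder differs, which is exactly the case plain checksums cannot handle.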
