I am writing a Python program to find and remove duplicate files from a folder.
I have multiple copies of mp3 files, and some other files. I am using the SHA-1 algorithm to compare file contents.
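For reference, here is a minimal sketch of the hashing step I have in mind (the function name and chunk size are illustrative); it reads the file in chunks so large mp3 files are not loaded into memory all at once:

import hashlib

def file_sha1(path, chunk_size=65536):
    # Read in chunks so even large files don't have to fit in memory.
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()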
To be safe (removing files automatically can be dangerous if something goes wrong!), here is what I use, based on @zalew's answer.
Please also note that the md5 sum code is slightly different from @zalew's, because his code reported too many false duplicates (which is why I said removing them automatically is dangerous!).
import hashlib, os

unique = dict()
for filename in os.listdir('.'):
    if os.path.isfile(filename):
        # Hash the whole file in binary mode so the comparison is exact.
        with open(filename, 'rb') as f:
            filehash = hashlib.md5(f.read()).hexdigest()
        if filehash not in unique:
            unique[filehash] = filename
        else:
            print(filename + ' is a duplicate of ' + unique[filehash])
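If you want to go one step further and actually delete the duplicates, one possible extension (not part of the code above; the duplicates list and the confirmation prompt are my own additions) is to collect the duplicates first and remove each file only after explicit confirmation:

import hashlib, os

unique = dict()
duplicates = []
for filename in os.listdir('.'):
    if os.path.isfile(filename):
        with open(filename, 'rb') as f:
            filehash = hashlib.md5(f.read()).hexdigest()
        if filehash not in unique:
            unique[filehash] = filename
        else:
            duplicates.append(filename)

# Nothing is deleted until each file is confirmed by hand.
for filename in duplicates:
    if input('delete %s? [y/N] ' % filename).lower() == 'y':
        os.remove(filename)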
I wrote one in Python some time ago -- you're welcome to use it.
import sys
import os
import hashlib

# check_path: hash one file and report it if that hash was already seen.
check_path = (lambda filepath, hashes, p=sys.stdout.write:
    (lambda hash=hashlib.sha1(open(filepath, 'rb').read()).hexdigest():
        ((hash in hashes) and p('DUPLICATE FILE\n'
                                '   %s\n'
                                'of %s\n' % (filepath, hashes[hash]))
         or hashes.setdefault(hash, filepath)))())

# scan: walk the directory tree and check every file it contains.
scan = (lambda dirpath, hashes={}:
    [check_path(os.path.join(root, filename), hashes)
     for root, dirs, files in os.walk(dirpath)
     for filename in files])

(len(sys.argv) > 1) and scan(sys.argv[1])
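If you save it as, say, find_dups.py (the name is just an example), run it as python find_dups.py /path/to/folder; it prints each duplicate along with the first file seen with the same SHA-1 content hash.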