Finding duplicate files and removing them

谎友^ 2020-11-27 09:26

I am writing a Python program to find and remove duplicate files from a folder.

I have multiple copies of mp3 files, and some other files. I am using the sha1 algorithm.
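For reference, hashing a single file with SHA-1 via Python's hashlib looks roughly like the sketch below (the helper name and chunk size are just illustrative); reading in chunks avoids loading whole mp3 files into memory:

    import hashlib

    def sha1_of_file(path, chunk_size=65536):
        """Illustrative helper: SHA-1 hex digest of one file, read in chunks."""
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            # iter(callable, sentinel) keeps reading until f.read() returns b''
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        return h.hexdigest()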

8 Answers
  •  再見小時候
    2020-11-27 09:58

    Recursive folders version:

    This version uses the file size and a hash of the contents to find duplicates. You can pass it multiple paths; it will scan them all recursively and report every duplicate found.

    import sys
    import os
    import hashlib
    
    def chunk_reader(fobj, chunk_size=1024):
        """Generator that reads a file in chunks of bytes."""
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                return
            yield chunk
    
    def check_for_duplicates(paths, hash_algo=hashlib.sha1):
        hashes = {}
        for path in paths:
            for dirpath, dirnames, filenames in os.walk(path):
                for filename in filenames:
                    full_path = os.path.join(dirpath, filename)
                    hashobj = hash_algo()
                    # Hash the contents in chunks so large files don't
                    # have to fit in memory; the with-block ensures the
                    # file handle is closed after hashing.
                    with open(full_path, 'rb') as fobj:
                        for chunk in chunk_reader(fobj):
                            hashobj.update(chunk)
                    # Key on (digest, size): two files only count as
                    # duplicates if both the hash and the size match.
                    file_id = (hashobj.digest(), os.path.getsize(full_path))
                    duplicate = hashes.get(file_id)
                    if duplicate:
                        print("Duplicate found: %s and %s" % (full_path, duplicate))
                    else:
                        hashes[file_id] = full_path
    
    if __name__ == '__main__':
        if sys.argv[1:]:
            check_for_duplicates(sys.argv[1:])
        else:
            print("Please pass the paths to check as parameters to the script")
    
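    If you save this as, say, find_dupes.py (the filename is just for illustration), you can run it against one or more folders:

        python find_dupes.py /path/to/music /path/to/backup

    The script only prints the matching pairs rather than deleting anything; keying each entry on the (digest, size) pair also means a hash collision alone can't produce a false positive unless the file sizes happen to match as well, so reviewing the printed list before removing files is a reasonable workflow.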
