Finding duplicate files and removing them

谎友^ 2020-11-27 09:26

I am writing a Python program to find and remove duplicate files from a folder.

I have multiple copies of mp3 files, and some other files. I am using the sha1 algorithm.

8 Answers
  • 2020-11-27 09:57

    Fastest algorithm - 100x performance increase compared to the accepted answer (really :))

    The approaches in the other solutions are very cool, but they overlook an important property of duplicate files: they have the same file size. Calculating the expensive hash only on files with the same size saves a tremendous amount of CPU; performance comparisons are at the end, here's the explanation.

    Iterating on the solid answers given by @nosklo and borrowing the idea of @Raffi to have a fast hash of just the beginning of each file, and calculating the full one only on collisions in the fast hash, here are the steps:

    1. Build a hash table of the files, keyed by file size.
    2. For files with the same size, create a hash table with the hash of their first 1024 bytes; non-colliding elements are unique.
    3. For files with the same hash on the first 1k bytes, calculate the hash on the full contents - files with matching full hashes are NOT unique, i.e. they are duplicates.

    The code:

    #!/usr/bin/env python
    # if running in py3, change the shebang, drop the next import for readability (it does no harm in py3)
    from __future__ import print_function   # py2 compatibility
    from collections import defaultdict
    import hashlib
    import os
    import sys
    
    
    def chunk_reader(fobj, chunk_size=1024):
        """Generator that reads a file in chunks of bytes"""
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                return
            yield chunk
    
    
    def get_hash(filename, first_chunk_only=False, hash=hashlib.sha1):
        hashobj = hash()
        file_object = open(filename, 'rb')
    
        if first_chunk_only:
            hashobj.update(file_object.read(1024))
        else:
            for chunk in chunk_reader(file_object):
                hashobj.update(chunk)
        hashed = hashobj.digest()
    
        file_object.close()
        return hashed
    
    
    def check_for_duplicates(paths, hash=hashlib.sha1):
        hashes_by_size = defaultdict(list)  # dict of size_in_bytes: [full_path_to_file1, full_path_to_file2, ]
        hashes_on_1k = defaultdict(list)  # dict of (hash1k, size_in_bytes): [full_path_to_file1, full_path_to_file2, ]
        hashes_full = {}   # dict of full_file_hash: full_path_to_file_string
    
        for path in paths:
            for dirpath, dirnames, filenames in os.walk(path):
                # get all files that have the same size - they are the collision candidates
                for filename in filenames:
                    full_path = os.path.join(dirpath, filename)
                    try:
                        # if the target is a symlink (soft one), this will 
                        # dereference it - change the value to the actual target file
                        full_path = os.path.realpath(full_path)
                        file_size = os.path.getsize(full_path)
                        hashes_by_size[file_size].append(full_path)
                    except OSError:
                        # not accessible (permissions, etc.) - skip it
                        continue
    
    
        # For all files with the same file size, get their hash on the 1st 1024 bytes only
        for size_in_bytes, files in hashes_by_size.items():
            if len(files) < 2:
                continue    # this file size is unique, no need to spend CPU cycles on it
    
            for filename in files:
                try:
                    small_hash = get_hash(filename, first_chunk_only=True, hash=hash)
                    # the key is the hash on the first 1024 bytes plus the size - to
                    # avoid collisions on equal hashes in the first part of the file
                    # credits to @Futal for the optimization
                    hashes_on_1k[(small_hash, size_in_bytes)].append(filename)
                except OSError:
                    # the file may have been removed or become inaccessible in the meantime
                    continue
    
        # For all files with the hash on the 1st 1024 bytes, get their hash on the full file - collisions will be duplicates
        for __, files_list in hashes_on_1k.items():
            if len(files_list) < 2:
                continue    # this hash of the first 1k bytes is unique, no need to spend CPU cycles on it

            for filename in files_list:
                try:
                    full_hash = get_hash(filename, first_chunk_only=False, hash=hash)
                    duplicate = hashes_full.get(full_hash)
                    if duplicate:
                        print("Duplicate found: {} and {}".format(filename, duplicate))
                    else:
                        hashes_full[full_hash] = filename
                except OSError:
                    # the file may have been removed or become inaccessible in the meantime
                    continue
    
    
    if __name__ == "__main__":
        if sys.argv[1:]:
            check_for_duplicates(sys.argv[1:])
        else:
            print("Please pass the paths to check as parameters to the script")
    

    And, here's the fun part - performance comparisons.

    Baseline -

    • a directory with 1047 files: 32 mp4 and 1015 jpg, total size 5445.998 MiB - i.e. my phone's camera auto-upload directory :)
    • a small (but fully functional) processor - 1600 BogoMIPS, 1.2 GHz, 32 KB L1 + 256 KB L2 cache, /proc/cpuinfo:

    Processor       : Feroceon 88FR131 rev 1 (v5l)
    BogoMIPS        : 1599.07

    (i.e. my low-end NAS :), running Python 2.7.11.

    So, the output of @nosklo's very handy solution:

    root@NAS:InstantUpload# time ~/scripts/checkDuplicates.py 
    Duplicate found: ./IMG_20151231_143053 (2).jpg and ./IMG_20151231_143053.jpg
    Duplicate found: ./IMG_20151125_233019 (2).jpg and ./IMG_20151125_233019.jpg
    Duplicate found: ./IMG_20160204_150311.jpg and ./IMG_20160204_150311 (2).jpg
    Duplicate found: ./IMG_20160216_074620 (2).jpg and ./IMG_20160216_074620.jpg
    
    real    5m44.198s
    user    4m44.550s
    sys     0m33.530s
    

    And here's the version that filters on size first, then on small hashes, and finally computes the full hash only when collisions are found:

    root@NAS:InstantUpload# time ~/scripts/checkDuplicatesSmallHash.py . "/i-data/51608399/photo/Todor phone"
    Duplicate found: ./IMG_20160216_074620 (2).jpg and ./IMG_20160216_074620.jpg
    Duplicate found: ./IMG_20160204_150311.jpg and ./IMG_20160204_150311 (2).jpg
    Duplicate found: ./IMG_20151231_143053 (2).jpg and ./IMG_20151231_143053.jpg
    Duplicate found: ./IMG_20151125_233019 (2).jpg and ./IMG_20151125_233019.jpg
    
    real    0m1.398s
    user    0m1.200s
    sys     0m0.080s
    

    Both versions were run 3 times each, to get the average of the time needed.

    So v1 is (user+sys) 284s, the other one - 2s; quite a difference, huh :) With this speedup, one could go to SHA512, or even fancier - the performance penalty is mitigated by the fewer calculations needed.
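
    For example (a minimal sketch, assuming the code above is saved as a module named checkdup - the name is made up), switching to SHA-512 only means passing a different hashlib constructor, since get_hash and check_for_duplicates both take one:

    import hashlib
    # 'checkdup' is a hypothetical module name for the script above
    from checkdup import check_for_duplicates

    # same size- and first-1K-prefiltering, but all hashing done with SHA-512
    check_for_duplicates(["/path/to/photos"], hash=hashlib.sha512)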

    Negatives:

    • More disk access than the other versions - every file is stat'ed once for its size (that's cheap, but it is still disk I/O), and every potential duplicate is opened twice (once for the hash of the first 1 KB, and once for the full-contents hash)
    • Will consume more memory, since the hash tables are kept in memory at runtime
  • 2020-11-27 09:58

    Recursive folders version:

    This version uses the file size and a hash of the contents to find duplicates. You can pass it multiple paths; it will scan all paths recursively and report all duplicates found.

    import sys
    import os
    import hashlib
    
    def chunk_reader(fobj, chunk_size=1024):
        """Generator that reads a file in chunks of bytes"""
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                return
            yield chunk
    
    def check_for_duplicates(paths, hash=hashlib.sha1):
        hashes = {}
        for path in paths:
            for dirpath, dirnames, filenames in os.walk(path):
                for filename in filenames:
                    full_path = os.path.join(dirpath, filename)
                    hashobj = hash()
                    # hash the full file contents, chunk by chunk
                    with open(full_path, 'rb') as fobj:
                        for chunk in chunk_reader(fobj):
                            hashobj.update(chunk)
                    file_id = (hashobj.digest(), os.path.getsize(full_path))
                    duplicate = hashes.get(file_id, None)
                    if duplicate:
                        print("Duplicate found: %s and %s" % (full_path, duplicate))
                    else:
                        hashes[file_id] = full_path
    
    if __name__ == "__main__":
        if sys.argv[1:]:
            check_for_duplicates(sys.argv[1:])
        else:
            print("Please pass the paths to check as parameters to the script")
    
  • 2020-11-27 09:59
        import hashlib
        import os
        import sys

        def read_chunk(fobj, chunk_size=2048):
            """Files can be huge, so read them in chunks of bytes."""
            while True:
                chunk = fobj.read(chunk_size)
                if not chunk:
                    return
                yield chunk

        def remove_duplicates(dir, hashfun=hashlib.sha512):
            unique = set()
            for filename in os.listdir(dir):
                filepath = os.path.join(dir, filename)
                if os.path.isfile(filepath):
                    hashobj = hashfun()
                    with open(filepath, 'rb') as fobj:
                        for chunk in read_chunk(fobj):
                            hashobj.update(chunk)
                    # the digest size is constant for a given hash function
                    hashfile = hashobj.hexdigest()
                    if hashfile not in unique:
                        unique.add(hashfile)
                    else:
                        os.remove(filepath)

        try:
            hashfun = hashlib.sha256
            remove_duplicates(sys.argv[1], hashfun)
        except IndexError:
            print("Please pass a path to a directory with "
                  "duplicate files as a parameter to the script.")
    
  • 2020-11-27 10:00

    @IanLee1521 has a nice solution here. It is very efficient because it checks for duplicates based on the file size first.

    #! /usr/bin/env python
    
    # Originally taken from:
    # http://www.pythoncentral.io/finding-duplicate-files-with-python/
    # Original Author: Andres Torres
    
    # Adapted to only compute the md5sum of files with the same size
    
    import argparse
    import os
    import sys
    import hashlib
    
    
    def find_duplicates(folders):
        """
        Takes in an iterable of folders and prints & returns the duplicate files
        """
        dup_size = {}
        for i in folders:
            # Iterate the folders given
            if os.path.exists(i):
                # Find the duplicated files and append them to dup_size
                join_dicts(dup_size, find_duplicate_size(i))
            else:
                print('%s is not a valid path, please verify' % i)
                return {}
    
        print('Comparing files with the same size...')
        dups = {}
        for dup_list in dup_size.values():
            if len(dup_list) > 1:
                join_dicts(dups, find_duplicate_hash(dup_list))
        print_results(dups)
        return dups
    
    
    def find_duplicate_size(parent_dir):
        # Dups in format {hash:[names]}
        dups = {}
        for dirName, subdirs, fileList in os.walk(parent_dir):
            print('Scanning %s...' % dirName)
            for filename in fileList:
                # Get the path to the file
                path = os.path.join(dirName, filename)
                # Check to make sure the path is valid.
                if not os.path.exists(path):
                    continue
                # Calculate sizes
                file_size = os.path.getsize(path)
                # Add or append the file path
                if file_size in dups:
                    dups[file_size].append(path)
                else:
                    dups[file_size] = [path]
        return dups
    
    
    def find_duplicate_hash(file_list):
        print('Comparing: ')
        for filename in file_list:
            print('    {}'.format(filename))
        dups = {}
        for path in file_list:
            file_hash = hashfile(path)
            if file_hash in dups:
                dups[file_hash].append(path)
            else:
                dups[file_hash] = [path]
        return dups
    
    
    # Joins two dictionaries
    def join_dicts(dict1, dict2):
        for key in dict2.keys():
            if key in dict1:
                dict1[key] = dict1[key] + dict2[key]
            else:
                dict1[key] = dict2[key]
    
    
    def hashfile(path, blocksize=65536):
        afile = open(path, 'rb')
        hasher = hashlib.md5()
        buf = afile.read(blocksize)
        while len(buf) > 0:
            hasher.update(buf)
            buf = afile.read(blocksize)
        afile.close()
        return hasher.hexdigest()
    
    
    def print_results(dict1):
        results = list(filter(lambda x: len(x) > 1, dict1.values()))
        if len(results) > 0:
            print('Duplicates Found:')
            print(
                'The following files are identical. The name could differ, but the'
                ' content is identical'
                )
            print('___________________')
            for result in results:
                for subresult in result:
                    print('\t\t%s' % subresult)
                print('___________________')
    
        else:
            print('No duplicate files found.')
    
    
    def main():
        parser = argparse.ArgumentParser(description='Find duplicate files')
        parser.add_argument(
            'folders', metavar='dir', type=str, nargs='+',
            help='A directory to parse for duplicates',
            )
        args = parser.parse_args()
    
        find_duplicates(args.folders)
    
    
    if __name__ == '__main__':
        sys.exit(main())
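
    If you want to call it from your own code rather than from the command line, find_duplicates can also be used directly; a small sketch (the module name dupfinder and the paths are made up):

    # 'dupfinder' is a hypothetical module name for the script above
    from dupfinder import find_duplicates

    # returns a dict mapping a content hash to the list of paths sharing it
    dups = find_duplicates(['/home/me/Pictures', '/mnt/nas/Pictures'])
    for paths in dups.values():
        if len(paths) > 1:
            print(paths)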
    
  • 2020-11-27 10:05
    import hashlib
    import os

    def remove_duplicates(dir):
        unique = set()
        for filename in os.listdir(dir):
            filepath = os.path.join(dir, filename)   # os.listdir returns bare names
            if os.path.isfile(filepath):
                with open(filepath, 'rb') as fobj:
                    filehash = hashlib.md5(fobj.read()).hexdigest()
                if filehash not in unique:
                    unique.add(filehash)
                else:
                    os.remove(filepath)
    

    //edit:

    For MP3s you may also be interested in this topic: Detect duplicate MP3 files with different bitrates and/or different ID3 tags?
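
    As a rough sketch of that idea (not a full solution - it only ignores a leading ID3v2 tag, so it will not match files re-encoded at different bitrates), one could hash the audio data after skipping the tag:

    import hashlib

    def mp3_content_hash(path):
        """Hash an MP3's contents while skipping a leading ID3v2 tag, so two
        copies that differ only in that tag hash the same. ID3v1 tags (the last
        128 bytes), APE tags and re-encoded files are NOT handled - just a sketch."""
        with open(path, 'rb') as f:
            header = bytearray(f.read(10))
            if len(header) == 10 and header[:3] == b'ID3':
                # the ID3v2 tag size is stored in 4 sync-safe bytes (7 bits each)
                size = ((header[6] & 0x7F) << 21 | (header[7] & 0x7F) << 14 |
                        (header[8] & 0x7F) << 7 | (header[9] & 0x7F))
                f.seek(10 + size)      # jump over the whole tag
            else:
                f.seek(0)              # no ID3v2 tag - hash from the start
            hashobj = hashlib.sha1()
            for chunk in iter(lambda: f.read(65536), b''):
                hashobj.update(chunk)
        return hashobj.hexdigest()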

  • 2020-11-27 10:10

    Faster algorithm

    When many 'big' files (images, mp3s, PDF documents) have to be analyzed, it can be much faster to use the following comparison algorithm:

    1. a first, fast hash is computed on the first N bytes of the file (say 1 KB). This hash can tell with certainty that two files are different, but not that they are identical (accuracy limited by the hash and by the small amount of data read from disk)

    2. a second, slower hash, computed on the whole content of the file, is performed only when a collision occurs in the first stage

    Here is an implementation of this algorithm:

    import hashlib

    def Checksum(current_file_name, check_type='sha512', first_block=False):
        """Computes the hash for the given file. If first_block is True,
        only the first block of size size_block is hashed."""
        size_block = 1024 * 1024  # the first N bytes (here 1 MiB)

        d = {'sha1': hashlib.sha1, 'md5': hashlib.md5, 'sha512': hashlib.sha512}

        if check_type not in d:
            raise Exception("Unknown checksum method")

        with open(current_file_name, 'rb') as f:
            key = d[check_type]()
            while True:
                s = f.read(size_block)
                key.update(s)
                if len(s) < size_block or first_block:
                    break
        return key.hexdigest().upper()

    def find_duplicates(files):
        """Find duplicates among a set of files.
        The implementation uses two types of hashes:
        - a small and fast one on the first block of the file (first 1 MiB),
        - and, in case of collision, a complete hash of the file. The complete
          hash is not computed twice for the same file.
        It prints the files that seem to have the same content
        (according to the hash method) at the end.
        """

        print('Analyzing', len(files), 'files')

        # this dictionary will receive the small hashes
        d = {}
        # this dictionary will receive the full hashes. It is filled
        # only in case of a collision on the small hash (i.e. the small-hash
        # bucket contains at least two elements)
        duplicates = {}

        for f in files:

            # small hash, to be fast
            check = Checksum(f, first_block=True, check_type='sha1')

            if check not in d:
                # d[check] is a list of files that have the same small hash
                d[check] = [(f, None)]
            else:
                l = d[check]
                l.append((f, None))

                for index, (ff, checkfull) in enumerate(l):

                    if checkfull is None:
                        # compute the full hash in case of collision
                        checkfull = Checksum(ff, first_block=False)
                        l[index] = (ff, checkfull)

                        # for each newly computed full hash, check if there is
                        # a collision in the duplicates dictionary
                        if checkfull not in duplicates:
                            duplicates[checkfull] = [ff]
                        else:
                            duplicates[checkfull].append(ff)

        # print the detected duplicates
        if len(duplicates) != 0:
            print()
            print("The following files have the same sha512 hash")

            for h, lf in duplicates.items():
                if len(lf) == 1:
                    continue
                print('Hash value', h)
                for f in lf:
                    print('\t', f)
        return duplicates
    

    The find_duplicates function takes a list of files. This way, it is also possible to compare two directories (for instance, to better synchronize their content). An example of a function that creates a list of files with specified extensions, and avoids descending into certain directories, is shown below:

    import os

    def getFiles(_path, extensions=['.png'],
                 subdirs=False, avoid_directories=None):
        """Returns the list of files in the path '_path'
        whose extension is in 'extensions'. 'subdirs' indicates whether
        the search should also be performed in the subdirectories.
        If extensions is [] or None, all files are returned.
        avoid_directories: if set, do not parse subdirectories that
        match any element of avoid_directories."""

        l = []
        extensions = [p.lower() for p in extensions] if extensions is not None \
            else None
        for root, dirs, files in os.walk(_path, topdown=True):

            for name in files:
                if (extensions is None or len(extensions) == 0 or
                        os.path.splitext(name)[1].lower() in extensions):
                    l.append(os.path.join(root, name))

            if not subdirs:
                # do not descend into any subdirectory
                dirs[:] = []
            elif avoid_directories is not None:
                for d in avoid_directories:
                    if d in dirs:
                        dirs.remove(d)

        return l
    

    This is convenient for skipping .svn paths, for instance, which would otherwise trigger colliding files in find_duplicates.
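
    For instance, to compare the pictures of two directories with the two functions above (a minimal sketch; the paths are made up):

    files = getFiles('/data/photos/2015', extensions=['.jpg', '.png'],
                     subdirs=True, avoid_directories=['.svn'])
    files += getFiles('/backup/photos/2015', extensions=['.jpg', '.png'],
                      subdirs=True)
    duplicates = find_duplicates(files)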

    Feedback is welcome.
