Question
I'm dealing with a large number of files (about 30,000), each around 10 MB in size. Some of them (I estimate 2%) are actually duplicates, and I need to keep only one copy of each duplicated pair (or triplet). Can you suggest an efficient way to do that? I'm working on Unix.
Answer 1:
There is an existing tool for this: fdupes
Restoring a solution from an old deleted answer.
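For example (DIR is a placeholder; the flags below are as in common fdupes builds, so check your local man page):
# list the sets of duplicate files under DIR, recursively
fdupes -r DIR
# delete all but the first file in each duplicate set, without prompting
fdupes -rdN DIR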
Answer 2:
You can try this snippet to list all the duplicates first, before removing anything:
find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: "$2" and "seen[$1] }(!($1 in seen)){seen[$1]=$2}'
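Once you have reviewed that output, a small variation of the same pipeline can do the removal. Like the snippet above, this is only a sketch: it assumes GNU xargs (for -r) and file names without spaces or newlines, and /path is a placeholder.
find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print $2}(!($1 in seen)){seen[$1]=$2}' | xargs -r rm --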
Answer 3:
I would write a script to create a hash of every file. You could store the hashes in a set, iterate over the files, and where a file hashes to a value already found in the set, delete the file. This would be trivial to do in Python, for example.
For 30,000 files, at 64 bytes per hash table entry, you're only looking at about 2 megabytes of memory.
Answer 4:
Find possible duplicate files:
find DIR -type f -exec sha1sum "{}" \; | sort | uniq -d -w40
Now you can use cmp to check that the files are really identical.
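For example, if the listing flags a checksum that occurs more than once (file1 and file2 are placeholders for two files sharing that checksum):
# remove the second file only if the two are byte-for-byte identical
cmp -s file1 file2 && rm file2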
Answer 5:
Write a script that first compares file sizes, then MD5 checksums (computing and caching each checksum only once) and, if you're very anxious about losing data, bites the bullet and actually compares duplicate candidates byte for byte. If you have no additional knowledge about how the files came to be, it can't really be done much more efficiently.
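A rough sketch of that approach, assuming bash 4 or later with GNU find/stat/md5sum and cmp; DIR is a placeholder for the directory to scan:
#!/bin/bash
# Group files by size, then by MD5, then confirm byte for byte before deleting.
declare -A first_of_size md5_owner
while IFS= read -r -d '' f; do
    size=$(stat -c %s "$f")
    if [[ -z "${first_of_size[$size]}" ]]; then
        # first file of this size: nothing to compare against yet
        first_of_size[$size]="$f"
        continue
    fi
    # another file of the same size: hash the earlier one lazily, if not already done
    prev="${first_of_size[$size]}"
    if [[ "$prev" != HASHED ]]; then
        psum=$(md5sum "$prev" | awk '{print $1}')
        md5_owner[$psum]="$prev"
        first_of_size[$size]=HASHED
    fi
    sum=$(md5sum "$f" | awk '{print $1}')
    if [[ -n "${md5_owner[$sum]}" ]]; then
        # same size and same MD5: confirm byte for byte before deleting
        if cmp -s "${md5_owner[$sum]}" "$f"; then
            echo "removing duplicate: $f (identical to ${md5_owner[$sum]})"
            rm -- "$f"
        fi
    else
        md5_owner[$sum]="$f"
    fi
done < <(find DIR -type f -print0)
Each file is checksummed at most once, and files with a unique size are never read at all.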
Answer 6:
Save all the file names in an array, then traverse the array. In each iteration, compare the file's checksum, computed with the md5sum command, against the checksums of the files you have already kept. If the MD5 is the same, remove the file. For example, if file b is a duplicate of file a, the md5sum output will be the same for both files.
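A minimal sketch of that loop (assuming bash 4 or later for associative arrays and GNU md5sum; /path is a placeholder):
declare -A seen
while IFS= read -r -d '' f; do
    sum=$(md5sum "$f" | awk '{print $1}')
    if [[ -n "${seen[$sum]}" ]]; then
        # same MD5 as a file we have already kept: remove this copy
        rm -- "$f"
    else
        seen[$sum]="$f"
    fi
done < <(find /path -type f -print0)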
Source: https://stackoverflow.com/questions/2400574/remove-identical-files-in-unix