Remove identical files in UNIX


Question


I'm dealing with a large number (30,000) of files, each about 10 MB in size. Some of them (I estimate 2%) are actually duplicated, and I need to keep only one copy of each duplicated pair (or triplet). Would you suggest an efficient way to do that? I'm working on UNIX.


Answer 1:


There is an existing tool for this: fdupes

Restoring a solution from an old deleted answer.
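
A typical invocation, assuming fdupes is installed and DIR is a placeholder for the directory to scan: list the duplicates recursively first, then rerun with deletion once you are happy with the output.

fdupes -r DIR

fdupes -rdN DIR    # -d deletes, -N keeps the first file of each set without prompting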




Answer 2:


You can try this snippet to list all the duplicates first, before removing anything:

find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: "$2" and "seen[$1]}(!($1 in seen)){seen[$1]=$2}'
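
Once you have reviewed that output, a variation of the same pipeline can print only the redundant copies so they can be removed. This is a sketch, assuming GNU xargs and filenames without spaces or newlines; double-check the list before piping it to rm.

find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print $2}(!($1 in seen)){seen[$1]=$2}' | xargs -d '\n' rm --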



Answer 3:


I would write a script to create a hash of every file. You could store the hashes in a set, iterate over the files, and where a file hashes to a value already found in the set, delete the file. This would be trivial to do in Python, for example.

For 30,000 files, at 64 bytes per hash table entry, you're only looking at about 2 MB of memory for the set of hashes.
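
The answer suggests Python; here is a minimal sketch of the same hash-set idea as a bash script instead, assuming bash 4+ for associative arrays and a placeholder directory:

#!/bin/bash
# Keep the first copy of each content hash, delete later files with the same hash.
declare -A seen
for f in /path/to/files/*; do
    [[ -f "$f" ]] || continue
    h=$(sha256sum "$f" | awk '{print $1}')
    if [[ -n "${seen[$h]}" ]]; then
        echo "removing duplicate: $f (same content as ${seen[$h]})"
        rm -- "$f"
    else
        seen[$h]="$f"
    fi
done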




Answer 4:


Find possible duplicate files:

find DIR -type f -exec sha1sum "{}" \; | sort | uniq -d -w40

A SHA-1 hash is 40 hex characters, which is why -w40 compares only the hash column. Now you can use cmp to check that the candidate files are really identical.
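
For example, to confirm that two candidates really are byte-for-byte identical (file1 and file2 are placeholders):

cmp -s file1 file2 && echo "identical" || echo "different"

cmp exits with status 0 only when the files match, so it also works well inside scripts.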




Answer 5:


Write a script that first compares file sizes, then MD5 checksums (caching them, of course), and, if you're very anxious about losing data, bites the bullet and actually compares duplicate candidates byte for byte. If you have no additional knowledge about how the files came to be, it can't really be done much more efficiently than that.
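
As a rough sketch of the size-first pass (GNU find assumed, filenames without tabs or newlines): print each file's size and path, and keep only files whose size occurs more than once, since a file with a unique size cannot have a duplicate.

find /path -type f -printf '%s\t%p\n' | awk -F'\t' '{ count[$1]++; names[$1] = names[$1] $2 "\n" } END { for (s in count) if (count[s] > 1) printf "%s", names[s] }'

Only the files this prints need to be checksummed or compared byte for byte.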




Answer 6:


Save all the file names in an array, then traverse the array. In each iteration, compute the file's checksum with md5sum and compare it with the checksums of the files already processed; if the MD5 is the same, remove the file.

For example, if file b is a duplicate of file a, md5sum reports the same hash for both files.
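
For instance, where a and b stand for those hypothetical file names:

md5sum a b    # identical files produce the same 32-character hash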



Source: https://stackoverflow.com/questions/2400574/remove-identical-files-in-unix
