Question
The following command prints a long list of hashes and file names:
md5sum *.java
I have tried unsuccessfully to list the lines where identical hashes occur, so that I can then remove identical files.
How can you filter and delete identical files which have the same content?
Answer 1:
This should work:
md5sum *.java | sort | uniq -d -w32
This tells uniq to compare only the first 32 characters, which cover just the MD5 sum, not the filenames.
EDIT: If -w isn't available, try:
md5sum *.java | awk '{print $1}' | sort | uniq -d
The downside is that you won't know which files have these duplicate checksums... anyway, if there aren't too many checksums, you can use
md5sum *.java | grep 0bee89b07a248e27c83fc3d5951213c1
to get the filenames afterwards (the checksum above is just an example). I'm sure there's a way to do all this in a shell script, too.
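As a rough sketch of such a script — assuming GNU md5sum output and filenames without spaces or newlines — the following awk one-liner groups the filenames under each checksum that occurs more than once:
md5sum *.java | sort | awk '
    # Collect filenames per checksum (first field is the hash, second the name).
    { count[$1]++; names[$1] = names[$1] "  " $2 "\n" }
    END {
        for (h in count)
            if (count[h] > 1)          # only checksums seen more than once
                printf "%s:\n%s", h, names[h]
    }
'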
Answer 2:
Viewing duplicates with fdupes and less
Use fdupes, a command-line program, for example:
fdupes -r /home/masi/Documents/ > /tmp/1
less -M +Gg /tmp/1
which finds all duplicates and stores them in a file under /tmp.
The less command shows you the current line position and your progress as a percentage.
I found fdupes from this answer and its clear Wikipedia article here.
You can install it with Homebrew on OS X and with apt-get on Linux.
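For example, the usual install commands are (exact package availability may vary):
brew install fdupes           # OS X, with Homebrew
sudo apt-get install fdupes   # Debian/Ubuntu Linux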
Using fdupes interactively to delete duplicates
Run
fdupes -rd /home/masi/Documents
which lets you choose which copies to delete or keep. An example of the interactive session:
Set 4 of 2664, preserve files [1 - 2, all]: all
[+] /home/masi/Documents/Exercise 10 - 1.4.2015/task.bib
[+] /home/masi/Documents/Exercise 9 - 16.3.2015/task.bib
[1] /home/masi/Documents/Celiac_disease/jcom_jun02_celiac.pdf
[2] /home/masi/Documents/turnerWhite/jcom_jun02_celiac.pdf
Set 5 of 2664, preserve files [1 - 2, all]: 2
[-] /home/masi/Documents/Celiac_disease/jcom_jun02_celiac.pdf
[+] /home/masi/Documents/turnerWhite/jcom_jun02_celiac.pdf
where you see that I have 2664 duplicate sets. It would be nice to have a static file that saves my preferences about which duplicates to keep; I opened a thread about this here. For instance, I have the same .bib files in several exercise and homework folders, so it should not ask a second time whether I want the duplicate.
Answer 3:
Even better:
md5sum *.java | sort | uniq -d
That only prints the duplicate lines.
Answer 4:
This lists all the files, putting a blank line between duplicates:
$ md5sum *.txt \
| sort \
| perl -pe '($y)=split; print "\n" unless $y eq $x; $x=$y'
05aa3dad11b2d97568bc506a7080d4a3 b.txt
2a517c8a78f1e1582b4ce25e6a8e4953 n.txt
e1254aebddc54f1cbc9ed2eacce91f28 a.txt
e1254aebddc54f1cbc9ed2eacce91f28 k.txt
e1254aebddc54f1cbc9ed2eacce91f28 p.txt
$
To print only the first of each group:
$ md5sum *.txt | sort | perl -ne '($y,$f)=split; print "$f\n" unless $y eq $x; $x=$y'
b.txt
n.txt
a.txt
$
If you're brave, change the "unless" to "if", and then
$ rm `md5sum ...`
will delete all but the first file of each group.
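Spelled out, the full command would look roughly like this — a sketch only, assuming filenames contain no spaces or newlines; run the first line on its own to inspect what would be deleted before running rm:
$ md5sum *.txt | sort | perl -ne '($y,$f)=split; print "$f\n" if $y eq $x; $x=$y'
$ rm $(md5sum *.txt | sort | perl -ne '($y,$f)=split; print "$f\n" if $y eq $x; $x=$y')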
Source: https://stackoverflow.com/questions/621708/checking-duplicates-in-terminal