Checking duplicates in terminal?


Question


The following code prints me a long list of files with hashes and file names

md5sum *.java

I have tried unsuccessfully to list the lines where identical hashes occur, so that I can then remove identical files.

How can I filter and delete identical files, i.e. files that have the same content?


Answer 1:


This should work:

md5sum *.java | sort | uniq -d -w32

This tells uniq to compare only the first 32 characters, which is just the md5 sum and not the filenames.
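If your uniq is the GNU version, you can also combine -w32 with -D (--all-repeated) to print every file in each duplicate group rather than one line per group; this is a sketch assuming GNU coreutils:

md5sum *.java | sort | uniq -D -w32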

EDIT: If -w isn't available, try:

md5sum *.java | awk '{print $1}' | sort | uniq -d

The downside is that you won't know which files have these duplicate checksums... anyway, if there aren't too many checksums, you can use

md5sum *.java | grep 0bee89b07a248e27c83fc3d5951213c1

to get the filenames afterwards (the checksum above is just an example). I'm sure there's a way to do all this in a shell script, too.
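For example, here is a minimal shell-script sketch along those lines (the temp file name /tmp/sums is just an illustration): it hashes all files, finds the checksums that occur more than once, and then prints every file carrying one of those checksums.

#!/bin/sh
# Hash all files and sort so identical checksums end up next to each other.
md5sum *.java | sort > /tmp/sums
# The first 32 characters of each line are the md5 sum; uniq -d keeps one
# copy of every checksum that appears at least twice.
cut -c1-32 /tmp/sums | uniq -d | while read sum; do
    # Print the full lines (checksum + filename) for each duplicate group.
    grep "^$sum" /tmp/sums
done
rm /tmp/sums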




Answer 2:


View duplicates with fdupes and less

Use fdupes, which is a command-line program, for example:

fdupes -r /home/masi/Documents/ > /tmp/1 
less -M +Gg /tmp/1

which finds all duplicates and stores them in a file under /tmp. The less command shows you the current line position and how far through the file you are as a percentage. I found fdupes from this answer and its clear Wikipedia article. You can install it with Homebrew on OS X and with apt-get on Linux.
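For reference, the install commands referred to above, assuming the package is simply called fdupes in both package managers:

brew install fdupes          # OS X, via Homebrew
sudo apt-get install fdupes  # Debian/Ubuntu Linux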

Use fdupes interactively with optional deletion

Run

fdupes -rd /home/masi/Documents

which lets you choose which copy to delete, if any. An example view of the interactive session:

Set 4 of 2664, preserve files [1 - 2, all]: all

   [+] /home/masi/Documents/Exercise 10 - 1.4.2015/task.bib
   [+] /home/masi/Documents/Exercise 9 - 16.3.2015/task.bib

[1] /home/masi/Documents/Celiac_disease/jcom_jun02_celiac.pdf
[2] /home/masi/Documents/turnerWhite/jcom_jun02_celiac.pdf

Set 5 of 2664, preserve files [1 - 2, all]: 2

   [-] /home/masi/Documents/Celiac_disease/jcom_jun02_celiac.pdf
   [+] /home/masi/Documents/turnerWhite/jcom_jun02_celiac.pdf

where you can see that I have 2664 duplicate sets. It would be nice to have some static file that saved my preferences about which duplicates to keep; I opened a thread about this here. For instance, I have the same bib files in several exercise and homework directories, so fdupes should not ask a second time once the user has said they want to keep the duplicate.
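If you do not want to be prompted at all, fdupes also has a -N (--noprompt) option that, combined with -d, keeps the first file of every duplicate set and deletes the rest automatically; a sketch, assuming your fdupes version supports it (use with care, since nothing is asked before deleting):

fdupes -rdN /home/masi/Documents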




Answer 3:


Even better:

md5sum *.java | sort | uniq -d

That only prints the duplicate lines.




Answer 4:


This lists all the files, putting a blank line between duplicates:

$ md5sum *.txt \
  | sort       \
  | perl -pe '($y)=split; print "\n" unless $y eq $x; $x=$y'

05aa3dad11b2d97568bc506a7080d4a3  b.txt
2a517c8a78f1e1582b4ce25e6a8e4953  n.txt
e1254aebddc54f1cbc9ed2eacce91f28  a.txt
e1254aebddc54f1cbc9ed2eacce91f28  k.txt
e1254aebddc54f1cbc9ed2eacce91f28  p.txt
$

To print only the 1st of each group:

$ md5sum *.txt | sort | perl -ne '($y,$f)=split; print "$f\n" unless $y eq $x; $x=$y'
b.txt
n.txt
a.txt
$ 

If you're brave, change the "unless" to "if" and then

$ rm `md5sum ...`

to delete all but the first file of each group.
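Putting that together, a sketch of the full deletion command under the same assumptions as above (the same *.txt files; note it will misbehave on filenames containing whitespace):

$ rm `md5sum *.txt | sort | perl -ne '($y,$f)=split; print "$f\n" if $y eq $x; $x=$y'`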



Source: https://stackoverflow.com/questions/621708/checking-duplicates-in-terminal
