Question
I have a 12 GB file of combined hash lists. I need to find the duplicates in it, but I've been having some issues.
Some 920 (uniq'd) lists were merged using cat *.txt > _uniq_combined.txt, resulting in a huge list of hashes. Once merged, the final list WILL contain duplicates.
I thought I had it figured out with awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt && say finished ya jabroni
awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt
results in a file with a size of 4574766572 bytes.
I was told that a file that large is not possible and to try again.
sort _uniq_combined.txt | uniq -c | grep -v '^ *1 ' > _SORTEDC_duplicates.txt
results in a file with a size of 1624577643 bytes. Significantly smaller.
sort _uniq_combined.txt | uniq -d > _UNIQ_duplicates.txt
results in a file with a size of 1416298458 bytes.
I'm beginning to think I don't know what these commands do, since the file sizes should be the same.
Again, the goal is to look through a giant list and save instances of hashes seen more than once. Which (if any) of these results is correct? I thought they all did the same thing.
Answer 1:
sort is designed especially to cope with huge files too. You could do:
cat *.txt | sort >all_sorted
uniq all_sorted >unique_sorted
sdiff -sld all_sorted unique_sorted | uniq >all_duplicates
Answer 2:
The sort command should work fine with a 12 GB file. And uniq will output just the duplicated lines if you specify the -d or -D options. That is:
sort all_combined > all_sorted
uniq -d all_sorted > duplicates
or
uniq -D all_sorted > all_duplicates
The -d option displays one line for each duplicated element. So if "foo" occurs 12 times, it will display "foo" one time. -D prints all duplicates.
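As a small illustration of the difference (this assumes GNU uniq; the sample file here is made up):
$ printf 'foo\nfoo\nfoo\nbar\nbaz\nbaz\n' | sort > sample_sorted
$ uniq -d sample_sorted
baz
foo
$ uniq -D sample_sorted
baz
baz
foo
foo
foo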
uniq --help will give you a bit more information.
Answer 3:
Maybe if you split that big file into smaller files, sort --unique'd them, and then merged them with sort --merge:
$ cat > test1
1
1
2
2
3
3
$ cat > test2
2
3
3
4
4
$ sort -m -u test1 test2
1
2
3
4
I would imagine that merging already-sorted files would not have to happen in memory?
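A rough sketch of that idea on the original file might look like the following (the chunk size and file names are only illustrative, and as in the example above the end result is the deduplicated list, not the duplicates):
$ split -l 10000000 _uniq_combined.txt chunk_              # cut the big file into 10-million-line pieces
$ for f in chunk_*; do sort -u "$f" -o "$f.sorted"; done   # sort each piece, dropping duplicates within it
$ sort -m -u chunk_*.sorted > merged_unique                # merge the sorted pieces, dropping duplicates across them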
Answer 4:
I think your awk script is incorrect, your uniq -c command includes the counts of occurrences alongside the duplicates, and sort _uniq_combined.txt | uniq -d is the correct thing :) .
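A tiny demonstration of that difference, on a made-up sample file:
$ printf 'foo\nbar\nfoo\nbaz\nfoo\nbaz\n' > sample
$ awk '!seen[$0]++' sample       # keeps the first occurrence of every line, i.e. the unique lines
foo
bar
baz
$ sort sample | uniq -d          # prints one copy of each line that occurs more than once
baz
foo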
Note that you could have directly run sort *.txt > sorted_hashes or sort *.txt -o sorted_hashes.
If you have just two files at hand, consider using comm (info coreutils to the rescue), which can give you columned output of "lines just in the first file", "lines just in the second file", and "lines in both files". If you need only some of these columns, you can suppress the others with options to comm. Or use the generated output as a base and continue working on it with cut, like cut -f 1 my_three_colum_file to get the first column.
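For example, with two made-up sorted files:
$ printf 'a\nb\nc\n' > list1
$ printf 'b\nc\nd\n' > list2
$ comm -12 list1 list2           # -1 and -2 suppress the first two columns, leaving lines common to both files
b
c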
Source: https://stackoverflow.com/questions/39221524/how-to-sort-out-duplicates-from-a-massive-list-using-sort-uniq-or-awk