Question
I have a 12 GB file of combined hash lists. I need to find the duplicates in it, but I've been having some issues.
Some 920 (uniq'd) lists were merged using cat *.txt > _uniq_combined.txt, resulting in a huge list of hashes. Once merged, the final list WILL contain duplicates.
I thought I had it figured out with awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt && say finished ya jabroni
awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt
results in a file with a size of 4574766572 bytes.
I was told that a file that large is not possible and to try again.
sort _uniq_combined.txt | uniq -c | grep -v '^ *1 ' > _SORTEDC_duplicates.txt
results in a file with a size of 1624577643 bytes. Significantly smaller.
sort _uniq_combined.txt | uniq -d > _UNIQ_duplicates.txt
results in a file with a size of 1416298458 bytes.
I'm beginning to think I don't know what these commands do, since the file sizes should be the same.
Again, the goal is to look through a giant list and save instances of hashes seen more than once. Which (if any) of these results is correct? I thought they all did the same thing.
Answer 1:
sort is designed especially to cope with huge files too. You could do:
cat *.txt | sort >all_sorted
uniq all_sorted >unique_sorted
sdiff -sld all_sorted unique_sorted | uniq >all_duplicates
Answer 2:
The sort command should work fine with a 12 GB file. And uniq will output just the duplicated lines if you specify the -d or -D options. That is:
sort all_combined > all_sorted
uniq -d all_sorted > duplicates
or
uniq -D all_sorted > all_duplicates
The -d option displays one line for each duplicated element. So if "foo" occurs 12 times, it will display "foo" one time. -D prints all duplicates.
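As a small illustration of the difference (this assumes GNU uniq; the sample file here is made up):
$ printf 'foo\nfoo\nfoo\nbar\nbaz\nbaz\n' | sort > sample_sorted
$ uniq -d sample_sorted
baz
foo
$ uniq -D sample_sorted
baz
baz
foo
foo
foo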
uniq --help will give you a bit more information.
Answer 3:
Maybe if you split that big file into smaller files, sort --unique'd them, and then merged them with sort --merge:
$ cat > test1
1
1
2
2
3
3
$ cat > test2
2
3
3
4
4
$ sort -m -u test1 test2
1
2
3
4
I would imagine that merging already-sorted files would not have to happen in memory?
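A rough sketch of that idea on the original file might look like the following (the chunk size and file names are only illustrative, and as in the example above the end result is the deduplicated list, not the duplicates):
$ split -l 10000000 _uniq_combined.txt chunk_              # cut the big file into 10-million-line pieces
$ for f in chunk_*; do sort -u "$f" -o "$f.sorted"; done   # sort each piece, dropping duplicates within it
$ sort -m -u chunk_*.sorted > merged_unique                # merge the sorted pieces, dropping duplicates across them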
Answer 4:
I think your awk script is incorrect, your uniq -c command includes the counts of occurrences alongside the duplicates, and sort _uniq_combined.txt | uniq -d is the correct thing :) .
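A tiny demonstration of that difference, on a made-up sample file:
$ printf 'foo\nbar\nfoo\nbaz\nfoo\nbaz\n' > sample
$ awk '!seen[$0]++' sample       # keeps the first occurrence of every line, i.e. the unique lines
foo
bar
baz
$ sort sample | uniq -d          # prints one copy of each line that occurs more than once
baz
foo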
Note that you could have directly run sort *.txt > sorted_hashes or sort *.txt -o sorted_hashes.
If you have just two files at hand, consider using comm (info coreutils to the rescue), which can give you columned output of "lines just in the first file", "lines just in the second file", and "lines in both files". If you need only some of these columns, you can suppress the others with options to comm. Or use the generated output as a base and continue working on it with cut, like cut -f 1 my_three_colum_file to get the first column.
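For example, with two made-up sorted files:
$ printf 'a\nb\nc\n' > list1
$ printf 'b\nc\nd\n' > list2
$ comm -12 list1 list2           # -1 and -2 suppress the first two columns, leaving lines common to both files
b
c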
Source: https://stackoverflow.com/questions/39221524/how-to-sort-out-duplicates-from-a-massive-list-using-sort-uniq-or-awk