Remove duplicate lines without sorting [duplicate]

Submitted by 风流意气都作罢 on 2019-11-26 06:56:14

Question


This question already has an answer here:

  • How to delete duplicate lines in a file without sorting it in Unix? (8 answers)

I have a utility script in Python:

#!/usr/bin/env python
import sys

seen = set()            # O(1) membership test instead of scanning a list
unique_lines = []
duplicate_lines = []
for line in sys.stdin:
    if line in seen:
        duplicate_lines.append(line)
    else:
        seen.add(line)
        unique_lines.append(line)
        sys.stdout.write(line)   # emit each line the first time it is seen
# optionally do something with duplicate_lines
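For reference, the script acts as a plain stdin-to-stdout filter; assuming it is saved as unique.py (a hypothetical name) and made executable, an invocation would look like:

$ printf 'a\nb\na\nc\n' | ./unique.py
a
b
c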

This simple functionality (uniq without needing to sort first, stable ordering) must be available as a simple UNIX utility, mustn't it? Maybe a combination of filters in a pipe?

Reason for asking: I need this functionality on a system on which I cannot execute Python from anywhere.


Answer 1:


The UNIX Bash Scripting blog suggests:

awk '!x[$0]++'

This command tells awk which lines to print. The variable $0 holds the entire contents of a line, and the square brackets denote array access. So, for each line of the file, the array element x[$0] is post-incremented, and the line is printed only if that element was not (!) already set, i.e. on the line's first occurrence.
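For example (the input values here are arbitrary), the filter keeps the first occurrence of each line and preserves input order:

$ printf 'a\nb\na\nc\nb\n' | awk '!x[$0]++'
a
b
c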




Answer 2:


A late answer - I just ran into a duplicate of this - but perhaps worth adding...

The principle behind @1_CR's answer can be written more concisely, using cat -n instead of awk to add line numbers:

cat -n file_name | sort -uk2 | sort -nk1 | cut -f2-
  • Use cat -n to prepend line numbers
  • Use sort -uk2 to remove duplicate lines, sorting on the second field (the line content) and keeping the first occurrence
  • Use sort -nk1 to re-sort by the prepended line number
  • Use cut -f2- to remove the line numbering
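A small round-trip example (the sample input is made up) showing that the original order is preserved and only first occurrences survive:

$ printf 'b\na\nb\nc\na\n' | cat -n | sort -uk2 | sort -nk1 | cut -f2-
b
a
c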



Answer 3:


Michael Hoffman's solution above is short and sweet. For larger files, a Schwartzian-transform-style approach, adding an index field with awk followed by multiple rounds of sort and uniq, involves less memory overhead. The following snippet works in bash:

awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -s -k2,2 | uniq --skip-fields 1 | sort -t$'\t' -n -k1,1 | cut -f2- -d$'\t'
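As a sanity check (sample data is made up), the pipeline keeps the first occurrence of each line and restores the original order at the end:

$ printf 'b\na\nb\n' | awk '{print(NR"\t"$0)}' | sort -t$'\t' -s -k2,2 | uniq --skip-fields 1 | sort -t$'\t' -n -k1,1 | cut -f2- -d$'\t'
b
a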



Answer 4:


To remove duplicates across two files:

awk '!a[$0]++' file1.csv file2.csv
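For example (file names and contents are illustrative only), lines in the second file that already appeared in the first are dropped:

$ printf 'a\nb\n' > file1.csv
$ printf 'b\nc\n' > file2.csv
$ awk '!a[$0]++' file1.csv file2.csv
a
b
c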



Answer 5:


Thanks 1_CR! I needed uniq -u behavior (remove duplicated lines entirely) rather than uniq (leave one copy of each duplicate). The awk and perl solutions can't really be modified to do this; yours can! I may also have needed the lower memory use, since I will be uniq'ing something like 100,000,000 lines 8-). Just in case anyone else needs it, I put a -u in the uniq portion of the command:

awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -s -k2,2 | uniq -u --skip-fields 1 | sort -t$'\t' -n -k1,1 | cut -f2- -d$'\t'
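A quick illustration (made-up input): with -u, any line that occurs more than once is removed entirely, not just reduced to one copy:

$ printf 'a\nb\na\n' | awk '{print(NR"\t"$0)}' | sort -t$'\t' -s -k2,2 | uniq -u --skip-fields 1 | sort -t$'\t' -n -k1,1 | cut -f2- -d$'\t'
b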



Answer 6:


Now you can check out this small tool written in Rust: uq.

It performs uniqueness filtering without having to sort the input first, and can therefore be applied to a continuous stream.
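Assuming uq follows the usual filter convention of reading stdin and writing the deduplicated lines to stdout (check the tool's own documentation to confirm), it could be dropped into a live pipeline like:

$ tail -f app.log | uq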




Answer 7:


I just wanted to remove duplicates on consecutive lines, not everywhere in the file. So I used:

awk '{
  if ($0 != PREVLINE) print $0;
  PREVLINE=$0;
}'
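For example (sample input invented here), only adjacent repeats are collapsed; a duplicate reappearing later in the stream is kept:

$ printf 'a\na\nb\na\n' | awk '{ if ($0 != PREVLINE) print $0; PREVLINE = $0 }'
a
b
a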



Answer 8:


The plain uniq command even works inside an alias, though note that it only collapses adjacent duplicate lines, so the input normally has to be sorted first: http://man7.org/linux/man-pages/man1/uniq.1.html



Source: https://stackoverflow.com/questions/11532157/remove-duplicate-lines-without-sorting
