awk to remove duplicate rows totally based on a particular column value

Submitted by 自闭症网瘾萝莉.ら on 2019-12-02 02:10:08

Using awk to filter out duplicate lines and print only those lines which occur exactly once.

awk '{k=($2 FS $5 FS $6 FS $4)} {a[$4]++;b[$4]=k}END{for(x in a)if(a[x]==1)print b[x]}' input_file

SNP_A_30018682 T G 30018682
SNP_A_30018695 G C 30018695
SNP_A_30018679 T G 30018679

The idea is to:

  1. Count the occurrences of each $4 value in array a, and store the reordered line for that key in array b.
  2. In the END block, print b[x] for every key x whose count is exactly one.
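As a runnable sketch, with a hypothetical input file in the question's layout (the duplicated key 30018680 is invented for illustration):

```shell
# Hypothetical sample in the question's layout; key 30018680 is duplicated.
cat > input_file <<'EOF'
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018680 0 30018680 A C
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018680 0 30018680 A C
6 SNP_A_30018695 0 30018695 G C
EOF

# a[$4] counts each key; b[$4] keeps the reordered line for that key
awk '{k=($2 FS $5 FS $6 FS $4)}
     {a[$4]++; b[$4]=k}
     END{for(x in a) if(a[x]==1) print b[x]}' input_file | sort
```

The trailing sort is only there because `for (x in a)` iterates in an unspecified order.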

Using command substitution: first print the fourth-field values that occur exactly once, then grep the file for those values.

grep "$(echo  "$(awk '{print $4}' inputfile.txt)" |sort |uniq -u)" inputfile.txt
6   SNP_A_30018679  0   30018679    T   G
6   SNP_A_30018682  0   30018682    T   G
6   SNP_A_30018695  0   30018695    G   C

Note: append awk '{NF=4}1' to the end of the command if you wish to print only the first four columns. You can print a different number of columns by changing the value in NF=4.
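A sketch of the same pipeline on hypothetical data (uniq -u requires sorted input, which the sort provides; grep treats each newline-separated value from the substitution as a separate pattern):

```shell
# Hypothetical input; key 30018680 is duplicated and should disappear.
cat > inputfile.txt <<'EOF'
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018680 0 30018680 A C
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018680 0 30018680 A C
6 SNP_A_30018695 0 30018695 G C
EOF

# print only the $4 values that occur exactly once, then grep for them
grep "$(awk '{print $4}' inputfile.txt | sort | uniq -u)" inputfile.txt
```

Be aware that grep does substring matching, so a key that happens to occur inside another line's fields would also match.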

Since your key is fixed-width, uniq's -w option (a GNU extension, not in POSIX) can limit the comparison to that many characters after -f has skipped the first three fields.

sort -k4,4 example.txt | uniq -u -f 3 -w 8 > uniq.txt

A two-pass awk: the first pass (NR==FNR) counts each $4 value, and the second pass prints only the lines whose key occurred once.

$ awk 'NR==FNR{c[$4]++;next} c[$4]<2' file file
6   SNP_A_30018679  0   30018679    T   G
6   SNP_A_30018682  0   30018682    T   G
6   SNP_A_30018695  0   30018695    G   C
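On hypothetical data with a duplicated key, the two passes look like this:

```shell
# Hypothetical input; the duplicated key 30018680 should be removed.
cat > file <<'EOF'
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018680 0 30018680 A C
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018680 0 30018680 A C
6 SNP_A_30018695 0 30018695 G C
EOF

# pass 1 (NR==FNR) counts every $4; pass 2 prints lines whose count is 1
awk 'NR==FNR{c[$4]++; next} c[$4]<2' file file
```

Reading the same file twice keeps the output in the original line order, at the cost of a second scan.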

Another in awk:

$ awk '{$1=$1; a[$4]=a[$4] $0} END{for(i in a) if(gsub(FS,FS,a[i])==5) print a[i]}' file
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C

Concatenate each line into array a using $4 as the key ($1=$1 first normalizes the separators to single spaces). A single six-field line contains exactly five field separators; if an entry holds more than five, duplicates were concatenated into it and it is not printed.
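A quick check of the separator-counting trick on hypothetical data:

```shell
# Hypothetical input; key 30018680 is duplicated.
cat > file <<'EOF'
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018680 0 30018680 A C
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018680 0 30018680 A C
6 SNP_A_30018695 0 30018695 G C
EOF

# after $1=$1 every line is single-spaced; two concatenated six-field
# lines hold more than five separators, so duplicates fail the ==5 test
awk '{$1=$1; a[$4]=a[$4] $0}
     END{for(i in a) if(gsub(FS,FS,a[i])==5) print a[i]}' file | sort
```

gsub here is used only for its return value, the number of separators replaced; the sort again just fixes the unspecified `for (i in a)` order.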

And yet another version in awk. It expects the file to be sorted on the fourth field. It won't store all the lines in memory, only the keys (and since the key field is sorted, even that could probably be avoided; maybe fixed later), and runs in one pass:

$ cat ananother.awk
++seen[p[4]]==1 && NR>1 && p[4]!=$4 {  # seen count must be 1 and
    print prev                         # this and previous $4 must differ
    delete seen                        # is this enough really?
}
{ 
    q=p[4]                             # previous previous $4 for END
    prev=$0                            # previous is stored for printing
    split($0,p)                        # to get previous $4
} 
END {                                  # last record control
    if(++seen[$4]==1 && q!=$4) 
        print $0
}

Run:

$ sort -k4,4 file | awk -f ananother.awk
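A self-contained check on hypothetical, unsorted data (the script is repeated here without comments so the block runs on its own; note that `delete seen` on a whole array is a widespread awk extension):

```shell
# Hypothetical unsorted input; key 30018680 is duplicated.
cat > file <<'EOF'
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018680 0 30018680 A C
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018680 0 30018680 A C
6 SNP_A_30018695 0 30018695 G C
EOF

# the script from above, condensed
cat > ananother.awk <<'EOF'
++seen[p[4]]==1 && NR>1 && p[4]!=$4 { print prev; delete seen }
{ q=p[4]; prev=$0; split($0,p) }
END { if(++seen[$4]==1 && q!=$4) print $0 }
EOF

sort -k4,4 file | awk -f ananother.awk
```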

A simpler way, if the input is comma-separated:

cut -d, -f1,3,5,6 file.csv | sort -u > uniq.txt

Note that cut always emits fields in file order regardless of the order they are listed in, and sort -u keeps one copy of each duplicated line rather than removing all of them, so this is only equivalent when that behaviour is acceptable.
