Remove all lines from file with duplicate value in field, including the first occurrence


Question


I would like to remove all the lines in my data file that contain a value in column 2 that is repeated in column 2 in other lines.

I've sorted by the value in column 2, but can't figure out how to use uniq on just one field, as the values are not necessarily the same length.

Alternatively, I can remove duplicate lines using an awk one-liner like

awk -F"[,]" '!_[$2]++'

but this retains the line containing the first occurrence of the repeated value in column 2.

As an example, if my data is

a,b,c
c,b,a
d,e,f
h,i,j
j,b,h

I would like to remove ALL lines (including the first) where b occurs in the second column. Like this:

d,e,f
h,i,j
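
For reference, here is what the one-liner above produces on this sample (assuming the data is in a file named file); it keeps the first line with b in column 2, which is exactly what I want to avoid:

$ awk -F"[,]" '!_[$2]++' file   # 'file' is a placeholder name for the data file
a,b,c
d,e,f
h,i,j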

Thanks for any advice!!


Answer 1:


If the order is not important, the following should work:

awk -F, '
!seen[$2]++ {                 # first time this column-2 value is seen:
    line[$2] = $0             # remember the line
}
END {
    for (val in seen)         # print only the values seen exactly once
        if (seen[val] == 1)
            print line[val]
}' file

Output

h,i,j
d,e,f
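
If the original input order matters, a sketch (not part of this answer) is to buffer every line with its key and replay them in order from the END block:

awk -F, '
{
    cnt[$2]++        # count occurrences of each column-2 value
    line[NR] = $0    # buffer every line in input order
    key[NR] = $2
}
END {
    for (i = 1; i <= NR; i++)      # replay in original order
        if (cnt[key[i]] == 1)      # keep only lines with a unique value
            print line[i]
}' file

This trades memory for order preservation; the two-pass approach in Answer 3 below avoids buffering the whole file.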



Answer 2:


Solution with grep:

grep -v -E '\b,b,\b' text.txt

Content of the file:

$ cat text.txt 
a,b,c
c,b,a
d,e,f
h,i,j
j,b,h
a,n,b
b,c,f

$ grep -v -E '\b,b,\b' text.txt 
d,e,f
h,i,j
a,n,b
b,c,f
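
Note that the pattern is hard-coded to the value b, and because \b,b,\b can match anywhere on the line, it would also drop lines with b in some middle column other than the second. A sketch (a variant not part of the original answer) that anchors the match to column 2 only:

$ grep -v -E '^[^,]*,b(,|$)' text.txt   # ^[^,]* pins the match to the first field
d,e,f
h,i,j
a,n,b
b,c,f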

Hope it helps




Answer 3:


A different awk approach:

awk -F, '
   BEGIN { f=0 }                            # f flags whether the dupes have been pruned
   FNR==NR { _[$2]++; next }                # first pass: count each column-2 value
   f==0 {                                   # first line of the second pass
      f=1
      for (j in _) if (_[j]>1) delete _[j]  # drop values seen more than once
   }
   $2 in _                                  # print lines whose value survived
' file file

Explanation

The awk passes through the file twice; that's why the filename appears twice at the end. On the first pass (when FNR==NR) I count the number of times each column-2 value appears in the array _[]. As the second pass begins (the f==0 block runs exactly once), I delete every element of _[] that was seen more than once. Then, on the second pass, I print the lines whose second field still appears in _[].
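
The same two-pass idea can be written more compactly by testing the count directly on the second pass, avoiding the delete loop (a sketch, equivalent in effect to the script above):

$ awk -F, 'NR==FNR { cnt[$2]++; next } cnt[$2] == 1' file file
d,e,f
h,i,j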



Source: https://stackoverflow.com/questions/22308082/remove-all-lines-from-file-with-duplicate-value-in-field-including-the-first-oc
