Remove entirely identical duplicate columns in Unix

Submitted by 瘦欲@ on 2020-07-10 10:42:45

Question


Let's say I have a file like the one below:

number 2 6 7 10 number 6 13  
name1 A B C D name1 B E   
name2 A B C D name2 B E  
name3 B A D A name3 A F  
name4 B A D A name4 A F  

I want to remove the columns that are entirely identical duplicates, so that the output file looks like this:

number 2 6 7 10 13  
name1 A B C D E   
name2 A B C D E  
name3 B A D A F  
name4 B A D A F  

I use the sort and uniq commands to do this for lines, but I don't know how to do it for columns. Can anyone suggest a good way?


Answer 1:


Here is a way with awk that preserves the column order:

awk 'NR==1{for(i=1;i<=NF;i++)b[$i]++&&a[i]}{for(i in a)$i="";gsub(" +"," ")}1' file

Output

number 2 6 7 10 13  
name1 A B C D E   
name2 A B C D E  
name3 B A D A F  
name4 B A D A F  

How it works

NR==1

If it is the first record

for(i=1;i<=NF;i++)

Loop over the fields; NF is the number of fields in the current record.

b[$i]++&&a[i]

If the value of $i (the data in field i of the first line) has already been seen, add an element with key i to array a. Only the first line is inspected, so two columns count as duplicates when their headers match.

The next block is executed on all records (including the first one).

{for(i in a)$i="";

For every key in a, set the corresponding field to the empty string.

gsub(" +"," ")

Collapse the runs of spaces left behind by the emptied fields into a single space.

1

The pattern 1 is always true, so every record is printed.
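
As a rough sketch, the one-liner above can be unrolled into a commented shell function. The function name dedup_by_header and the array names seen and drop are made up here for readability, and the final sub() that trims a trailing space is a small addition that is not in the original command.

```shell
# Hypothetical helper: pass a file name as argument, or pipe data on stdin.
dedup_by_header() {
  awk '
    NR == 1 {
      # Mark every column whose header text already appeared further left.
      for (i = 1; i <= NF; i++)
        if (seen[$i]++) drop[i] = 1
    }
    {
      # Blank the marked fields, then squeeze the leftover runs of spaces.
      for (i in drop) $i = ""
      gsub(/ +/, " ")
      sub(/ $/, "")   # addition: trim the space left when the last column is dropped
      print
    }
  ' "$@"
}

printf '%s\n' 'number 2 6 number 6 13' 'name1 A B name1 B E' | dedup_by_header
# prints:
# number 2 6 13
# name1 A B E
```

Like the original, this only looks at the first line when deciding which columns to drop.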




Answer 2:


This Perl one-liner will do the trick:

perl -an -e '@cols = grep { !$seen{$F[$_]}++ } 0..$#F unless @cols; print join(" ", @F[@cols]), "\n"' inputfile

-n loops over the lines of inputfile and -a splits each line into @F. The first line of the file is used to build the list of column indexes from left to right, keeping only those whose value has not been seen before. For every line, it then prints the slice of @F containing just those columns.




Answer 3:


You can use awk:

NR == 1 {
  for (ii = 1; ii <= NF; ii++) {
    cols[$ii] = ii
  }
  for (ii in cols) {
    printf "%s ", ii
  }
  print ""
}

NR > 1 {
  for (ii in cols) {
    printf "%s ", $cols[ii]
  }
  print ""
}

The above may reorder the columns, but a bit more effort could fix that if necessary.
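
One possible way to avoid the reordering, sketched under the same assumption as the script above (columns with the same header are duplicates), is to remember the kept column indexes in a numerically indexed array instead of looping with "for (ii in cols)". The names dedup_keep_order, seen and keep are hypothetical. Note that this variant keeps the first occurrence of each header, whereas the script above keeps the last.

```shell
# Hypothetical helper: pass a file name as argument, or pipe data on stdin.
dedup_keep_order() {
  awk '
    NR == 1 {
      # Walk the header left to right and remember the index of each
      # header value the first time it shows up.
      for (i = 1; i <= NF; i++)
        if (!seen[$i]++) keep[++n] = i
    }
    {
      # Print only the remembered columns, in their original order.
      for (k = 1; k <= n; k++)
        printf "%s%s", $(keep[k]), (k < n ? OFS : ORS)
    }
  ' "$@"
}

printf '%s\n' 'number 2 6 7 10 number 6 13' 'name1 A B C D name1 B E' | dedup_keep_order
# prints:
# number 2 6 7 10 13
# name1 A B C D E
```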




Answer 4:


Removing duplicate lines takes just one awk command:

awk '!a[$0]++'

This keeps track of how many times each line has appeared. The first time a line is seen, a[$0] is 0, so !a[$0] is true and the line is printed; the post-increment then sets the counter to 1. When the same line comes again, a[$0] is already non-zero, the ! negates it to false, and the line is not printed.
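
A quick way to see this in action is to feed the command a few repeated lines. The wrapper name dedup_lines and the array name seen are only for illustration.

```shell
# Hypothetical wrapper around the one-liner; reads a file argument or stdin.
dedup_lines() {
  # Print a line only the first time it is seen; input order is preserved.
  awk '!seen[$0]++' "$@"
}

printf '%s\n' apple banana apple cherry banana | dedup_lines
# prints:
# apple
# banana
# cherry
```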

In your case, you want to remove duplicate columns. So why not create a transpose function that converts rows into columns and vice versa?

I already did this in my answer to "Using bash to sort data horizontally":

transpose () {
  awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}
        END {for (i=1; i<=max; i++)
              {for (j=1; j<=NR; j++) 
                  printf "%s%s", a[i,j], (j<NR?OFS:ORS)
              }
        }'
}

Then, it becomes trivial:

$ cat file | transpose | awk '!a[$0]++' | transpose
number 2 6 7 10 13
name1 A B C D E
name2 A B C D E
name3 B A D A F
name4 B A D A F
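
Because the de-duplication runs on transposed rows, this approach compares whole columns rather than just their headers. The small sketch below tries to make that visible with made-up data in which three columns share the header "x" but only two of them are identical throughout. The wrapper name dedup_columns is hypothetical, and transpose is restated (slightly reformatted) so the snippet is self-contained.

```shell
# Same idea as the transpose function above: swap rows and columns of stdin.
transpose() {
  awk '{ for (i = 1; i <= NF; i++) a[i, NR] = $i; if (NF > max) max = NF }
       END { for (i = 1; i <= max; i++)
               for (j = 1; j <= NR; j++)
                 printf "%s%s", a[i, j], (j < NR ? OFS : ORS) }'
}

# Hypothetical wrapper: drop columns that repeat an earlier column exactly.
dedup_columns() {
  transpose | awk '!seen[$0]++' | transpose
}

# Columns 2 and 4 are identical; column 3 has the same header but differs in row r1.
printf '%s\n' 'id x x x' 'r1 A B A' 'r2 C C C' | dedup_columns
# prints:
# id x x
# r1 A B
# r2 C C
```

Only the fourth column is removed here. An approach keyed on the first line alone would also drop the third column, so which behaviour is wanted depends on whether the headers are guaranteed to identify the columns.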


Source: https://stackoverflow.com/questions/28162428/remove-entirely-same-duplicate-columns-in-unix
