remove entirely same duplicate columns in unix

问题

Let say I have a file as below:

number 2 6 7 10 number 6 13  
name1 A B C D name1 B E   
name2 A B C D name2 B E  
name3 B A D A name3 A F  
name4 B A D A name4 A F

I wish to remove the entirely the same duplicate columns and the output file is as below:

number 2 6 7 10 13  
name1 A B C D E   
name2 A B C D E  
name3 B A D A F  
name4 B A D A F

I use sort and uniq command for lines but never know how to do for columns. Can anyone suggest a good way?

回答1:

Here is a way with awk that preserves the order

awk 'NR==1{for(i=1;i<=NF;i++)b[$i]++&&a[i]}{for(i in a)$i="";gsub(" +"," ")}1' file

Output

number 2 6 7 10 13  
name1 A B C D E   
name2 A B C D E  
name3 B A D A F  
name4 B A D A F

How it works

NR==1

If it is the first record

for(i=1;i<=NF;i++)

A loop over the fields, NF is the number of fields

b[$i]++&&a[i]

If there has been more than one occurrence of $i (The data contained in field i), then add an element to array a with the key of i.

This next block is executed on all records(including record one).

{for(i in a)$i="";

For every key in a set the corresponding field to nothing.

gsub(" +"," ")

Remove extra spaces

Always evaluates to true so print all records.

回答2:

This Perl one-liner will do the trick:

perl -an -e '@cols = grep { !$seen{$F[$_]}++ } 0..$#F unless @cols; print join " ", @F[@cols],"\n"' inputfile

-a splits each line of inputfile into @F. The first line of the file is used to construct the list of column indexes from left to right, keeping only those which are unseen. Next it prints the slice of @F containing just those columns for each line.

回答3:

You can use awk:

NR == 1 {
  for (ii = 1; ii <= NF; ii++) {
    cols[$ii] = ii
  }
  for (ii in cols) {
    printf "%s ", ii
  }
  print ""
}

NR > 1 {
  for (ii in cols) {
    printf "%s ", $cols[ii]
  }
  print ""
}

The above may reorder the columns, but a bit more effort could fix that if necessary.

回答4:

Removing duplicates lines can be done in just one awk command:

awk '!a[$0]++'

This gets track of the amount of times a line appeared. Once a line appeared, a[this row] equals 1, so when it comes again a[this row] is already True and the ! negates the condition, so it is not printed.

In your case, you want to remove the duplicate columns. But what about creating a function transpose to convert rows into columns and viceversa?

I already did it in my answer to Using bash to sort data horizontally:

transpose () {
  awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}
        END {for (i=1; i<=max; i++)
              {for (j=1; j<=NR; j++) 
                  printf "%s%s", a[i,j], (j<NR?OFS:ORS)
              }
        }'
}

Then, it becomes trivial:

$ cat file | transpose | awk '!a[$0]++' | transpose
number 2 6 7 10 13
name1 A B C D E
name2 A B C D E
name3 B A D A F
name4 B A D A F

来源：https://stackoverflow.com/questions/28162428/remove-entirely-same-duplicate-columns-in-unix

标签

sorting

unix

awk