Question
Let's say I have a file like the one below:
number 2 6 7 10 number 6 13
name1 A B C D name1 B E
name2 A B C D name2 B E
name3 B A D A name3 A F
name4 B A D A name4 A F
I want to remove the columns that are exact duplicates of an earlier column, so that the output file looks like this:
number 2 6 7 10 13
name1 A B C D E
name2 A B C D E
name3 B A D A F
name4 B A D A F
I use the sort and uniq commands to do this for lines, but I don't know how to do it for columns. Can anyone suggest a good way?
Answer 1:
Here is a way with awk that preserves the column order:
awk 'NR==1{for(i=1;i<=NF;i++)b[$i]++&&a[i]}{for(i in a)$i="";gsub(" +"," ")}1' file
Output
number 2 6 7 10 13
name1 A B C D E
name2 A B C D E
name3 B A D A F
name4 B A D A F
How it works
NR==1
    If it is the first record (the header line).
for(i=1;i<=NF;i++)
    A loop over the fields; NF is the number of fields.
b[$i]++&&a[i]
    If $i (the data contained in field i) has already occurred in an earlier field, add an element to array a with the key i.
This next block is executed on all records (including record one).
{for(i in a)$i="";
    For every key in a, set the corresponding field to the empty string.
gsub(" +"," ")
    Collapse the runs of spaces left behind into single spaces.
1
    Always evaluates to true, so every record is printed.
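For readers who find the one-liner dense, the same steps can be spelled out as below. This is only a sketch: the function name dedup_by_header and the array names seen and drop are made up for illustration, and the extra sub() call is an addition that trims the trailing space left when the last column is dropped. Like the one-liner, it compares only the header row and assumes single-space-separated, non-empty fields.

```shell
# dedup_by_header is a hypothetical name; this is an expanded form of the one-liner.
dedup_by_header() {
    awk '
    NR == 1 {
        # Mark every column whose header text has already appeared.
        for (i = 1; i <= NF; i++)
            if (seen[$i]++)
                drop[i]        # referencing the element is enough to create it
    }
    {
        # Blank the marked fields, then squeeze the leftover spaces.
        for (i in drop)
            $i = ""
        gsub(/ +/, " ")
        sub(/ $/, "")          # addition: trim the space left by a dropped last column
        print
    }' "$@"
}

printf '%s\n' 'number 2 6 number 6' 'name1 A B name1 B' | dedup_by_header
```

With the two sample lines above, this should print "number 2 6" and "name1 A B".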
Answer 2:
This Perl one-liner will do the trick:
perl -an -e '@cols = grep { !$seen{$F[$_]}++ } 0..$#F unless @cols; print join(" ", @F[@cols]), "\n"' inputfile
The -a switch splits each line of inputfile into @F. The first line of the file is used to build the list of column indexes, from left to right, keeping only those whose header has not been seen before. Each line is then printed as the slice of @F containing just those columns.
Answer 3:
You can use awk:
NR == 1 {
    for (ii = 1; ii <= NF; ii++) {
        cols[$ii] = ii
    }
    for (ii in cols) {
        printf "%s ", ii
    }
    print ""
}
NR > 1 {
    for (ii in cols) {
        printf "%s ", $cols[ii]
    }
    print ""
}
The above may reorder the columns, because awk does not specify the order in which for (ii in cols) visits the keys, but a bit more effort could fix that if necessary.
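One possible way to keep the original column order is to store the kept column indexes in a numerically indexed array and walk it with a counting loop. The sketch below is not the answerer's code: the function name keep_first_columns and the arrays seen and keep are hypothetical, and it still looks only at the header row to decide which columns are duplicates.

```shell
# keep_first_columns is a hypothetical name for this order-preserving sketch.
keep_first_columns() {
    awk '
    NR == 1 {
        # Record, left to right, the index of the first column carrying each header.
        for (i = 1; i <= NF; i++)
            if (!($i in seen)) {
                seen[$i]
                keep[++n] = i
            }
    }
    {
        # Print the kept columns in their original order, with no trailing space.
        for (k = 1; k <= n; k++)
            printf "%s%s", $(keep[k]), (k < n ? OFS : ORS)
    }' "$@"
}

printf '%s\n' 'number 2 6 7 10 number 6 13' 'name1 A B C D name1 B E' | keep_first_columns
```

Because keep is filled and read with consecutive integer keys, the output order no longer depends on how awk iterates over an associative array.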
Answer 4:
Removing duplicate lines can be done with a single awk command:
awk '!a[$0]++'
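This keeps track of the number of times each line has appeared. The first time a line is seen, a[$0] is still 0, so !a[$0] is true and the line is printed; the ++ then sets a[$0] to 1. When the same line comes again, a[$0] is already non-zero (true), the ! negates the condition, and the line is not printed.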
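To make that logic easier to follow, here is the same idea written out with an explicit if. It is a small sketch for illustration only; the function name first_seen_lines and the sample words are made up.

```shell
# first_seen_lines is a hypothetical name; the body is the one-liner spelled out.
first_seen_lines() {
    awk '{
        if (a[$0] == 0)   # a line not seen before has an empty (zero) counter
            print
        a[$0]++           # count this occurrence
    }' "$@"
}

printf '%s\n' apple banana apple cherry banana | first_seen_lines
```

With the five input lines above, this should print apple, banana and cherry once each, in the order they were first seen.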
In your case, you want to remove the duplicate columns. So what about creating a transpose function that converts rows into columns and vice versa?
I already wrote one in my answer to "Using bash to sort data horizontally":
transpose () {
    awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}
         END {for (i=1; i<=max; i++)
                  {for (j=1; j<=NR; j++)
                       printf "%s%s", a[i,j], (j<NR?OFS:ORS)
                  }
             }'
}
Then, it becomes trivial:
$ cat file | transpose | awk '!a[$0]++' | transpose
number 2 6 7 10 13
name1 A B C D E
name2 A B C D E
name3 B A D A F
name4 B A D A F
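One property of this pipeline is that it compares whole columns, not just their headers, because each column becomes a complete line before duplicates are dropped. The sketch below illustrates that with made-up data; dedup_columns is a hypothetical wrapper name, and transpose is repeated (lightly reformatted) so the snippet runs on its own.

```shell
# Same transpose logic as above, repeated so this snippet is self-contained.
transpose() {
    awk '{ for (i = 1; i <= NF; i++) a[i, NR] = $i; if (NF > max) max = NF }
         END { for (i = 1; i <= max; i++)
                   for (j = 1; j <= NR; j++)
                       printf "%s%s", a[i, j], (j < NR ? OFS : ORS) }'
}

# dedup_columns is a hypothetical wrapper around the pipeline shown above.
dedup_columns() {
    transpose | awk '!a[$0]++' | transpose
}

# Columns 1 and 3 share the header "x" but hold different data;
# column 4 is an exact copy of column 2.
printf '%s\n' 'x y x y' '1 2 3 2' | dedup_columns
```

Here the output should be "x y x" followed by "1 2 3": the fourth column is dropped as a full duplicate, while the third is kept because its data differs, which a header-only check would not do.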
Source: https://stackoverflow.com/questions/28162428/remove-entirely-same-duplicate-columns-in-unix