Question
I have the following tab-separated file:
A1 A1 0 0 2 1 1 1 1 1 1 1 2 1 1 1
A2 A2 0 0 2 1 1 1 1 1 1 1 1 1 1 1
A3 A3 0 0 2 2 1 1 2 2 1 1 1 1 1 1
A5 A5 0 0 2 2 1 1 1 1 1 1 1 2 1 1
The idea is to summarise the information between column 7 (included) and the end in a new column that is added at the end of the file.
To do so, these are the rules:
If the total number of “2”s in the row (between column 7 and the end) is 0: add “1 1” to the new last column
If the total number of “2”s in the row (between column 7 and the end) is 1: add “1 2” to the new last column
If the total number of “2”s in the row (between column 7 and the end) is 2 or more: add “2 2” to the new last column
I started to extract the columns I want to work on using the command:
awk '{for (i = 7; i <= NF; i++) printf $i " "; print ""}' myfile.ped > tmp_myfile.txt
Then I count the number of occurrences of "2" in each row using:
sed 's/[^2]//g' tmp_myfile.txt | awk '{print NR, length }' > tmp_occurences.txt
Which outputs:
1 1
2 0
3 2
4 1
Then my idea was to write a loop that goes through the lines and adds the new summary column. I was thinking of this kind of structure, based on what I found here: http://www.thegeekstuff.com/2010/06/bash-if-statement-examples:
while read line ;
do
set $line
If ["$2"==0]
then
$3=="1 1"
elif ["$2"==1 ]
then
$3=="1 2”
elif ["$2">=2 ]
then
$3==“2 2”
else
print ["error"]
fi
done < tmp_occurences.txt
But I am stuck here. Do I have to create the new column before starting the loop? Am I going in the right direction?
Ideally, the final output (after merging the first 6 columns from the initial file and the summary column) would be:
A1 A1 0 0 2 1 1 2
A2 A2 0 0 2 1 1 1
A3 A3 0 0 2 2 2 2
A5 A5 0 0 2 2 1 2
Thank you for your help!
Answer 1:
Using gnu-awk you can do:
awk -v OFS='\t' '{
c=0;
for (i=7; i<=NF; i++)
if ($i==2)
c++
if (c==0)
s="1 1"
else if (c==1)
s="1 2"
else
s="2 2"
NF=6
print $0, s
}' file
A1 A1 0 0 2 1 1 2
A2 A2 0 0 2 1 1 1
A3 A3 0 0 2 2 2 2
A5 A5 0 0 2 2 1 2
PS: If not using gnu-awk you can use:
awk -v OFS='\t' '{c=0; for (i=7; i<=NF; i++) {if ($i==2) c++; $i=""} if (c==0) s="1 1"; else if (c==1) s="1 2"; else s="2 2"; NF=6; print $0, s}' file
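If assigning to NF is a concern (the GNU awk manual cautions that some awk versions do not rebuild $0 when NF is decreased), a minimal sketch that avoids touching NF is to print the first six fields explicitly. The function name summarise_no_nf is only an illustration, and the sample rows are the ones from the question:

```shell
# Hypothetical helper: reads the file(s) given as arguments, or stdin if none.
summarise_no_nf() {
    awk -v OFS='\t' '{
        c = 0
        for (i = 7; i <= NF; i++)   # count fields equal to 2 from column 7 on
            if ($i == 2)
                c++
        if (c == 0)
            s = "1 1"
        else if (c == 1)
            s = "1 2"
        else
            s = "2 2"
        # print columns 1-6 explicitly instead of truncating the record via NF
        print $1, $2, $3, $4, $5, $6, s
    }' "$@"
}

# Sample rows from the question, converted to tab-separated input
printf '%s\n' \
    'A1 A1 0 0 2 1 1 1 1 1 1 1 2 1 1 1' \
    'A2 A2 0 0 2 1 1 1 1 1 1 1 1 1 1 1' \
    'A3 A3 0 0 2 2 1 1 2 2 1 1 1 1 1 1' \
    'A5 A5 0 0 2 2 1 1 1 1 1 1 1 2 1 1' |
    tr ' ' '\t' | summarise_no_nf
```

The summary column should come out as 1 2, 1 1, 2 2 and 1 2 for the four rows. The trade-off is that the number of leading columns is hard-coded in the print statement.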
Answer 2:
With GNU awk for the 3rd arg to match():
$ awk '{match($0,/((\S+\s+){6})(.*)/,a); c=gsub(2,2,a[3]); print a[1] (c>1?2:1), (c>0?2:1)}' file
A1 A1 0 0 2 1 1 2
A2 A2 0 0 2 1 1 1
A3 A3 0 0 2 2 2 2
A5 A5 0 0 2 2 1 2
With other awks you'd replace \S/\s with [^[:space:]]/[[:space:]] and use substr() instead of a[].
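A rough sketch of what that could look like, assuming an awk that supports POSIX character classes. The function name summarise_posix is made up for the example, and the six-field pattern is built by string concatenation in BEGIN (my own choice, not part of the command above) so that it does not rely on {6} interval support, which some awks lack:

```shell
# Hypothetical helper: reads the file(s) given as arguments, or stdin if none.
summarise_posix() {
    awk 'BEGIN {
        # build "^(field + separator)" repeated six times as a dynamic regex
        re = "^"
        for (i = 1; i <= 6; i++)
            re = re "[^[:space:]]+[[:space:]]+"
    }
    match($0, re) {
        head = substr($0, 1, RLENGTH)    # columns 1-6 with their separators
        tail = substr($0, RLENGTH + 1)   # column 7 to the end of the line
        c = gsub(/2/, "2", tail)         # gsub() returns the number of matches
        print head (c > 1 ? 2 : 1), (c > 0 ? 2 : 1)
    }' "$@"
}

# One sample row from the question (tab-separated)
printf 'A5\tA5\t0\t0\t2\t2\t1\t1\t1\t1\t1\t1\t1\t2\t1\t1\n' | summarise_posix
```

As in the gawk version, gsub() counts every "2" character after column 6, so this only works when the values are single digits.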
Answer 3:
We can keep the format by using gensub() and capturing groups: we capture the first 6 fields and replace the whole line with them plus the calculated values:
awk '{for (i=7; i<=NF; i++) {
if ($i==2)
twos+=1 # count number of 2's from 7th to last field
}
f7=1; f8=1 # default value of the two summary fields (7th and 8th)
if (twos)
f8=2 # set 8th = 2 if the count is > 0
if (twos>1)
f7=2 # set 7th = 2 if the count is > 1
$0=gensub(/^((\S+\s*){6}).*/,"\\1 " f7 FS f8, 1) # perform the replacement
twos=0 # reset counter
}1' file
As a one-liner:
$ awk '{for (i=7; i<=NF; i++) {if ($i==2) twos+=1} f7=1; f8=1; if (twos) f8=2; if (twos>1) f7=2; $0=gensub(/^((\S+\s*){6}).*/,"\\1 " f7 FS f8,1); twos=0}1' file
A1 A1 0 0 2 1 1 2
A2 A2 0 0 2 1 1 1
A3 A3 0 0 2 2 2 2
A5 A5 0 0 2 2 1 2
A3 A3 0 0 2 2 2 2
A5 A5 0 0 2 2 1 2
Answer 4:
$ cat > test.awk
{
for(i=1;i<=NF;i++) { # for every field
if(i<7)
printf "%s%s", $i,OFS # only output the first 6
else a[$i]++ # count the values of the remaining fields
}
print (a[2]>1?"2 2":(a[2]==1?"1 2":"1 1")) # output logic
delete a # reset a for next record
}
$ awk -f test.awk test
A1 A1 0 0 2 1 1 2
A2 A2 0 0 2 1 1 1
A3 A3 0 0 2 2 2 2
A5 A5 0 0 2 2 1 2
Borrowing some ideas from @anubhava's solution above:
$ cat > another.awk
{
for(i=7;i<=NF;i++)
a[$i]++ # count 2s
NF=6 # truncate $0
print $0 OFS (a[2]<2?"1 "(a[2]?"2":"1"):"2 2") # print $0 followed by "1 1", "1 2" or "2 2" depending on the count of 2s
delete a # reset a for next record
}
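As for the loop in the question: there is no need to create the column beforehand, but the syntax needs a few fixes. `[` needs spaces around its arguments, numeric comparisons use -eq / -ge, assignment is a single = with no spaces, and a positional parameter such as $3 cannot be assigned to. A minimal sketch of a shell-only version that works directly on the original file, without the temporary files, could look like this (the function and variable names are illustrative, not from the question):

```shell
# Hypothetical helper: reads the tab-separated data from stdin.
summarise_loop() {
    # read splits on whitespace: six named columns, everything else in $rest
    while read -r c1 c2 c3 c4 c5 c6 rest; do
        twos=0
        # $rest is left unquoted on purpose so that it splits into fields
        for value in $rest; do
            if [ "$value" = 2 ]; then
                twos=$((twos + 1))
            fi
        done
        if [ "$twos" -eq 0 ]; then
            summary="1 1"
        elif [ "$twos" -eq 1 ]; then
            summary="1 2"
        else
            summary="2 2"
        fi
        printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
            "$c1" "$c2" "$c3" "$c4" "$c5" "$c6" "$summary"
    done
}

# One sample row from the question (tab-separated)
printf 'A1\tA1\t0\t0\t2\t1\t1\t1\t1\t1\t1\t1\t2\t1\t1\t1\n' | summarise_loop
```

With the real file this would be called as summarise_loop < myfile.ped. For large files the awk solutions are likely to be much faster than a shell loop.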
Source: https://stackoverflow.com/questions/39164158/bash-summarising-information-from-several-fields-in-unique-field-using-loop-an