Question
Given the following input, we would like to, firstly, detect whether a cpd_number ($2) is unique in the file and, if so, remove that whole row. In this case, the line with "cpd-6666666" should be removed.
Secondly, when multiple lines are kept under the same cpd_number, print only the two lines that have the max and min log_ratio ($17).
targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-1.176091259,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.602059991,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
The ideal output should be:
targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
I was trying to use an awk count function, but it doesn't seem to work quite well. Could any guru kindly give some comments? Thanks!
Answer 1:
While not as concise as the perl answer, here's an executable awk file:
#!/usr/bin/awk -f
BEGIN { FS="," }

# print the header row unchanged
NR==1 { print; next }

{
    # store each whole line keyed by (cpd_number, log_ratio), and track the
    # highest and lowest log_ratio seen so far for each cpd_number
    a[$2,$17] = $0
    h = high[$2]
    high[$2] = ($17 > h || h == "") ? $17 : h
    l = low[$2]
    low[$2] = ($17 < l || l == "") ? $17 : l
}

END {
    # a cpd_number whose high and low differ had more than one line;
    # print the stored lines holding its max and min log_ratio
    for (i in high) {
        if (low[i] != high[i]) {
            print a[i, high[i]]
            print a[i, low[i]]
        }
    }
}
which:
- Prints the header row
- Stores the whole lines in a, and the highs and lows for each key
- In the END block, walks the high array and prints the lines where the high and low differ, retrieving them from a
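To run it, save the script and mark it executable (a minimal sketch; the names minmax.awk and input.csv are placeholders, not from the original post):

chmod +x minmax.awk
./minmax.awk input.csv

or, without the executable bit:

awk -f minmax.awk input.csv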
Answer 2:
It will be easier if you give it a try with perl:
perl -MList::Util=max,min -F, -lane '
print if $.==1;
$line{$F[1]}{$F[16]}=$_;
$count{$F[1]}++
}{
for $key (keys %count) {
next if $count{$key}==1;
print $line{$key}{ max keys %{$line{$key}} };
print $line{$key}{ min keys %{$line{$key}} }
}
' file
Output:
targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
Explanation:
- Use a core module (comes with every perl binary) to find the min and max values of $17.
- Using -F, we split the lines on ",". The -a option splits each line on that delimiter into the @F array.
- If it is our first line, we print it.
- We create a hash of hashes with the second column as the first-level key and the 17th column as the second-level key. The value is the entire line.
- We create a counter hash to keep counts of the second column.
- In the END block (created here by the bare }{; see the sketch below) we iterate through the keys of the counter hash. If the count is one we skip that key.
- If the count is greater than one we print the lines with the min and max of the 17th column.
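The }{ works because -n wraps the code in a while(<>) loop: the stray } closes that loop and the { opens a block that runs once after the last line, much like an END block. Roughly, this is what perl builds (a sketch, not the exact code perl generates; -l adds the chomp and -a with -F, add the split):

while (defined($_ = <ARGV>)) {
    chomp $_;               # from -l
    @F = split /,/, $_;     # from -a with -F,
    # code before }{ runs here, once per input line
}
{
    # code after }{ runs here, once, after all input has been read
}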
Source: https://stackoverflow.com/questions/24543486/awk-bash-remove-lines-with-an-unique-id-and-keep-the-lines-that-has-the-max-min