Question
Given the following input, we would like to, firstly, detect whether a cpd_number ($2) is unique in the file and, if so, remove that whole row. In this case, the line with "cpd-6666666" should be removed.
Secondly, when multiple lines are kept under the same cpd_number, print only the two lines that have the max and min log_ratio ($17).
targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-1.176091259,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.602059991,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
The ideal output should be:
targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
I was trying to use an awk count function, but it doesn't seem to work quite well. Could any guru kindly give some comments? Thanks!
Answer 1:
While not as concise as the perl answer, here's an executable awk file:
#!/usr/bin/awk -f
BEGIN { FS="," }

# print the header row unchanged
NR==1 { print; next }

{
    # store each whole line keyed by (cpd_number, log_ratio), and track the
    # highest and lowest log_ratio seen so far for each cpd_number
    a[$2,$17] = $0
    h = high[$2]
    high[$2] = ($17 > h || h == "") ? $17 : h
    l = low[$2]
    low[$2] = ($17 < l || l == "") ? $17 : l
}

END {
    # a cpd_number whose high and low differ had more than one line;
    # print the stored lines holding its max and min log_ratio
    for (i in high) {
        if (low[i] != high[i]) {
            print a[i, high[i]]
            print a[i, low[i]]
        }
    }
}
which:
- Prints the header row
- Stores the whole lines in a, and the highs and lows for each key
- In the END block, walks the high array and prints the lines where the high and low differ, retrieving them from a
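To run it, save the script and mark it executable (a minimal sketch; the names minmax.awk and input.csv are placeholders, not from the original post):

chmod +x minmax.awk
./minmax.awk input.csv

or, without the executable bit:

awk -f minmax.awk input.csv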
Answer 2:
It will be easier if you give it a try with perl:
perl -MList::Util=max,min -F, -lane '
print if $.==1;
$line{$F[1]}{$F[16]}=$_;
$count{$F[1]}++
}{
for $key (keys %count) {
next if $count{$key}==1;
print $line{$key}{ max keys %{$line{$key}} };
print $line{$key}{ min keys %{$line{$key}} }
}
' file
Output:
targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
Explanation:
- Use a core module (comes with every perl binary) to find the min and max values of $17.
- Using -F, we split the lines on ",". The -a option splits each line on that delimiter into the @F array.
- If it is our first line, we print it.
- We create a hash of hashes with the second column as the first-level key and the 17th column as the second-level key. The value is the entire line.
- We create a counter hash to keep counts of the second column.
- In the END block (created here by the bare }{; see the sketch below) we iterate through the keys of the counter hash. If the count is one we skip that key.
- If the count is greater than one we print the lines with the min and max of the 17th column.
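The }{ works because -n wraps the code in a while(<>) loop: the stray } closes that loop and the { opens a block that runs once after the last line, much like an END block. Roughly, this is what perl builds (a sketch, not the exact code perl generates; -l adds the chomp and -a with -F, add the split):

while (defined($_ = <ARGV>)) {
    chomp $_;               # from -l
    @F = split /,/, $_;     # from -a with -F,
    # code before }{ runs here, once per input line
}
{
    # code after }{ runs here, once, after all input has been read
}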
Source: https://stackoverflow.com/questions/24543486/awk-bash-remove-lines-with-an-unique-id-and-keep-the-lines-that-has-the-max-min