how to conditionally filter rows in awk

夙愿已清 提交于 2021-02-08 12:01:25

问题


I am new to awk in linux. I have a large text file with 17 Million rows. The first column is subject ID and the second column is Age. Each subject may have multiple ages and I just want to filter the minimum age for each subject and print them in a separate text file. I am not sure if the subjects are ranked in first column from low to high... these are the first few rows:

ID          Age
16214497  36.000
16214497  63.000
16214727  63.000
16214781  71.000
16214781  79.000
16214792  67.000
16214860  79.000
16214862  62.000
16214874  61.000

回答1:


if the file is not sorted you need to keep the records in memory to find the min. If you need to sort, this might be better

$ sed 1d file         |   # remove header
  sort -k1,1 -k2n     |   # sort by ID, then by age, numerically
  uniq -w8            |   # find the first unique record by ID only
  sed '1iID  Min_Age' |   # insert back the new header
  column -t               # pretty print

ID        Min_Age
16214497  36.000
16214727  63.000
16214781  71.000
16214792  67.000
16214860  79.000
16214862  62.000
16214874  61.000



回答2:


Try (just awk with no pipes, using memory to retain values) :

$ awk '
    NR=1{print; next}                     # ¹
    arr[$1]==0 {arr[$1]=$2}               # ²
    ($2 < arr[$1]) {arr[$1]=$2}           # ³
    END{for (i in arr) {print i, arr[i]}} # ⁴
' file

The real command line :

(if multi-lines makes you fear)

awk 'NR=1{print; next} arr[$1]==0 {arr[$1]=$2} ($2 < arr[$1]) {arr[$1]=$2} END{for (i in arr) {print i, arr[i]}}' x.txt

(but works too with newlines and comments, up2u)

Comments :

  • ¹ print, then SKIP 1st line
  • ² If the value of arr[key] is null, then we feed arr[key] with 2th column, creating the array on the fly (first column as key).
  • ³ if second column is less than arr[key], then new value from second column is assigned to arr[key]
  • ⁴ @the end of treating all lines, we print the keys and values of the array

Output :

ID          Age
16214497 36.000
16214727 63.000
16214781 71.000
16214792 67.000
16214860 79.000
16214862 62.000
16214874 61.000



回答3:


$ tail +2 file | sort | awk '!seen[$1]++'
16214497  36.000
16214727  63.000
16214781  71.000
16214792  67.000
16214860  79.000
16214862  62.000
16214874  61.000


来源:https://stackoverflow.com/questions/48832648/how-to-conditionally-filter-rows-in-awk

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!