Question
I am new to awk on Linux. I have a large text file with 17 million rows. The first column is the subject ID and the second column is the age. Each subject may have multiple ages, and I want to keep only the minimum age for each subject and print the results to a separate text file. I am not sure whether the file is sorted by subject ID. These are the first few rows:
ID Age
16214497 36.000
16214497 63.000
16214727 63.000
16214781 71.000
16214781 79.000
16214792 67.000
16214860 79.000
16214862 62.000
16214874 61.000
Answer 1:
If the file is not sorted, you need to keep records in memory to find the minimum. If you are going to sort anyway, this might be better:
$ sed 1d file | # remove header
sort -k1,1 -k2n | # sort by ID, then by age, numerically
uniq -w8 | # keep the first line per ID (compares only the first 8 characters; GNU uniq)
sed '1iID Min_Age' | # insert the new header (GNU sed syntax)
column -t # pretty print
ID Min_Age
16214497 36.000
16214727 63.000
16214781 71.000
16214792 67.000
16214860 79.000
16214862 62.000
16214874 61.000
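Since the question is unsure whether the file is already ordered by ID, here is a minimal sketch of checking that first and, if it is, streaming through the file while holding only the current ID and its minimum. The file names `ages_sample.txt` and `min_age_sorted.txt` are made up for illustration, and the sample rows are adapted from the question (two rows swapped so the minimum is not always first).

```shell
# Hypothetical sample file; replace with the real data file.
cat > ages_sample.txt <<'EOF'
ID Age
16214497 36.000
16214497 63.000
16214727 63.000
16214781 79.000
16214781 71.000
EOF

# sort -c only checks the order and exits non-zero if the IDs are not sorted.
if awk 'NR > 1 { print $1 }' ages_sample.txt | sort -c 2>/dev/null; then
    awk '
        NR == 1 { print $1, "Min_Age"; next }            # rewrite the header
        NR > 2 && $1 != id { print id, min }             # previous ID is finished
        NR == 2 || $1 != id { id = $1; min = $2; next }  # start a new ID
        $2 + 0 < min + 0 { min = $2 }                    # smaller age for the same ID
        END { if (NR > 1) print id, min }                # flush the last ID
    ' ages_sample.txt > min_age_sorted.txt
else
    echo "IDs are not sorted; sort first or use an in-memory approach" >&2
fi
```

This avoids the sort entirely when the input is already ordered, which may matter with 17 million rows. If the check fails, the sort pipeline above still applies.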
Answer 2:
Try this (plain awk with no pipes; it keeps one value per ID in memory):
$ awk '
NR==1 {print; next} # ¹
!($1 in arr) {arr[$1]=$2} # ²
($2 < arr[$1]) {arr[$1]=$2} # ³
END{for (i in arr) {print i, arr[i]}} # ⁴
' file
The same command on one line
(if the multi-line form puts you off):
awk 'NR==1 {print; next} !($1 in arr) {arr[$1]=$2} ($2 < arr[$1]) {arr[$1]=$2} END{for (i in arr) {print i, arr[i]}}' file
(The version with newlines and comments works too; it is up to you.)
Comments:
- ¹ Print the first line (the header), then skip to the next line.
- ² If there is no entry in arr for this key yet, store the 2nd column in arr[key]. The array is created on the fly, with the first column as the key.
- ³ If the second column is less than arr[key], assign the second column to arr[key] as the new minimum.
- ⁴ At the end, after all lines have been processed, print the keys and values of the array.
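Output (the order of a for-in loop over an awk array is unspecified, so the rows may come out in a different order):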
ID Age
16214497 36.000
16214727 63.000
16214781 71.000
16214792 67.000
16214860 79.000
16214862 62.000
16214874 61.000
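As a variation on the array approach above, here is a rough sketch that writes the result to a separate file, as the question asks, and pipes the END output through sort because the for-in order is unspecified. It tests membership with `in`, so an age of 0 would not be mistaken for a missing entry. The names `ages_unsorted.txt`, `min_age.txt` and the array name `min` are assumptions for illustration only.

```shell
# Hypothetical sample, unsorted on purpose; replace with the real data file.
cat > ages_unsorted.txt <<'EOF'
ID Age
16214781 79.000
16214497 63.000
16214781 71.000
16214497 36.000
16214727 63.000
EOF

{
    echo "ID Min_Age"
    awk '
        NR == 1 { next }                                        # skip the header
        !($1 in min) || $2 + 0 < min[$1] + 0 { min[$1] = $2 }   # first or smaller age
        END { for (id in min) print id, min[id] }               # unspecified order
    ' ages_unsorted.txt | sort -k1,1n                           # order rows by ID
} > min_age.txt
```

Only one value per distinct ID is held in memory, so the footprint depends on the number of subjects rather than the number of rows.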
Answer 3:
$ tail -n +2 file | sort -k1,1 -k2,2n | awk '!seen[$1]++'
16214497 36.000
16214727 63.000
16214781 71.000
16214792 67.000
16214860 79.000
16214862 62.000
16214874 61.000
Source: https://stackoverflow.com/questions/48832648/how-to-conditionally-filter-rows-in-awk