median of column with awk

[亡魂溺海] submitted on 2020-04-08 02:00:10

Question


How can I use AWK to compute the median of a column of numerical data?

I can think of a simple algorithm but I can't seem to program it:

What I have so far is:

sort | awk 'END{print NR}' 

And this gives me the number of elements in the column. I'd like to use this to print a certain row (NR/2). If NR/2 is not an integer, then I round up to the nearest integer and that is the median, otherwise I take the average of (NR/2)+1 and (NR/2)-1.


Answer 1:


This awk program assumes one column of numerically sorted data:

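#!/usr/bin/awk -f
# Store every value of column 1; the input must already be sorted numerically.
{
    count[NR] = $1;
}
END {
    if (NR % 2) {
        # Odd number of rows: the middle row is the median.
        print count[(NR + 1) / 2];
    } else {
        # Even number of rows: average the two middle rows.
        print (count[(NR / 2)] + count[(NR / 2) + 1]) / 2.0;
    }
}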

Sample usage:

sort -n data_file | awk -f median.awk
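
When the data sits in a column other than the first, the same logic can be wrapped in a small shell function. This is only a sketch: median_col is a hypothetical name, and it assumes whitespace-separated fields and a 1-based column number passed as the first argument.

```shell
# Hypothetical helper: median of one column read from standard input.
# $1 = column number (assumed 1-based, whitespace-separated fields).
median_col() {
    awk -v c="$1" '{ print $c }' |   # extract the requested column
    sort -n |                        # sort the values numerically
    awk '{ v[NR] = $1 }
         END {
             if (NR == 0) exit 1     # no input: nothing to print
             if (NR % 2) print v[(NR + 1) / 2]
             else print (v[NR / 2] + v[NR / 2 + 1]) / 2
         }'
}

printf '%s\n' 'a 3' 'b 1' 'c 2' | median_col 2         # odd count, prints 2
printf '%s\n' 'a 4' 'b 1' 'c 3' 'd 2' | median_col 2   # even count, prints 2.5
```

Extracting the column before sorting keeps the sort step independent of the field layout.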



Answer 2:


With awk you have to store the values in an array and compute the median at the end, assuming we look at the first column:

sort -n file | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'

That prints the upper of the two middle values when the count is even. For a true median, average the two middle values in that case:

sort -n file | awk ' { a[i++]=$1; }
    END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1]; }'
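
The store-then-compute pattern above can also do the sorting inside awk, so no external sort is needed. The following is a minimal sketch under stated assumptions: median_awk is a hypothetical name, the input is one number per line, and the insertion sort is quadratic, so it only suits small inputs.

```shell
# Hypothetical helper: median of column 1 using awk alone (no external sort).
median_awk() {
    awk '
    { a[NR] = $1 + 0 }                 # store each value as a number
    END {
        n = NR
        if (n == 0) exit 1             # no input: nothing to print
        # Insertion sort of a[1..n] in ascending order.
        for (i = 2; i <= n; i++) {
            v = a[i]
            for (j = i - 1; j >= 1 && a[j] > v; j--)
                a[j + 1] = a[j]
            a[j + 1] = v
        }
        if (n % 2) print a[(n + 1) / 2]
        else print (a[n / 2] + a[n / 2 + 1]) / 2
    }'
}

printf '%s\n' 5 6 4 2 7 9 3 1 8 | median_awk   # prints 5
printf '%s\n' 10 2 8 4 | median_awk            # prints 6
```

For large files, piping through sort -n as in the commands above is likely the better choice.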



Answer 3:


OK, just saw this topic and thought I could add my two cents, since I looked for something similar in the past. Even though the title says awk, all the answers make use of sort as well. Calculating the median for a column of data can be easily accomplished with datamash:

> seq 10 | datamash median 1
5.5

Note that sort is not needed, even if you have an unsorted column:

> seq 10 | gshuf | datamash median 1
5.5

The documentation gives all the functions it can perform, and good examples as well for files with many columns. Anyway, it has nothing to do with awk, but I think datamash is of great help in cases like this, and could also be used in conjunction with awk. Hope it helps somebody!
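
As one way to combine the two tools, awk can filter or reshape the rows and datamash can then compute the median. This is only a sketch: it assumes datamash is installed, and the sample "name value" rows and the positive-value filter are made up for illustration.

```shell
# Hypothetical input: "name value" rows; keep only rows with a positive value,
# print that value, and let datamash compute the median of the result.
printf '%s\n' 'a 4' 'b -1' 'c 10' 'd 6' |
    awk '$2 > 0 { print $2 }' |
    datamash median 1
```

The remaining values here are 4, 10 and 6, so the median is the middle one after ordering.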




Answer 4:


This AWK-based answer to a similar question on unix.stackexchange.com gives the same results as Excel when calculating the median.




Answer 5:


If the values are in a Bash array, you can compute the median like this (it uses a one-line version of Johnsyweb's solution):

array=(5 6 4 2 7 9 3 1 8) # numbers 1-9, unsorted
# Print one value per line, sort numerically, then pick the middle value(s).
median=$(printf '%s\n' "${array[@]}" | sort -n |
    awk '{arr[NR]=$1} END {if (NR%2==1) print arr[(NR+1)/2]; else print (arr[NR/2]+arr[NR/2+1])/2}')


Source: https://stackoverflow.com/questions/6166375/median-of-column-with-awk
