Calculate mean of each column ignoring missing data with awk

Anonymous (unverified), submitted 2019-12-03 02:33:02

Question:

I have a large tab-separated data table with thousands of rows and dozens of columns and it has missing data marked as "na". For example,

na      0.93    na      0       na      0.51
1       1       na      1       na      1
1       1       na      0.97    na      1
0.92    1       na      1       0.01    0.34

I would like to calculate the mean of each column, but making sure that the missing data are ignored in the calculation. For example, the mean of column 1 should be 0.97. I believe I could use awk but I am not sure how to construct the command to do this for all columns and account for missing data.

All I know how to do is to calculate the mean of a single column but it treats the missing data as 0 rather than leaving it out of the calculation.

awk '{sum+=$1} END {print sum/NR}' filename 

Answer 1:

This is a little obscure, but it works for your example:

awk '{for(i=1; i<=NF; i++){sum[i] += $i; if($i != "na"){count[i]+=1}}} END {for(i=1; i<=NF; i++){if(count[i]!=0){v = sum[i]/count[i]}else{v = 0}; if(i<NF){printf "%f\t",v}else{print v}}}' input.txt 

EDIT: Here is how it works:

awk '{
    for(i=1; i<=NF; i++){          #for each column
        sum[i] += $i;              #add the value to the "sum" array
        if($i != "na"){            #if value is not "na"
            count[i]+=1            #increment the column "count"
        }                          #endif
    }                              #endfor
}
END {                              #at the end
    for(i=1; i<=NF; i++){          #for each column
        if(count[i]!=0){           #if the column count is not 0
            v = sum[i]/count[i]    #then calculate the column mean (here represented with "v")
        }else{                     #else (if column count is 0)
            v = 0                  #then let mean be 0 (note: you can set this to be "na")
        };                         #endif col count is not 0
        if(i<NF){                  #if the column is before the last column
            printf "%f\t",v        #print mean + TAB
        }else{                     #else (if it is the last column)
            print v                #print mean + NEWLINE
        }                          #endif
    }                              #endfor
}' input.txt                       #note: input.txt is the input file
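The same idea can also be written with the column count tracked explicitly, so it does not rely on NF still holding a useful value inside END and it copes with rows of different lengths. This is only a sketch under some assumptions, not the code from the answer: the file name `sample.tsv`, the function name `col_means` and the variable `ncol` are made up for illustration, and the sample file simply reproduces the rows from the question.

```shell
# Recreate the sample table from the question as a tab-separated file.
# printf reuses its format for every group of six arguments, giving four rows.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    na   0.93 na 0    na   0.51 \
    1    1    na 1    na   1 \
    1    1    na 0.97 na   1 \
    0.92 1    na 1    0.01 0.34 > sample.tsv

# col_means FILE: print the mean of every column, ignoring "na" entries.
# (col_means is a hypothetical helper name, not part of awk.)
col_means() {
    awk -F'\t' '
    {
        if (NF > ncol) ncol = NF          # remember the widest row seen so far
        for (i = 1; i <= NF; i++)
            if ($i != "na") {             # leave missing values out of sum and count
                sum[i] += $i
                count[i]++
            }
    }
    END {
        for (i = 1; i <= ncol; i++) {
            v = count[i] ? sum[i] / count[i] : 0   # column with no data: report 0
            printf "%f%s", v, (i < ncol ? "\t" : "\n")
        }
    }' "$1"
}

col_means sample.tsv
```

Every value goes through the same `%f` format here, so the last column is printed with six decimals like the others. For the sample data the first mean comes out as 0.973333 and the third column, which holds only "na", as 0.000000.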


Answer 2:

A possible solution:

awk -F"\t" '{
    for(i=1; i <= NF; i++){
        if($i == $i+0){sum[i]+=$i; denom[i] += 1;}
    }
}
END{
    for(i=1; i<= NF; i++){line=line""sum[i]/(denom[i]?denom[i]:1)FS}
    print line
}' inputFile

The output for the given data:

0.973333    0.9825  0   0.7425  0.01    0.7125 

Note that the third column contains only "na" and the output is 0. If you want the output to be na, then change the END{...}-block to:

END{for(i=1; i<= NF; i++){line=line""(denom[i] ? sum[i]/denom[i]:"na")FS} print line}'
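Put together, the complete command with the "na" variant of the END block might look like the sketch below. It keeps the `$i == $i+0` numeric test and the `sum`/`denom` arrays from this answer, but two things are my own assumptions: the file name `sample.tsv` and the wrapper function `col_means_na` are hypothetical, and the separator is added only between fields so the output line has no trailing tab.

```shell
# Recreate the sample data from the question (hypothetical file name).
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    na   0.93 na 0    na   0.51 \
    1    1    na 1    na   1 \
    1    1    na 0.97 na   1 \
    0.92 1    na 1    0.01 0.34 > sample.tsv

# col_means_na FILE: column means, printing "na" for columns without numbers.
# (col_means_na is a hypothetical helper name.)
col_means_na() {
    awk -F'\t' '
    {
        for (i = 1; i <= NF; i++)
            if ($i == $i + 0) {           # true only when the field looks numeric
                sum[i] += $i
                denom[i]++
            }
    }
    END {
        # Like the answer, this relies on NF keeping the last row width in END.
        for (i = 1; i <= NF; i++) {
            mean = denom[i] ? sum[i] / denom[i] : "na"
            line = line (i > 1 ? FS : "") mean   # separator between fields only
        }
        print line
    }' "$1"
}

col_means_na sample.tsv
```

With the sample data the third field of the output is "na" instead of 0, and the other five means are the same as in the output shown above.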


