How to compute percentiles from frequency table?

Question

I have a CSV file:

fr id
 1 10000152
 1 10000212
 1 10000847
 1 10001018
 2 10001052
 2 10001246
14 10001908
...........

This is a frequency table, where id is an integer variable and fr is the number of occurrences of the given value. The file is sorted in ascending order by value. I would like to compute percentiles (i.e. 90%, 80%, 70% ... 10%) of the variable.

I have done this in pure Python, similar to this pseudocode:

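bucket = sum(fr) / 10.0   # number of observations per decile
percentile = 1
running_total = 0         # renamed from "sum" so the builtin is not shadowed
for current_fr, current_id in zip(fr, id):
    running_total += current_fr
    if running_total > percentile * bucket:
        print("%i percentile: %i" % (percentile * 10, current_id))
        percentile += 1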

But this code is very crude: it doesn't take into account that a percentile may fall between two values from the set, it can't step back, etc.

Is there a more elegant, general, ready-made solution?

Answer 1:

It seems you want the cumulative sum of fr. You can do

cumfr = [sum(fr[:i+1]) for i in range(len(fr))]

Then the cumulative percentage reached at each row is

percentile = [100.0*i/cumfr[-1] for i in cumfr]
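Since the question is tagged numpy, the same idea can be sketched with np.cumsum and np.searchsorted, which avoids the quadratic list comprehension. This is only a sketch under assumptions: the helper name percentiles_from_freq is made up for illustration, the data is just the first seven rows shown in the question (the real file continues), and a percentile is taken here to be the smallest value whose cumulative count reaches the target, with no interpolation. The np.repeat / np.percentile lines at the end expand the table into raw observations to get NumPy's interpolated percentiles as a cross-check; that is only practical when the total count fits in memory.

```python
import numpy as np

def percentiles_from_freq(values, freqs, qs):
    """For each q in qs (0-100), return the smallest value whose
    cumulative frequency reaches q percent of the total count.
    Hypothetical helper; assumes values are sorted ascending."""
    values = np.asarray(values)
    cum = np.cumsum(freqs)                        # running total of occurrences
    targets = np.asarray(qs) / 100.0 * cum[-1]    # counts to reach for each q
    idx = np.searchsorted(cum, targets, side="left")  # first row with cum >= target
    idx = np.minimum(idx, len(cum) - 1)           # guard against float round-off at q=100
    return values[idx]

# Only the first rows of the table from the question (the file continues).
fr = [1, 1, 1, 1, 2, 2, 14]
ids = [10000152, 10000212, 10000847, 10001018, 10001052, 10001246, 10001908]

qs = list(range(10, 100, 10))                     # 10%, 20%, ..., 90%
result = percentiles_from_freq(ids, fr, qs)
for q, v in zip(qs, result):
    print("%d%% percentile: %d" % (q, v))

# Cross-check: expand the table into raw observations and let NumPy
# interpolate between neighbouring values.
expanded = np.repeat(ids, fr)
interpolated = np.percentile(expanded, qs)
```

On these seven rows the total count is 22, so the 10% target is 2.2 observations and is first reached at the third row.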

Source: https://stackoverflow.com/questions/38656609/how-to-compute-percentiles-from-frequency-table

Tags

python

numpy

pandas

statistics
