Question
I have a CSV file:
fr id
1 10000152
1 10000212
1 10000847
1 10001018
2 10001052
2 10001246
14 10001908
...........
This is a frequency table, where id is an integer variable and fr is the number of occurrences of the given value. The file is sorted in ascending order by value.
I would like to compute percentiles (i.e. 90%, 80%, 70% ... 10%) of the variable.
I have done this in pure Python, similar to this pseudocode:
bucket=sum(fr)/10.0
percentile=1
sum=0
for (current_fr, current_id) in zip(fr,id):
    sum=sum+current_fr
    if (sum > percentile*bucket):
        print "%i percentile: %i" % (percentile*10,current_id)
        percentile=percentile+1
But this code is very crude: it does not account for a percentile falling between two values from the set, it cannot step back, and so on.
Is there a more elegant, universal, ready-made solution?
Answer 1:
It seems like you want the cumulative sum of fr. You can do:
cumfr = [sum(fr[:i+1]) for i in range(len(fr))]
Then the cumulative percentage reached at each value is:
percentile = [100.0*i/cumfr[-1] for i in cumfr]
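To go from those cumulative counts back to the requested percentiles (10%, 20%, ...), one option is to search the running total for the first value whose cumulative count reaches q% of the total. The following is only a sketch under some assumptions: it uses the nearest-rank definition (no interpolation between neighbouring values), the function name percentiles_from_freq is hypothetical, and the sample data is just the seven rows visible in the question, not the full file. It also replaces the quadratic list comprehension with itertools.accumulate, which builds the running total in one pass.

```python
from bisect import bisect_left
from itertools import accumulate


def percentiles_from_freq(fr, ids, qs=(10, 20, 30, 40, 50, 60, 70, 80, 90)):
    """Return {q: value} using the nearest-rank percentile definition.

    fr  -- occurrence counts, one per value
    ids -- the values, sorted ascending, in the same order as fr
    qs  -- percentiles to compute, in percent
    """
    cumfr = list(accumulate(fr))  # running total of counts, one pass
    total = cumfr[-1]
    result = {}
    for q in qs:
        # Number of observations that must be covered to reach q percent.
        target = q * total / 100
        # First position whose cumulative count is >= target.
        idx = bisect_left(cumfr, target)
        result[q] = ids[idx]
    return result


# Only the rows visible in the question (the real file is longer).
fr = [1, 1, 1, 1, 2, 2, 14]
ids = [10000152, 10000212, 10000847, 10001018,
       10001052, 10001246, 10001908]

for q, value in sorted(percentiles_from_freq(fr, ids).items()):
    print("%i percentile: %i" % (q, value))
```

Because each lookup is an independent binary search, the percentiles can be requested in any order, which avoids the "can't step back" limitation of the single-pass loop. If a percentile should fall between two values, an interpolating definition would be needed instead of nearest-rank; this sketch does not attempt that.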
Source: https://stackoverflow.com/questions/38656609/how-to-compute-percentiles-from-frequency-table