Is there a way to use the numpy.percentile function to compute weighted percentile? Or is anyone aware of an alternative python function to compute weighted percentile?
A quick solution, by first sorting and then interpolating:
import numpy as np

def weighted_percentile(data, percents, weights=None):
    '''percents in units of 1%
    weights specifies the frequency (count) of data.
    '''
    if weights is None:
        return np.percentile(data, percents)
    ind = np.argsort(data)
    d = data[ind]
    w = weights[ind]
    # weighted CDF in percent, then interpolate the requested percentiles
    p = 1. * w.cumsum() / w.sum() * 100
    y = np.interp(percents, p, d)
    return y
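For example, a minimal sketch with made-up numbers (the arrays are only illustrative):

import numpy as np

data = np.array([1.0, 2.0, 3.0])
weights = np.array([1.0, 1.0, 2.0])   # the value 3.0 counts twice

# weighted median; with weights=None this falls back to np.percentile
print(weighted_percentile(data, 50, weights))   # -> 2.0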
I don't know exactly what a weighted percentile means, but from @Joan Smith's answer it seems you just need to repeat every element in ar according to its weight, which you can do with numpy.repeat():
import numpy as np
np.repeat([1,2,3], [4,5,6])
the result is:
array([1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3])
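With integer weights you can then (as a rough sketch, variable names are just illustrative) take an ordinary percentile of the repeated array:

import numpy as np

data = np.array([1, 2, 3])
weights = np.array([4, 5, 6])   # must be integer counts

# expand each value according to its count, then take a plain percentile
expanded = np.repeat(data, weights)
print(np.percentile(expanded, 50))   # weighted median of the expanded sample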
As mentioned in the comments, simply repeating values is impossible for float weights and impractical for very large datasets. There is a library that computes weighted percentiles here: http://kochanski.org/gpk/code/speechresearch/gmisclib/gmisclib.weighted_percentile-module.html It worked for me.
Here is my solution:

import numpy as np

def my_weighted_perc(data, perc, weights=None):
    if weights is None:
        return np.nanpercentile(data, perc)
    else:
        # keep only entries where both the value and its weight are valid
        ok = (~np.isnan(data)) & (~np.isnan(weights))
        d = data[ok]
        wei = weights[ok]
        ix = np.argsort(d)
        d = d[ix]
        wei = wei[ix]
        # weighted CDF in percent
        wei_cum = 100. * np.cumsum(wei) / np.sum(wei)
        return np.interp(perc, wei_cum, d)
It simply calculates the weighted CDF of the data and then uses it to estimate the weighted percentiles.
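A quick sanity check on invented data (the numbers are only illustrative; NaN entries are dropped before the weighted CDF is built):

import numpy as np

data = np.array([1.0, 2.0, 3.0, np.nan])
weights = np.array([1.0, 1.0, 2.0, 1.0])

print(my_weighted_perc(data, 50, weights))   # -> 2.0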
Unfortunately, numpy doesn't have built-in weighted functions for everything, but you can always put something together.
def weight_array(ar, weights):
    # repeat each value weights[i] times (weights must be non-negative integers)
    zipped = zip(ar, weights)
    weighted = []
    for a, w in zipped:
        for j in range(w):
            weighted.append(a)
    return weighted
np.percentile(weight_array(ar, weights), 25)
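For instance, with small made-up arrays (names are just illustrative):

import numpy as np

ar = [1, 2, 3]
weights = [4, 5, 6]   # integer counts

# 25th percentile of [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]
print(np.percentile(weight_array(ar, weights), 25))   # -> 1.5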
I use this function for my needs:
import numpy

def quantile_at_values(values, population, weights=None):
    values = numpy.atleast_1d(values).astype(float)
    population = numpy.atleast_1d(population).astype(float)
    # if no weights are given, use equal weights
    if weights is None:
        weights = numpy.ones(population.shape).astype(float)
        normal = float(len(weights))
    # else, check weights
    else:
        weights = numpy.atleast_1d(weights).astype(float)
        assert len(weights) == len(population)
        assert (weights >= 0).all()
        normal = numpy.sum(weights)
        assert normal > 0.
    quantiles = numpy.array([numpy.sum(weights[population <= value]) for value in values]) / normal
    assert (quantiles >= 0).all() and (quantiles <= 1).all()
    return quantiles
Multiply results by 100 if you want percentiles instead of quantiles.
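As a rough illustration (the arrays below are invented), the function returns, for each query value, the fraction of the total weight at or below it:

import numpy

population = numpy.array([1.0, 2.0, 3.0, 4.0])
weights = numpy.array([1.0, 1.0, 1.0, 2.0])

print(quantile_at_values([2.0, 3.5], population, weights))         # -> [0.4 0.6]
print(100 * quantile_at_values([2.0, 3.5], population, weights))   # as percentiles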