Empirical Distribution Function in Numpy

Submitted by 余生颓废 on 2019-12-19 04:55:15

Question


I have the following list of values:

x = [-0.04124324405924407, 0, 0.005249724476788287, 0.03599351958245578, -0.00252785423151014, 0.01007584102031178, -0.002510349639322063,...]

I want to calculate the empirical density function, so I think I need the empirical cumulative distribution function. I've used this code:

counts = np.asarray(np.bincount(x), dtype=float)
cdf = counts.cumsum() / counts.sum()

and then I calculate this value:

print cdf[0.01007584102031178]

and I always get 1 so I guess I made a mistake. Do you know how to fix it? Thanks!


Answer 1:


The usual definition of the empirical cdf is the number of observations less than or equal to the given value, divided by the total number of observations. With a 1-D NumPy array this is x[x <= v].size / x.size (this needs float division; in Python 2, add from __future__ import division):

import numpy as np

x = np.array([-0.04124324405924407,  0,
               0.005249724476788287, 0.03599351958245578,
              -0.00252785423151014,  0.01007584102031178,
              -0.002510349639322063])
v = 0.01007584102031178
# Fraction of observations less than or equal to v
print(x[x <= v].size / x.size)

This prints approximately 0.857142857143; the exact value of the empirical cdf at 0.01007584102031178 is 6 / 7.

This is quite expensive if your array is large and you need to compute the cdf for several values, because every evaluation scans the whole array. In such cases you can keep a sorted copy of your data and use np.searchsorted() to find the number of observations <= v:

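def ecdf(x):
    # Sort once; each later query is then a binary search
    x = np.sort(x)
    def result(v):
        # side='right' gives the number of observations <= v
        return np.searchsorted(x, v, side='right') / x.size
    return result

cdf = ecdf(x)
print(cdf(v))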
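
Since np.searchsorted also accepts an array of values, the same closure can evaluate the cdf at several points in one call. The following is a minimal sketch of that usage, not part of the original answer; the query points in `queries` are made up for illustration.

```python
import numpy as np

x = np.array([-0.04124324405924407, 0,
              0.005249724476788287, 0.03599351958245578,
              -0.00252785423151014, 0.01007584102031178,
              -0.002510349639322063])

def ecdf(x):
    # Sort once so that every later query is a binary search.
    xs = np.sort(x)
    def result(v):
        # side='right' counts observations <= v; v may be a scalar or an array.
        return np.searchsorted(xs, v, side='right') / xs.size
    return result

cdf = ecdf(x)

# Hypothetical query points: below the data, inside it, and above it.
queries = np.array([-1.0, 0.0, 0.01007584102031178, 1.0])
values = cdf(queries)
print(values)

# Cross-check against the direct definition for each query point.
direct = np.array([x[x <= q].size / x.size for q in queries])
```

The values run from 0 below the smallest observation to 1 at or above the largest, and they should match the direct count x[x <= q].size / x.size for every query point.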



Answer 2:


There are two things going wrong here:

np.bincount only makes sense on an array of integers. It creates a histogram of the array values, rounded to an integer. For a more sophisticated histogram, use np.histogram. It works on floats, and you can explicitly specify the number of bins or the bin edges, as well as the normalization.

Additionally, cdf is an ordinary NumPy array in your case. Array indices can only be integers, so your query cdf[0.01007584102031178] is rounded down to cdf[0].

Altogether, your code first counts the integers (all values are rounded to 0), so your normalized cdf is just cdf == [ 1. ]. Then your index gets rounded down, so you query cdf[0], which is 1.
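
As a rough illustration of the np.histogram suggestion, the sketch below bins the float data from the question and builds a binned cumulative distribution from the counts. The choice of 10 bins is an arbitrary assumption for the example, and the result is only a binned approximation of the empirical cdf, not the exact step function.

```python
import numpy as np

x = np.array([-0.04124324405924407, 0,
              0.005249724476788287, 0.03599351958245578,
              -0.00252785423151014, 0.01007584102031178,
              -0.002510349639322063])

# Ten equal-width bins between min(x) and max(x); the bin count is an
# arbitrary choice for illustration.
counts, edges = np.histogram(x, bins=10)

# Cumulative fraction of observations falling in bins 0..i.
cdf_binned = counts.cumsum() / counts.sum()

# density=True rescales the counts so that the histogram integrates to 1.
density, _ = np.histogram(x, bins=10, density=True)
area = (density * np.diff(edges)).sum()

print(edges)
print(cdf_binned)
print(area)
```

Unlike np.bincount, this works directly on floats, and the bins argument can also be given as an explicit array of bin edges.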



Source: https://stackoverflow.com/questions/36353997/empirical-distribution-function-in-numpy
