Empirical Distribution Function in Numpy

Question

I have the following list of values:

x = [-0.04124324405924407, 0, 0.005249724476788287, 0.03599351958245578, -0.00252785423151014, 0.01007584102031178, -0.002510349639322063,...]

I want to calculate the empirical density function, so I think I first need the empirical cumulative distribution function. I used this code:

counts = np.asarray(np.bincount(x), dtype=float)
cdf = counts.cumsum() / counts.sum()

Then I look up the cdf at one of the values:

print cdf[0.01007584102031178]

I always get 1, so I guess I made a mistake. Do you know how to fix it? Thanks!

Answer 1:

The usual definition of the empirical cdf at a value v is the number of observations less than or equal to v, divided by the total number of observations. With a 1-D NumPy array x this is x[x <= v].size / x.size (this needs float division; in Python 2, add from __future__ import division):

import numpy as np

x = np.array([-0.04124324405924407,  0,
               0.005249724476788287, 0.03599351958245578,
              -0.00252785423151014,  0.01007584102031178,
              -0.002510349639322063])
v = 0.01007584102031178
print(x[x <= v].size / x.size)

This prints approximately 0.857142857143; the exact value of the empirical cdf at 0.01007584102031178 is 6/7.
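As a cross-check of the definition, the same fraction can be obtained by averaging the boolean mask directly. This is only a sketch of an equivalent formulation, not part of the original answer; the variable names frac_mean and frac_count are made up for illustration.

```python
import numpy as np

x = np.array([-0.04124324405924407, 0,
              0.005249724476788287, 0.03599351958245578,
              -0.00252785423151014, 0.01007584102031178,
              -0.002510349639322063])
v = 0.01007584102031178

# The mask x <= v is True for every observation at or below v.
# Its mean is the fraction of True entries, i.e. the empirical cdf at v.
frac_mean = np.mean(x <= v)

# The same quantity via an explicit count, to keep the definition visible.
frac_count = np.count_nonzero(x <= v) / x.size

print(frac_mean, frac_count)
```

Six of the seven observations are at or below v, so both expressions evaluate to 6/7.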

Each evaluation scans the whole array, which is quite expensive if your array is large and you need the cdf at several values. In such cases you can keep a sorted copy of your data and use np.searchsorted() to find the number of observations <= v:

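def ecdf(x):
    # Sort once; every later query is then a binary search.
    x = np.sort(x)
    def result(v):
        # side='right' counts the observations <= v (ties included).
        return np.searchsorted(x, v, side='right') / x.size
    return result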

cdf = ecdf(x)
print(cdf(v))
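Because np.searchsorted() also accepts an array of query values, the function returned by ecdf can be evaluated at many points in one call. The snippet below is a minimal sketch assuming Python 3 division; the query points in queries are hypothetical values chosen for illustration.

```python
import numpy as np

def ecdf(x):
    # Sort once; every later query is then a binary search.
    x = np.sort(x)
    def result(v):
        # side='right' counts the observations <= v (ties included).
        return np.searchsorted(x, v, side='right') / x.size
    return result

x = np.array([-0.04124324405924407, 0,
              0.005249724476788287, 0.03599351958245578,
              -0.00252785423151014, 0.01007584102031178,
              -0.002510349639322063])
cdf = ecdf(x)

# Hypothetical query points: below the data, inside it, and above it.
queries = np.array([-1.0, 0.0, 0.01007584102031178, 1.0])
values = cdf(queries)
print(values)
```

The results are 0, 4/7, 6/7 and 1: nothing lies at or below -1.0, four observations are at or below 0, six are at or below 0.01007584102031178, and all seven are at or below 1.0.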

Answer 2:

There are two things going wrong here:

np.bincount only makes sense on an array of integers. It creates a histogram of the array values, rounded to an integer. For a more sophisticated histogram, use np.histogram. It works on floats, and you can explicitly specify the number of bins or the bin edges, as well as the normalization.

Additionally, cdf denotes a normal numpy array in your case. The array indices can only be integers, so your query cdf[0.01007584102031178] is rounded down to cdf[0].

So in total, your code first counts the integers (all values are rounded to 0), so your normalized cdf is just cdf == [ 1. ]. Then your index gets rounded down, so you query cdf[0], which is 1.
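The np.histogram suggestion above could be turned into a binned approximation of the cdf roughly as follows. This is only a sketch: bins=10 is an arbitrary choice for illustration, and the result is coarser than the exact empirical cdf because observations are grouped into bins first.

```python
import numpy as np

x = np.array([-0.04124324405924407, 0,
              0.005249724476788287, 0.03599351958245578,
              -0.00252785423151014, 0.01007584102031178,
              -0.002510349639322063])

# Histogram over 10 equal-width bins spanning the data range
# (the bin count is an illustrative assumption, not a recommendation).
counts, edges = np.histogram(x, bins=10)

# binned_cdf[i] is the fraction of observations falling in bins 0..i,
# i.e. below the right edge edges[i + 1] (the last bin also includes
# its right edge).
binned_cdf = np.cumsum(counts) / counts.sum()

for right_edge, frac in zip(edges[1:], binned_cdf):
    print(right_edge, frac)
```

The last entry of binned_cdf is always 1, since every observation falls into some bin.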

Source: https://stackoverflow.com/questions/36353997/empirical-distribution-function-in-numpy

Tags

python

statistics