How to properly fit a beta distribution in Python?

Submitted by 故事扮演 on 2019-11-29 07:21:46

The problem is that beta.pdf() can return 0 or inf at the endpoints x = 0 and x = 1, depending on the shape parameters. For example:

>>> from scipy.stats import beta
>>> beta.pdf(1,1.05,0.95)
/usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:1165: RuntimeWarning: divide by zero encountered in power
  Px = (1.0-x)**(b-1.0) * x**(a-1.0)
inf
>>> beta.pdf(0,1.05,0.95)
0.0

Your normalization process guarantees that you will have one data sample at 0 and one at 1. Although you "correct" for values at which the pdf is 0, you do not correct for those at which it returns inf. To account for this, you can simply drop every value whose log-pdf is not finite:

import numpy as np
from scipy.stats import beta

def betaNLL(param,*args):
    """
    Negative log likelihood function for beta
    <param>: list for parameters to be fitted.
    <args>: 1-element array containing the sample data.

    Return <nll>: negative log-likelihood to be minimized.
    """

    a, b = param
    data = args[0]
    pdf = beta.pdf(data,a,b,loc=0,scale=1)
    lg = np.log(pdf)
    mask = np.isfinite(lg)
    nll = -lg[mask].sum()
    return nll

Really, though, you shouldn't be normalizing like this, because you are essentially throwing two data points (the minimum and the maximum) out of the fit.
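If you do want to minimize a masked negative log-likelihood yourself, here is a minimal sketch of how that might look end to end. It is not the asker's original code: the name beta_nll, the synthetic Beta(5, 2) sample, the starting point, and the choice of scipy.optimize.minimize with Nelder-Mead are all assumptions made for illustration. It also swaps np.log(beta.pdf(...)) for beta.logpdf, which computes the same quantity directly.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import beta


def beta_nll(params, data):
    """Negative log-likelihood of Beta(a, b), skipping non-finite terms."""
    a, b = params
    if a <= 0 or b <= 0:
        # Keep the optimizer out of the invalid parameter region.
        return np.inf
    with np.errstate(divide="ignore"):
        lg = beta.logpdf(data, a, b)
    # Drop terms that are +inf or -inf (samples sitting exactly on 0 or 1).
    return -lg[np.isfinite(lg)].sum()


# Synthetic sample (assumption: true parameters a=5, b=2 on [0, 1]).
rng = np.random.default_rng(0)
data = beta.rvs(5, 2, size=2000, random_state=rng)

result = minimize(beta_nll, x0=[1.0, 1.0], args=(data,), method="Nelder-Mead")
a_hat, b_hat = result.x
print(a_hat, b_hat)
```

The estimates should land near the parameters used to generate the sample. Note that masking only hides the endpoint samples from the likelihood; it does not put them back into the fit.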

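Without a docstring for beta.fit, it was a little tricky to find, but if you know the lower and upper limits you want to force upon beta.fit, you can use the kwargs floc and fscale. floc fixes the lower limit and fscale fixes the width of the interval (upper limit minus lower limit).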

I ran your code using only the beta.fit method, both with and without the floc and fscale kwargs. I also passed the arguments as ints and as floats to make sure that wouldn't affect the answer. It didn't on this test, though I can't say that it never would.

>>> from scipy.stats import beta
>>> import numpy
>>> def betaNLL(param,*args):
    '''Negative log likelihood function for beta
    <param>: list for parameters to be fitted.
    <args>: 1-element array containing the sample data.

    Return <nll>: negative log-likelihood to be minimized.
    '''

    a,b=param
    data=args[0]
    pdf=beta.pdf(data,a,b,loc=0,scale=1)
    lg=numpy.log(pdf)
    #-----Replace -inf with 0s------
    lg=numpy.where(lg==-numpy.inf,0,lg)
    nll=-1*numpy.sum(lg)
    return nll

>>> data=beta.rvs(5,2,loc=0,scale=1,size=500)
>>> beta.fit(data)
(5.696963536654355, 2.0005252702837009, -0.060443307228404922, 1.0580278414086459)
>>> beta.fit(data,floc=0,fscale=1)
(5.0952451826831462, 1.9546341057106007, 0, 1)
>>> beta.fit(data,floc=0.,fscale=1.)
(5.0952451826831462, 1.9546341057106007, 0.0, 1.0)

In conclusion, fixing the limits with floc and fscale doesn't change your data (through normalization) or throw out data. I just think it should be noted that care should be taken when using it. In your case, you knew the limits were 0 and 1 because you drew the data from a defined distribution on the interval from 0 to 1. In other cases the limits might also be known, but if they are not, beta.fit will estimate them. Here, without the limits of 0 and 1 specified, beta.fit estimated loc=-0.06 and scale=1.058.
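To illustrate the case where the limits are known but are not 0 and 1, here is a small sketch. The limits 2 and 10, the sample size, and the variable names are hypothetical, chosen only for the example. As far as I know, recent SciPy versions raise an error when both floc and fscale are fixed and a sample lies on or outside the fixed limits, so treat that behaviour as something to verify on your version.

```python
import numpy as np
from scipy.stats import beta

# Hypothetical known limits of the data.
lower, upper = 2.0, 10.0

# Synthetic sample: Beta(5, 2) stretched onto [lower, upper].
rng = np.random.default_rng(0)
data = beta.rvs(5, 2, loc=lower, scale=upper - lower, size=2000,
                random_state=rng)

# Fix the location (lower limit) and scale (interval width);
# only the two shape parameters are estimated.
a_hat, b_hat, loc_hat, scale_hat = beta.fit(
    data, floc=lower, fscale=upper - lower
)
print(a_hat, b_hat, loc_hat, scale_hat)
```

The returned loc and scale are simply the fixed values passed in, while a_hat and b_hat should come out close to the shape parameters used to generate the sample.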
