Fitting lognormal distribution using Scipy vs Matlab

匿名 (未验证) 提交于 2019-12-03 02:06:01

问题:

I am trying to fit a lognormal distribution using Scipy. I've already done it using Matlab before but because of the need to extend the application beyond statistical analysis, I am in the process of trying to reproduce the fitted values in Scipy.

Below is the Matlab code I used to fit my data:

% Read input data (one value per line) x = []; fid = fopen(file_path, 'r'); % reading is default action for fopen disp('Reading network degree data...'); if fid == -1     disp('[ERROR] Unable to open data file.') else     while ~feof(fid)         [x] = [x fscanf(fid, '%f', [1])];      end     c = fclose(fid);     if c == 0          disp('File closed successfully.');     else         disp('[ERROR] There was a problem with closing the file.');     end end  [f,xx] = ecdf(x); y = 1-f;  parmhat  = lognfit(x); % MLE estimate mu = parmhat(1); sigma = parmhat(2); 

And here's the fitted plot:

Now here's my Python code with the aim of achieving the same:

import math from scipy import stats from statsmodels.distributions.empirical_distribution import ECDF   # The same input is read as a list in Python ecdf_func = ECDF(degrees) x = ecdf_func.x ccdf = 1-ecdf_func.y  # Fit data shape, loc, scale = stats.lognorm.fit(degrees, floc=0)  # Parameters sigma = shape # standard deviation mu = math.log(scale) # meanlog of the distribution  fit_ccdf = stats.lognorm.sf(x, [sigma], floc=1, scale=scale)  

Here's the fit using the Python code.

As you can see, both sets of code are capable of producing good fits, at least visually speaking.

Problem is that there is a huge difference in the estimated parameters mu and sigma.

From Matlab: mu = 1.62 sigma = 1.29. From Python: mu = 2.78 sigma = 1.74.

Why is there such a difference?

Note: I have double checked that both sets of data fitted are exactly the same. Same number of points, same distribution.

Your help is much appreciated! Thanks in advance.

Other info:

import scipy import numpy import statsmodels  scipy.__version__ '0.9.0'  numpy.__version__ '1.6.1'  statsmodels.__version__ '0.5.0.dev-1bbd4ca' 

Version of Matlab is R2011b.

Edition:

As demonstrated in the answer below, the fault lies with Scipy 0.9. I am able to reproduce the mu and sigma results from Matlab using Scipy 11.0.

An easy way to update your Scipy is:

pip install --upgrade Scipy 

If you don't have pip (you should!):

sudo apt-get install pip 

回答1:

There is a bug in the fit method in scipy 0.9.0 that has been fixed in later versions of scipy.

The output of the script below should be:

Explicit formula:   mu = 4.99203450, sig = 0.81691086 Fit log(x) to norm: mu = 4.99203450, sig = 0.81691086 Fit x to lognorm:   mu = 4.99203468, sig = 0.81691081 

but with scipy 0.9.0, it is

Explicit formula:   mu = 4.99203450, sig = 0.81691086 Fit log(x) to norm: mu = 4.99203450, sig = 0.81691086 Fit x to lognorm:   mu = 4.23197270, sig = 1.11581240 

The following test script shows three ways to get the same results:

import numpy as np from scipy import stats   def lognfit(x, ddof=0):     x = np.asarray(x)     logx = np.log(x)     mu = logx.mean()     sig = logx.std(ddof=ddof)     return mu, sig   # A simple data set for easy reproducibility x = np.array([50., 50, 100, 200, 200, 300, 500])  # Explicit formula my_mu, my_sig = lognfit(x)  # Fit a normal distribution to log(x) norm_mu, norm_sig = stats.norm.fit(np.log(x))  # Fit the lognormal distribution lognorm_sig, _, lognorm_expmu = stats.lognorm.fit(x, floc=0)  print "Explicit formula:   mu = %10.8f, sig = %10.8f" % (my_mu, my_sig) print "Fit log(x) to norm: mu = %10.8f, sig = %10.8f" % (norm_mu, norm_sig) print "Fit x to lognorm:   mu = %10.8f, sig = %10.8f" % (np.log(lognorm_expmu), lognorm_sig) 

With the option ddof=1 in the std. dev. calculation to use the unbiased variance estimation:

In [104]: x Out[104]: array([  50.,   50.,  100.,  200.,  200.,  300.,  500.])  In [105]: lognfit(x, ddof=1) Out[105]: (4.9920345004312647, 0.88236457185021866) 

There is a note in matlab's lognfit documentation that says when censoring is not used, lognfit computes sigma using the square root of the unbiased estimator of the variance. This corresponds to using ddof=1 in the above code.



标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!