How to find probability distribution and parameters for real data? (Python 3)

Asked by 暖寄归人, 2020-12-02 06:44

I have a dataset from sklearn, and I plotted the distribution of the load_diabetes.target data (i.e. the values of the regression that the load_diabetes.data are used to predict).

4 Answers
  • 2020-12-02 07:00

    To the best of my knowledge, there is no automatic way of obtaining the distribution type and parameters of a sample (as inferring the distribution of a sample is a statistical problem by itself).

    In my opinion, the best you can do is:

    (for each attribute)

    • Try to fit each attribute to a reasonably large list of possible distributions (e.g. see Fitting empirical distribution to theoretical ones with Scipy (Python)? for an example with Scipy)

    • Evaluate all your fits and pick the best one. This can be done by performing a Kolmogorov-Smirnov test between your sample and each fitted distribution (again, there is an implementation in Scipy), and picking the one that minimises D, the test statistic (i.e. the maximum distance between the empirical and fitted CDFs).

    Bonus: Yes, it would make sense, since you are effectively building a model of each variable as you pick a fit for it, although the goodness of your predictions will depend on the quality of your data and on the distributions you use for fitting. You are building a model, after all.
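    As a minimal sketch of those two steps (synthetic data and an illustrative candidate list, not the asker's dataset), fitting a few scipy.stats distributions and keeping the one with the smallest KS statistic D:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)
attribute = rng.normal(size=500)  # one attribute of your dataset (synthetic here)

best_name, best_D = None, np.inf
for name in ["norm", "expon", "uniform"]:
    dist = getattr(st, name)
    params = dist.fit(attribute)                    # step 1: fit the candidate
    D, _ = st.kstest(attribute, name, args=params)  # step 2: KS statistic vs the fit
    if D < best_D:
        best_name, best_D = name, D

print(best_name)  # the candidate with the smallest D
```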

  • 2020-12-02 07:01

    On a similar question (see here) you may be interested in @Michel_Baudin's answer. His code assesses around 40 different distributions available in the OpenTURNS library and chooses the best one according to the BIC criterion. It looks something like this:

    import openturns as ot

    # wrap the data as a one-column OpenTURNS Sample
    sample = ot.Sample([[x] for x in your_data_list])
    # all continuous univariate distribution factories shipped with OpenTURNS
    tested_factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
    # fit each candidate and keep the one with the lowest BIC
    best_model, best_bic = ot.FittingTest.BestModelBIC(sample, tested_factories)
    
  • 2020-12-02 07:07

    You can use this approach:

    import scipy.stats as st
    def get_best_distribution(data):
        dist_names = ["norm", "exponweib", "weibull_max", "weibull_min", "pareto", "genextreme"]
        dist_results = []
        params = {}
        for dist_name in dist_names:
            dist = getattr(st, dist_name)
            param = dist.fit(data)
    
            params[dist_name] = param
            # Applying the Kolmogorov-Smirnov test
            D, p = st.kstest(data, dist_name, args=param)
            print("p value for "+dist_name+" = "+str(p))
            dist_results.append((dist_name, p))
    
        # select the distribution with the highest KS p-value
        best_dist, best_p = max(dist_results, key=lambda item: item[1])
    
        print("Best fitting distribution: "+str(best_dist))
        print("Best p value: "+ str(best_p))
        print("Parameters for the best fit: "+ str(params[best_dist]))
    
        return best_dist, best_p, params[best_dist]
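    Note that the params tuple returned by dist.fit holds any shape parameters first, followed by loc and scale. A minimal sketch of unpacking them (synthetic data; the values are illustrative):

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=3.0, size=2000)

# for "norm" there are no shape parameters: fit returns (loc, scale)
loc, scale = st.norm.fit(data)
print(loc, scale)  # close to 10 and 3

# for distributions with shape parameters, shapes come first
a, loc_g, scale_g = st.gamma.fit(data)          # (shape a, loc, scale)
frozen = st.gamma(a, loc=loc_g, scale=scale_g)  # frozen distribution for reuse
```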
    
  • 2020-12-02 07:08

    You can use this code to fit different distributions to your data by maximum likelihood:

    import matplotlib.pyplot as plt
    import scipy
    import scipy.stats

    # y is your data, e.g. load_diabetes().target
    dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']

    for dist_name in dist_names:
        dist = getattr(scipy.stats, dist_name)
        # param contains the shape parameter(s), followed by loc and scale
        param = dist.fit(y)
        print(dist_name, param)
    

    You can see a sample snippet showing how to use the fitted parameters here: Fitting empirical distribution to theoretical ones with Scipy (Python)?

    Then, you can pick the distribution with the best log-likelihood (there are also other criteria for choosing the "best" distribution, such as Bayesian posterior probability, AIC, BIC or BICc values, ...).
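    A minimal sketch of that log-likelihood comparison (synthetic gamma-distributed data; the candidate list is illustrative):

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(42)
y = rng.gamma(shape=2.0, scale=3.0, size=500)

best_name, best_ll = None, -np.inf
for name in ['gamma', 'norm', 'rayleigh']:
    dist = getattr(scipy.stats, name)
    params = dist.fit(y)                   # maximum-likelihood fit
    ll = dist.logpdf(y, *params).sum()     # log-likelihood of the data under the fit
    if ll > best_ll:
        best_name, best_ll = name, ll

print(best_name)  # the candidate with the highest log-likelihood
```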

    For your bonus question, I don't think there is a generic answer. If your dataset is large enough and was obtained under the same conditions as the real-world data, you can do it.
