How to compute the probability of a value given a list of samples from a distribution in Python?

春和景丽 2020-12-12 22:42

Not sure if this belongs in statistics, but I am trying to use Python to achieve this. I essentially just have a list of integers:

data = [300,244,543,1011,3         


        
3 Answers
  •  谎友^ (OP)
    2020-12-12 23:40

    OK, I offer this as a starting point, but estimating densities is a very broad topic. For your case involving the number of characters in a sequence, we can model this from a straightforward frequentist perspective using empirical probability. Here, probability is essentially a generalization of the concept of a percentage. In our model, the sample space is discrete: all positive integers. You then simply count the occurrences of each value and divide by the total number of observations to get your estimate of its probability. For any value with zero observations, the estimated probability is zero.

    >>> samples = [1,1,2,3,2,2,7,8,3,4,1,1,2,6,5,4,8,9,4,3]
    >>> from collections import Counter
    >>> counts = Counter(samples)
    >>> counts
    Counter({1: 4, 2: 4, 3: 3, 4: 3, 8: 2, 5: 1, 6: 1, 7: 1, 9: 1})
    >>> total = sum(counts.values())
    >>> total
    20
    >>> probability_mass = {k:v/total for k,v in counts.items()}
    >>> probability_mass
    {1: 0.2, 2: 0.2, 3: 0.15, 4: 0.15, 5: 0.05, 6: 0.05, 7: 0.05, 8: 0.1, 9: 0.05}
    >>> probability_mass.get(2,0)
    0.2
    >>> probability_mass.get(12,0)
    0
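
The counting steps above can be wrapped in a couple of small helpers. This is only a sketch: the names `empirical_pmf` and `prob_between` are made up for illustration and do not come from any library, and the inclusive treatment of both interval endpoints is an assumption.

```python
from collections import Counter


def empirical_pmf(samples):
    """Map each observed value to its relative frequency (hypothetical helper)."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    total = len(samples)
    return {value: count / total for value, count in counts.items()}


def prob_between(pmf, low, high):
    """P(low <= X <= high) under the empirical pmf; both ends inclusive (assumption)."""
    return sum(p for value, p in pmf.items() if low <= value <= high)


samples = [1, 1, 2, 3, 2, 2, 7, 8, 3, 4, 1, 1, 2, 6, 5, 4, 8, 9, 4, 3]
pmf = empirical_pmf(samples)
print(pmf.get(2, 0))            # relative frequency of the value 2
print(pmf.get(12, 0))           # never observed, so the estimate is 0
print(prob_between(pmf, 1, 3))  # total mass on the values 1, 2 and 3
```

Using `.get(value, 0)` rather than indexing keeps the "zero observations means zero probability" convention without raising a `KeyError`.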
    

    Now, for your timing data, it is more natural to model this as a continuous distribution. Instead of using a parametric approach, where you assume that your data follow some distribution and then fit that distribution to your data, you should take a non-parametric approach. One straightforward way is to use a kernel density estimate, which you can think of as smoothing a histogram to give a continuous probability density function. There are several libraries available. Perhaps the most straightforward for univariate data is scipy's:

    >>> import scipy.stats
    >>> kde = scipy.stats.gaussian_kde(samples)
    >>> kde.pdf(2)
    array([ 0.15086911])
    

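    Note that kde.pdf(2) is a density, not a probability. To get the probability of an observation falling in some interval, integrate the density over that interval: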

    >>> kde.integrate_box_1d(1,2)
    0.13855869478828692
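
To make the "smoothed histogram" idea concrete, here is a rough standard-library sketch of a one-dimensional Gaussian kernel density estimate. The helper name `gaussian_kde_1d` is hypothetical, and the bandwidth uses Scott's rule (sample standard deviation times n^(-1/5)), which I believe matches scipy's default, so on the samples above the numbers should land close to the scipy results, though that is not guaranteed.

```python
import math
import statistics


def gaussian_kde_1d(samples):
    """Return (pdf, prob_between) for a 1-D Gaussian KDE (hypothetical helper)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    # Assumption: Scott's rule bandwidth, using the n-1 sample standard deviation.
    h = statistics.stdev(samples) * n ** (-1 / 5)
    if h == 0:
        raise ValueError("samples must not all be identical")

    def pdf(x):
        # Density at x: the average of normal densities centred on each sample.
        total = sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)
        return total / (n * h * math.sqrt(2 * math.pi))

    def normal_cdf(z):
        # Standard normal CDF expressed through the error function.
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))

    def prob_between(low, high):
        # P(low <= X <= high): the average of each kernel's mass in the interval.
        return sum(
            normal_cdf((high - s) / h) - normal_cdf((low - s) / h) for s in samples
        ) / n

    return pdf, prob_between


samples = [1, 1, 2, 3, 2, 2, 7, 8, 3, 4, 1, 1, 2, 6, 5, 4, 8, 9, 4, 3]
pdf, prob_between = gaussian_kde_1d(samples)
print(pdf(2))              # density at 2, not a probability
print(prob_between(1, 2))  # probability mass on the interval [1, 2]
```

Because each Gaussian kernel has a closed-form CDF, the interval probability needs no numerical integration; it is just the mean of the per-kernel CDF differences.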
    
