How to compute the probability of a value given a list of samples from a distribution in Python?

Submitted by 雨燕双飞 on 2019-11-28 17:06:45

Since you don't seem to have a specific distribution in mind, but you might have a lot of data samples, I suggest using a non-parametric density estimation method. One of the data types you describe (time in ms) is clearly continuous, and one method for non-parametric estimation of a probability density function (PDF) for continuous random variables is the histogram that you already mentioned. However, as you will see below, Kernel Density Estimation (KDE) can be better. The second type of data you describe (number of characters in a sequence) is discrete. Here, kernel density estimation can also be useful: it acts as a smoothing technique for situations where you don't have enough samples for every value of the discrete variable.

Estimating Density

The example below shows how to first generate data samples from a mixture of 2 Gaussian distributions and then apply kernel density estimation to find the probability density function:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.neighbors import KernelDensity

# Generate random samples from a mixture of 2 Gaussians
# with modes at 5 and 10
data = np.concatenate((5 + np.random.randn(10, 1),
                       10 + np.random.randn(30, 1)))

# Plot the true distribution
# (norm.pdf replaces mlab.normpdf, which newer matplotlib versions no longer provide)
x = np.linspace(0, 16, 1000)[:, np.newaxis]
norm_vals = norm.pdf(x, 5, 1) * 0.25 + norm.pdf(x, 10, 1) * 0.75
plt.plot(x, norm_vals, color='blue')

# Plot the data using a normalized histogram
# (density=True replaces the removed normed=True argument)
plt.hist(data.ravel(), 50, density=True, color='green')

# Do kernel density estimation
kd = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(data)

# Plot the estimated density
kd_vals = np.exp(kd.score_samples(x))
plt.plot(x, kd_vals, color='red')

# Show the plots
plt.show()

This produces a plot in which the true distribution is shown in blue, the histogram in green, and the PDF estimated using KDE in red.

As you can see, in this situation, the PDF approximated by the histogram is not very useful, while KDE provides a much better estimate. However, with a larger number of data samples and a proper choice of bin size, a histogram might produce a good estimate as well.
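To make the point about sample count and bin size concrete, here is a rough sketch using NumPy only. The seed, the two sample sizes, and the choice of the "fd" (Freedman-Diaconis) binning rule are my own assumptions for illustration; the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed for repeatability


def gaussian_pdf(x, mean):
    """Density of a unit-variance Gaussian centered at `mean`."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)


def true_pdf(x):
    """Density of the 25%/75% mixture of N(5, 1) and N(10, 1)."""
    return 0.25 * gaussian_pdf(x, 5) + 0.75 * gaussian_pdf(x, 10)


def sample_mixture(n):
    """Draw n samples from the same mixture."""
    from_first = rng.random(n) < 0.25
    return np.where(from_first,
                    5 + rng.standard_normal(n),
                    10 + rng.standard_normal(n))


def histogram_error(n):
    """Mean absolute error of a histogram density estimate at the bin centers."""
    heights, edges = np.histogram(sample_mixture(n), bins="fd", density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    return np.mean(np.abs(heights - true_pdf(centers)))


error_small = histogram_error(40)      # same sample size as the example above
error_large = histogram_error(40000)   # far more data
print(error_small, error_large)
```

With many more samples and a data-driven bin width, the histogram heights should track the true density much more closely than they do with 40 samples.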

The parameters you can tune for KDE are the kernel and the bandwidth. You can think of the kernel as the building block for the estimated PDF; several kernel functions are available in scikit-learn: gaussian, tophat, epanechnikov, exponential, linear, cosine. Changing the bandwidth adjusts the bias-variance trade-off. A larger bandwidth increases bias (more smoothing), which is good if you have fewer data samples. A smaller bandwidth increases variance (fewer samples contribute to the estimate at each point), but gives a better estimate when more samples are available.
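To show what "building block" means, here is a hand-written Gaussian KDE in plain NumPy. It is only a sketch of the idea (one kernel per sample, averaged), not scikit-learn's implementation; the function name gaussian_kde_pdf and the sample values are made up for illustration.

```python
import numpy as np


def gaussian_kde_pdf(x, samples, bandwidth):
    """Average of Gaussian kernels of width `bandwidth`, one centered on each sample."""
    x = np.asarray(x, dtype=float)[:, np.newaxis]            # shape (n_points, 1)
    samples = np.asarray(samples, dtype=float)[np.newaxis]   # shape (1, n_samples)
    z = (x - samples) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)


samples = [4.1, 5.0, 5.6, 9.2, 9.8, 10.1, 10.5, 11.3]  # made-up values
grid = np.linspace(-10, 25, 3501)
dx = grid[1] - grid[0]

for bandwidth in (0.2, 0.75, 3.0):
    pdf = gaussian_kde_pdf(grid, samples, bandwidth)
    area = pdf.sum() * dx  # should stay close to 1 for every bandwidth
    print(f"bandwidth={bandwidth}: area={area:.3f}, peak={pdf.max():.3f}")
```

A small bandwidth gives tall, narrow bumps around individual samples (high variance), while a large one flattens everything into a single broad hump (high bias). The area under the curve stays close to 1 either way.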

Calculating Probability

For a PDF, probability is obtained by calculating the integral over a range of values. As you noticed, this gives a probability of 0 for any single specific value, because the range then has zero width.
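A quick way to see this is to shrink an interval around a point and watch the probability shrink with it. The sketch below uses a standard normal distribution purely as a stand-in (an assumption, since your data has no known distribution), with hypothetical helper names.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def interval_probability(center, width):
    """P(center - width/2 <= X <= center + width/2) for a standard normal X."""
    return normal_cdf(center + width / 2) - normal_cdf(center - width / 2)


# The probability scales roughly with the interval width and reaches 0 at width 0
for width in (1.0, 0.1, 0.01, 0.001, 0.0):
    print(width, interval_probability(0.5, width))
```

For narrow intervals the probability is approximately the density at the point times the width, which is why a range always has to be specified.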

scikit-learn does not seem to have a built-in function for calculating probability. However, it is easy to estimate the integral of the PDF over a range. We can do it by evaluating the PDF multiple times within the range and summing the obtained values multiplied by the step size between evaluation points. In the example below, N evaluation points are used, spaced step apart.

# Get probability for range of values
start = 5  # Start of the range
end = 6    # End of the range
N = 100    # Number of evaluation points 
step = (end - start) / (N - 1)  # Step size
x = np.linspace(start, end, N)[:, np.newaxis]  # Generate values in the range
kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
print(probability)

Please note that kd.score_samples returns the log of the estimated density at each point. Therefore, np.exp is needed to obtain the density itself.

The same computation can be performed using SciPy's built-in integration methods, which will give a slightly more accurate result:

from scipy.integrate import quad
# quad passes a scalar, while score_samples expects a 2D array, hence [[v]]
probability = quad(lambda v: np.exp(kd.score_samples([[v]]))[0], start, end)[0]

For instance, for one run, the first method calculated the probability as 0.0859024655305, while the second method produced 0.0850974209996139.
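A likely reason the first number comes out slightly larger: summing N values times step covers N strips of width step, which is one step more than the actual range. Subtracting half a step at each end turns the sum into the trapezoid rule. The sketch below checks this on a standard normal density, which is my stand-in here because its exact integral is known; the variable names are illustrative.

```python
import math
import numpy as np


def normal_pdf(x):
    """Standard normal density."""
    return np.exp(-0.5 * x ** 2) / math.sqrt(2 * math.pi)


start, end, N = 0.0, 1.0, 100
x = np.linspace(start, end, N)
step = (end - start) / (N - 1)
values = normal_pdf(x)

# Plain sum, as in the first method: N strips of width `step`
rectangle_sum = np.sum(values * step)

# Trapezoid rule: the two endpoint values only count for half a strip each
trapezoid_sum = rectangle_sum - step * (values[0] + values[-1]) / 2

# Exact value from the normal CDF, for comparison
exact = 0.5 * (math.erf(end / math.sqrt(2)) - math.erf(start / math.sqrt(2)))

print(rectangle_sum, trapezoid_sum, exact)
```

The same endpoint correction can be applied to kd_vals from the example above if you prefer not to depend on quad.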

OK, I offer this as a starting point, but estimating densities is a very broad topic. For your case involving the number of characters in a sequence, we can model this from a straightforward frequentist perspective using empirical probability. Here, probability is essentially a generalization of the concept of percentage. In our model, the sample space is discrete and consists of all positive integers. You then simply count the occurrences of each value and divide by the total number of events to get your estimate of the probabilities. Anywhere we have zero observations, our estimate of the probability is zero.

>>> samples = [1,1,2,3,2,2,7,8,3,4,1,1,2,6,5,4,8,9,4,3]
>>> from collections import Counter
>>> counts = Counter(samples)
>>> counts
Counter({1: 4, 2: 4, 3: 3, 4: 3, 8: 2, 5: 1, 6: 1, 7: 1, 9: 1})
>>> total = sum(counts.values())
>>> total
20
>>> probability_mass = {k:v/total for k,v in counts.items()}
>>> probability_mass
{1: 0.2, 2: 0.2, 3: 0.15, 4: 0.15, 5: 0.05, 6: 0.05, 7: 0.05, 8: 0.1, 9: 0.05}
>>> probability_mass.get(2,0)
0.2
>>> probability_mass.get(12,0)
0
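If you need the probability of a range of values rather than a single one, the discrete counterpart of integrating a PDF is summing the counts in that range. A minimal sketch, reusing the sample list above; range_probability is a hypothetical helper name:

```python
from collections import Counter

samples = [1, 1, 2, 3, 2, 2, 7, 8, 3, 4, 1, 1, 2, 6, 5, 4, 8, 9, 4, 3]
counts = Counter(samples)
total = sum(counts.values())


def range_probability(low, high):
    """Empirical P(low <= X <= high), with both ends inclusive."""
    hits = sum(count for value, count in counts.items() if low <= value <= high)
    return hits / total


print(range_probability(2, 4))    # 10 of the 20 samples fall in [2, 4] -> 0.5
print(range_probability(10, 12))  # no samples in this range -> 0.0
```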

Now, for your timing data, it is more natural to model this as a continuous distribution. Instead of using a parametric approach where you assume that your data has some distribution and then fit that distribution to your data, you should take a non-parametric approach. One straightforward way is to use a kernel density estimate. You can simply think of this as a way of smoothing a histogram to give you a continuous probability density function. There are several libraries available. Perhaps the most straightforward for univariate data is SciPy's gaussian_kde:

>>> import scipy.stats
>>> kde = scipy.stats.gaussian_kde(samples)
>>> kde.pdf(2)
array([ 0.15086911])

To get the probability of an observation in some interval:

>>> kde.integrate_box_1d(1,2)
0.13855869478828692

Here is one possible solution. You count the number of occurrences of each value in the original list. The future probability for a given value is its past rate of occurrence, which is simply the number of past occurrences divided by the length of the original list. In Python it's very simple.

Here, x is the given list of values:

from collections import Counter
c = Counter(x)

def probability(a):
    # returns the probability of a given number a
    return float(c[a]) / len(x)