Matplotlib: How to convert a histogram to a discrete probability mass function?

Question

I have a question regarding the hist() function with matplotlib.

I am writing code to plot a histogram of data whose values range from 0 to 1. For example:

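import numpy as np
import matplotlib.pyplot as plt

values = [0.21, 0.51, 0.41, 0.21, 0.81, 0.99]

bins = np.arange(0, 1.1, 0.1)
a, b, c = plt.hist(values, bins=bins, normed=0)  # normed is the older name of the density parameter
plt.show()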

The code above generates a correct histogram (I could not post an image since I do not have enough reputation). In terms of frequencies, it looks like:

[0 0 2 0 1 1 0 0 1 1]

I would like to convert this output to a discrete probability mass function, i.e., for the above example, I would like to get the following values:

[ 0.  0.  0.333333333  0.  0.166666667  0.166666667  0.  0.  0.166666667  0.166666667 ]  # each item in the previous array divided by 6

I thought I simply needed to change the parameter in the hist() function to 'normed=1'. However, I get the following histogram frequencies:

[ 0.  0.  3.33333333  0.  1.66666667  1.66666667  0.  0.  1.66666667  1.66666667 ]

This is not what I expected, and I don't know how to get a discrete probability mass function whose sum is 1.0. A similar question was asked in the following link (link to the question), but I do not think it was resolved.

Thanks in advance for your help.

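Answer 1:

The reason is that normed=1 (spelled density=True in current NumPy and Matplotlib) gives the probability density function, not per-bin probabilities. In probability theory, the probability density function of a continuous random variable describes the relative likelihood of the variable taking a given value. The probability of falling in an interval is the area under the density over that interval, so the height of a bar times its bin width gives the probability of that bin. With a bin width of 0.1, the densities you got are ten times the probabilities you expected.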

Let us consider a very simple example.

import numpy as np

x = np.arange(0.1, 1.1, 0.1)
# array([ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

# Bin width 0.1
bins = np.arange(0.05, 1.15, 0.1)
np.histogram(x, bins=bins, density=True)[0]  # density=True is the current spelling of normed=1
# [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]
np.histogram(x, bins=bins)[0] / float(len(x))  # raw counts divided by the number of samples
# [ 0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1]

# Change the bin width to 0.2
bins = np.arange(0.05, 1.15, 0.2)
np.histogram(x, bins=bins, density=True)[0]
# [ 1.,  1.,  1.,  1.,  1.]
np.histogram(x, bins=bins)[0] / float(len(x))
# [ 0.2,  0.2,  0.2,  0.2,  0.2]

As you can see above, the probability that x lies in [0.05, 0.15) or [0.15, 0.25) is 1/10, whereas with a bin width of 0.2 the probability that it lies in [0.05, 0.25) or [0.25, 0.45) is 1/5. The probability values depend on the bin width, while the probability density does not. That is why normalizing to a density is the proper way to do this: with per-bin probabilities, one would need to state the bin width on every plot for the values to be interpretable.
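Applied to the data from the question, the relationship between density and per-bin probability can be checked numerically. This is a minimal sketch, assuming a NumPy version that supports the density keyword; np.linspace is used for the edges here (a substitution for np.arange, to avoid floating-point surprises at the last edge).

```python
import numpy as np

values = np.array([0.21, 0.51, 0.41, 0.21, 0.81, 0.99])
edges = np.linspace(0.0, 1.0, 11)  # ten bins of width 0.1

counts, _ = np.histogram(values, bins=edges)
density, _ = np.histogram(values, bins=edges, density=True)
widths = np.diff(edges)

# density = count / (N * bin width), e.g. 2 / (6 * 0.1) for the third bin
# probability of a bin = density * bin width = count / N
probability = density * widths

print(density)
print(probability)
print(probability.sum())
```

Multiplying each density by its bin width recovers the counts divided by 6, and those products sum to 1.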

So in your case, if you really want to plot the probability of each bin (and not the probability density), you can simply divide the count in each bin by the total number of elements. However, I would advise against this unless you are working with a discrete variable and each of your bins represents a single possible value of that variable.
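One way to get per-bin probabilities directly from plt.hist is to give each sample a weight of 1/N through the weights parameter, so the bar heights sum to 1. The sketch below uses the data from the question; the variable names are illustrative, and np.linspace again stands in for np.arange.

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.array([0.21, 0.51, 0.41, 0.21, 0.81, 0.99])
bins = np.linspace(0.0, 1.0, 11)  # ten bins of width 0.1

# Each sample contributes 1/N to the height of its bin
weights = np.full(len(values), 1.0 / len(values))
heights, edges, patches = plt.hist(values, bins=bins, weights=weights)
plt.ylabel("probability per bin")

print(heights)
print(heights.sum())
```

Call plt.show() afterwards to display the figure. The heights are the same numbers as the raw counts divided by len(values), so this is just a shortcut for the division described above.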

Answer 2:

To plot a continuous probability density function (PDF) from a histogram in Python, refer to this blog post for a detailed explanation (http://howdoudoittheeasiestway.blogspot.com/2017/09/plotting-continuous-probability.html). Otherwise, you can use the code below.

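import numpy as np
import matplotlib.pyplot as plt

# A is your 1-D array of samples; the random data below is only a placeholder
A = np.random.default_rng(0).normal(size=1000)
n, bins, patches = plt.hist(A, 40, histtype='bar')  # raw counts per bin
plt.show()
n = n / len(A)  # fraction of samples in each bin; sums to 1
width = bins[1] - bins[0]
plt.bar(bins[:-1], n, width=width, align='edge')
# Normal curve from the mean and standard deviation of the data (not of the bar heights),
# multiplied by the bin width so that it is on the same scale as the bars
mu = np.mean(A)
sigma = np.std(A)
y1 = width * np.exp(-(bins - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
plt.plot(bins, y1, 'r--', linewidth=2)
plt.show()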

Source: https://stackoverflow.com/questions/11750276/matplotlib-how-to-convert-a-histogram-to-a-discrete-probability-mass-function

Tags

python

matplotlib

probability

histogram