I am doing a project using Python where I have two arrays of data. Let's call them pc and pnc. I am required to plot a cumulative distribution of both of them.
Using histograms for this is unnecessarily heavy and imprecise (the binning makes the data fuzzy). Instead, you can simply sort all the values: the index of each value in the sorted array is the number of values smaller than it. This shorter and simpler solution looks like this:
import numpy as np
import matplotlib.pyplot as plt
# Some fake data:
data = np.random.randn(1000)
sorted_data = np.sort(data) # Or data.sort(), if data can be modified
# Cumulative counts:
plt.step(sorted_data, np.arange(sorted_data.size))  # Counts from 0 to len(data)-1: values smaller than each x
plt.step(sorted_data[::-1], np.arange(sorted_data.size))  # Same counts, for values larger than each x
plt.show()
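For the question's two arrays, the same approach puts both cumulative distributions on one graph. Here is a minimal sketch (the random pc and pnc below are hypothetical stand-ins for the actual data):
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the question's arrays:
pc = np.random.randn(1000)
pnc = np.random.randn(800) + 0.5

for arr, label in [(pc, "pc"), (pnc, "pnc")]:
    sorted_arr = np.sort(arr)
    # Cumulative count: number of values smaller than each x
    plt.step(sorted_arr, np.arange(sorted_arr.size), label=label)

plt.legend()
plt.show()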
Furthermore, plt.step() is a more appropriate plot style than plt.plot(), since the data lives at discrete locations.
The resulting plot is more ragged than the output of EnricoGiampieri's answer, but it is the exact cumulative distribution (instead of an approximate, fuzzier binned version of it).
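If a normalized cumulative distribution (a fraction between 0 and 1, rather than a raw count) is preferred, the counts can simply be divided by the number of points. A minimal sketch of that variation:
import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)
sorted_data = np.sort(data)

# Fraction of values <= x, rising from 1/n to 1:
fractions = np.arange(1, sorted_data.size + 1) / sorted_data.size
plt.step(sorted_data, fractions)
plt.show()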
PS: As SebastianRaschka noted, the very last point should ideally show the total count (instead of the total count minus 1). This can be achieved with:
plt.step(np.concatenate([sorted_data, sorted_data[[-1]]]),
         np.arange(sorted_data.size + 1))
plt.step(np.concatenate([sorted_data[::-1], sorted_data[[0]]]),
         np.arange(sorted_data.size + 1))
There are so many points in data that the effect is not visible without zooming in, but the very last point reaching the total count does matter when the data contains only a few points.
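To make the effect visible, here is a quick sketch with only a handful of (arbitrary) points, where the final step to the total count is easy to see:
import numpy as np
import matplotlib.pyplot as plt

small = np.sort(np.array([1.0, 2.0, 4.0, 7.0]))

# Without the repeated last point, the curve tops out at a count of 3:
plt.step(small, np.arange(small.size), label="last point missing")
# With it, the final step rises to the total count of 4:
plt.step(np.concatenate([small, small[[-1]]]),
         np.arange(small.size + 1), label="last point included")
plt.legend()
plt.show()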