I need to make a plot that functions like a density plot for high-density regions on the plot, but below some threshold uses individual points. I couldn\'t find any existing
This should do it:
import matplotlib.pyplot as plt, numpy as np, numpy.random, scipy
#histogram definition
xyrange = [[-5,5],[-5,5]] # data range
bins = [100,100] # number of bins
thresh = 3 #density threshold
#data definition
N = 1e5;
xdat, ydat = np.random.normal(size=N), np.random.normal(1, 0.6, size=N)
# histogram the data
hh, locx, locy = scipy.histogram2d(xdat, ydat, range=xyrange, bins=bins)
posx = np.digitize(xdat, locx)
posy = np.digitize(ydat, locy)
#select points within the histogram
ind = (posx > 0) & (posx <= bins[0]) & (posy > 0) & (posy <= bins[1])
hhsub = hh[posx[ind] - 1, posy[ind] - 1] # values of the histogram where the points are
xdat1 = xdat[ind][hhsub < thresh] # low density points
ydat1 = ydat[ind][hhsub < thresh]
hh[hh < thresh] = np.nan # fill the areas with low density by NaNs
plt.imshow(np.flipud(hh.T),cmap='jet',extent=np.array(xyrange).flatten(), interpolation='none', origin='upper')
plt.colorbar()
plt.plot(xdat1, ydat1, '.',color='darkblue')
plt.show()
After a night to sleep on it and reading through Oz123's suggestions, I figured it out. The trick is to compute which bin each x,y point falls into (xi,yi), then test if H[xi,yi] (actually, in my case H[yi,xi]) is beneath the threshold. The code is below, and runs very fast for large numbers of points and is much cleaner:
import numpy as np
import math
import matplotlib as mpl
import matplotlib.pyplot as plt
import pylab
import numpy.random
#Create the colormap:
halfpurples = {'blue': [(0.0,1.0,1.0),(0.000001, 0.78431373834609985, 0.78431373834609985),
0.25, 0.729411780834198, 0.729411780834198), (0.5,
0.63921570777893066, 0.63921570777893066), (0.75,
0.56078433990478516, 0.56078433990478516), (1.0, 0.49019607901573181,
0.49019607901573181)],
'green': [(0.0,1.0,1.0),(0.000001,
0.60392159223556519, 0.60392159223556519), (0.25,
0.49019607901573181, 0.49019607901573181), (0.5,
0.31764706969261169, 0.31764706969261169), (0.75,
0.15294118225574493, 0.15294118225574493), (1.0, 0.0, 0.0)],
'red': [(0.0,1.0,1.0),(0.000001,
0.61960786581039429, 0.61960786581039429), (0.25,
0.50196081399917603, 0.50196081399917603), (0.5,
0.41568627953529358, 0.41568627953529358), (0.75,
0.32941177487373352, 0.32941177487373352), (1.0,
0.24705882370471954, 0.24705882370471954)]}
halfpurplecmap = mpl.colors.LinearSegmentedColormap('halfpurples',halfpurples,256)
#Create x,y arrays of normally distributed points
npts = 100000
x = numpy.random.standard_normal(npts)
y = numpy.random.standard_normal(npts)
#Set bin numbers in both axes
nxbins = 100
nybins = 100
#Set the cutoff for resolving the individual points
minperbin = 1
#Make the density histrogram
H, yedges, xedges = np.histogram2d(y,x,bins=(nybins,nxbins))
#Reorient the axes
H = H[::-1]
extent = [xedges[0],xedges[-1],yedges[0],yedges[-1]]
#Figure out which bin each x,y point is in
xbinsize = xedges[1]-xedges[0]
ybinsize = yedges[1]-yedges[0]
xi = ((x-xedges[0])/xbinsize).astype(np.integer)
yi = nybins-1-((y-yedges[0])/ybinsize).astype(np.integer)
#Subtract one from any points exactly on the right and upper edges of the region
xim1 = xi-1
yim1 = yi-1
xi = np.where(xi < nxbins,xi,xim1)
yi = np.where(yi < nybins,yi,yim1)
#Get all points with density below the threshold
lowdensityx = x[H[yi,xi] <= minperbin]
lowdensityy = y[H[yi,xi] <= minperbin]
#Plot
fig1 = plt.figure()
ax1 = fig1.add_subplot(111)
ax1.plot(lowdensityx,lowdensityy,linestyle='.',marker='o',mfc='k',mec='k',ms=3)
cp1 = ax1.imshow(H,interpolation='nearest',extent=extent,cmap=halfpurplecmap,vmin=minperbin)
fig1.colorbar(cp1)
fig1.savefig('contourtest.eps')
For the record, here is the result of a new attempt using scipy.stats.gaussian_kde rather than a 2D histogram. One could envision different combinations of color meshing and contouring depending on the purpose.
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde
# parameters
npts = 5000 # number of sample points
bins = 100 # number of bins in density maps
threshold = 0.01 # density threshold for scatter plot
# initialize figure
fig, ax = plt.subplots()
# create a random dataset
x1, y1 = np.random.multivariate_normal([0, 0], [[1, 0], [0, 1]], npts/2).T
x2, y2 = np.random.multivariate_normal([4, 4], [[4, 0], [0, 1]], npts/2).T
x = np.hstack((x1, x2))
y = np.hstack((y1, y2))
points = np.vstack([x, y])
# perform kernel density estimate
kde = gaussian_kde(points)
z = kde(points)
# mask points above density threshold
x = np.ma.masked_where(z > threshold, x)
y = np.ma.masked_where(z > threshold, y)
# plot unmasked points
ax.scatter(x, y, c='black', marker='.')
# get bounds from axes
xmin, xmax = ax.get_xlim()
ymin, ymax = ax.get_ylim()
# prepare grid for density map
xedges = np.linspace(xmin, xmax, bins)
yedges = np.linspace(ymin, ymax, bins)
xx, yy = np.meshgrid(xedges, yedges)
gridpoints = np.array([xx.ravel(), yy.ravel()])
# compute density map
zz = np.reshape(kde(gridpoints), xx.shape)
# plot density map
im = ax.imshow(zz, cmap='CMRmap_r', interpolation='nearest',
origin='lower', extent=[xmin, xmax, ymin, ymax])
# plot threshold contour
cs = ax.contour(xx, yy, zz, levels=[threshold], colors='black')
# show
fig.colorbar(im)
plt.show()
Your problem is quadratic - for npts = 1000, you have array size reaching 10^6 points, and than you iterate over these lists with list comprehensions.
Now, this is a matter of taste of course, but I find that list comprehension can yield a totally code which is hard to follow, and they are only slightly faster sometimes ... but that's not my point.
My point is that for large array operations you have numpy functions like:
np.where, np.choose etc.
See that you can achieve that functionality of the list comprehensions with NumPy, and your code should run faster.
Do I understand correctly, your comment ?
#Find all points that lie in these regions
are you testing for a point inside a polygon ? if so, consider point in polygon inside matplotlib.