Is there a way to calculate many histograms along an axis of an nD-array? The method I currently have uses a for loop to iterate over all other axes and calculate a numpy.histogram() for each resulting 1D array:
```python
import numpy
import itertools

data = numpy.random.rand(4, 5, 6)

# axis=-1, place `200001` and `[slice(None)]` on any other position to process along other axes
out = numpy.zeros((4, 5, 200001), dtype="int64")
indices = [
    numpy.arange(4), numpy.arange(5), [slice(None)]
]

# Iterate over all axes, calculate histogram for each cell
for idx in itertools.product(*indices):
    out[idx] = numpy.histogram(
        data[idx],
        bins=2 * 100000 + 1,
        range=(-100000 - 0.5, 100000 + 0.5),
    )[0]

out.shape  # (4, 5, 200001)
```
Needless to say, this is very slow; however, I couldn't find a way to solve this using numpy.histogram, numpy.histogram2d or numpy.histogramdd.
Here's a vectorized approach making use of the efficient tools np.searchsorted and np.bincount. searchsorted gives us the locations where each element is to be placed based on the bins, and bincount does the counting for us.
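To make the idea concrete before the full implementation, here is a minimal sketch of that two-step recipe on a single 1D array (the array `a` and the bin edges are made-up illustrative values, not part of the answer's test data):

```python
import numpy as np

a = np.array([0.1, 0.4, 0.9, 0.4])           # toy 1D data (illustrative values)
bins = np.linspace(0., 1., 5)                 # 4 equal-width bins over [0, 1]
idx = np.searchsorted(bins, a, 'right') - 1   # bin index for each element
counts = np.bincount(idx, minlength=4)        # occurrences per bin
# counts -> array([1, 2, 0, 1])
```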
Implementation -
```python
import numpy as np

def hist_laxis(data, n_bins, range_limits):
    # Setup bins and determine the bin location for each element for the bins
    R = range_limits
    N = data.shape[-1]
    bins = np.linspace(R[0], R[1], n_bins + 1)
    data2D = data.reshape(-1, N)
    idx = np.searchsorted(bins, data2D, 'right') - 1

    # Some elements would be off limits, so get a mask for those
    bad_mask = (idx == -1) | (idx == n_bins)

    # We need to use bincount to get bin based counts. To have unique IDs for
    # each row and not get confused by the ones from other rows, we need to
    # offset each row by a scale (using row length for this).
    scaled_idx = n_bins * np.arange(data2D.shape[0])[:, None] + idx

    # Set the bad ones to be last possible index+1 : n_bins*data2D.shape[0]
    limit = n_bins * data2D.shape[0]
    scaled_idx[bad_mask] = limit

    # Get the counts and reshape to multi-dim
    counts = np.bincount(scaled_idx.ravel(), minlength=limit + 1)[:-1]
    counts.shape = data.shape[:-1] + (n_bins,)
    return counts
```
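The row-offset step is the crux, so here is a tiny sketch of just that trick, using hypothetical toy indices (2 rows, 3 bins) that are not from the answer's test data:

```python
import numpy as np

n_bins = 3
idx = np.array([[0, 2, 2],
                [1, 1, 0]])                    # per-row bin indices (toy values)
scaled = n_bins * np.arange(2)[:, None] + idx  # row 0 -> IDs 0..2, row 1 -> IDs 3..5
counts = np.bincount(scaled.ravel(), minlength=6).reshape(2, 3)
# counts -> [[1, 0, 2],
#            [1, 2, 0]]
```

Each row gets its own disjoint block of IDs, so a single bincount over the raveled array yields all the per-row histograms at once.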
Runtime test
Original approach -
```python
import itertools

def org_app(data, n_bins, range_limits):
    R = range_limits
    m, n = data.shape[:2]
    out = np.zeros((m, n, n_bins), dtype="int64")
    indices = [
        np.arange(m), np.arange(n), [slice(None)]
    ]

    # Iterate over all axes, calculate histogram for each cell
    for idx in itertools.product(*indices):
        out[idx] = np.histogram(
            data[idx],
            bins=n_bins,
            range=(R[0], R[1]),
        )[0]
    return out
```
Timings and verification -
```
In [2]: data = np.random.randn(4, 5, 6)
   ...: out1 = org_app(data, n_bins=200001, range_limits=(-2.5, 2.5))
   ...: out2 = hist_laxis(data, n_bins=200001, range_limits=(-2.5, 2.5))
   ...: print np.allclose(out1, out2)
   ...:
True

In [3]: %timeit org_app(data, n_bins=200001, range_limits=(-2.5, 2.5))
10 loops, best of 3: 39.3 ms per loop

In [4]: %timeit hist_laxis(data, n_bins=200001, range_limits=(-2.5, 2.5))
100 loops, best of 3: 3.17 ms per loop
```
Since, in the loopy solution, we are looping through the first two axes, let's increase their lengths, as that would show us how well the vectorized one scales -
```
In [59]: data = np.random.randn(400, 500, 6)

In [60]: %timeit org_app(data, n_bins=21, range_limits=(-2.5, 2.5))
1 loops, best of 3: 9.59 s per loop

In [61]: %timeit hist_laxis(data, n_bins=21, range_limits=(-2.5, 2.5))
10 loops, best of 3: 44.2 ms per loop

In [62]: 9590/44.2  # Speedup number
Out[62]: 216.9683257918552
```