Is there a way to calculate many histograms along an axis of an nD-array? The method I currently have uses a for loop to iterate over all other axes and calculate a numpy.histogram() for each resulting 1D array:
```python
import numpy
import itertools

data = numpy.random.rand(4, 5, 6)

# axis=-1, place `200001` and `[slice(None)]` on any other position to process along other axes
out = numpy.zeros((4, 5, 200001), dtype="int64")
indices = [
    numpy.arange(4), numpy.arange(5), [slice(None)]
]

# Iterate over all axes, calculate histogram for each cell
for idx in itertools.product(*indices):
    out[idx] = numpy.histogram(
        data[idx],
        bins=2 * 100000 + 1,
        range=(-100000 - 0.5, 100000 + 0.5),
    )[0]

out.shape  # (4, 5, 200001)
```
Needless to say, this is very slow; however, I couldn't find a way to solve this using numpy.histogram, numpy.histogram2d or numpy.histogramdd.
Here's a vectorized approach making use of the efficient tools np.searchsorted and np.bincount. searchsorted gives us the locations where each element is to be placed based on the bins, and bincount does the counting for us.
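To make the idea concrete before the full implementation, here is a minimal sketch of that two-step recipe on a single 1D array (the array `a` and the bin edges are made-up illustrative values, not part of the answer's test data):

```python
import numpy as np

a = np.array([0.1, 0.4, 0.9, 0.4])           # toy 1D data (illustrative values)
bins = np.linspace(0., 1., 5)                 # 4 equal-width bins over [0, 1]
idx = np.searchsorted(bins, a, 'right') - 1   # bin index for each element
counts = np.bincount(idx, minlength=4)        # occurrences per bin
# counts -> array([1, 2, 0, 1])
```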
Implementation -
```python
import numpy as np

def hist_laxis(data, n_bins, range_limits):
    # Setup bins and determine the bin location for each element for the bins
    R = range_limits
    N = data.shape[-1]
    bins = np.linspace(R[0], R[1], n_bins + 1)
    data2D = data.reshape(-1, N)
    idx = np.searchsorted(bins, data2D, 'right') - 1

    # Some elements would be off limits, so get a mask for those
    bad_mask = (idx == -1) | (idx == n_bins)

    # We need to use bincount to get bin based counts. To have unique IDs for
    # each row and not get confused by the ones from other rows, we need to
    # offset each row by a scale (using row length for this).
    scaled_idx = n_bins * np.arange(data2D.shape[0])[:, None] + idx

    # Set the bad ones to be last possible index+1 : n_bins*data2D.shape[0]
    limit = n_bins * data2D.shape[0]
    scaled_idx[bad_mask] = limit

    # Get the counts and reshape to multi-dim
    counts = np.bincount(scaled_idx.ravel(), minlength=limit + 1)[:-1]
    counts.shape = data.shape[:-1] + (n_bins,)
    return counts
```
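The row-offset step is the crux, so here is a tiny sketch of just that trick, using hypothetical toy indices (2 rows, 3 bins) that are not from the answer's test data:

```python
import numpy as np

n_bins = 3
idx = np.array([[0, 2, 2],
                [1, 1, 0]])                    # per-row bin indices (toy values)
scaled = n_bins * np.arange(2)[:, None] + idx  # row 0 -> IDs 0..2, row 1 -> IDs 3..5
counts = np.bincount(scaled.ravel(), minlength=6).reshape(2, 3)
# counts -> [[1, 0, 2],
#            [1, 2, 0]]
```

Each row gets its own disjoint block of IDs, so a single bincount over the raveled array yields all the per-row histograms at once.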
Runtime test
Original approach -
```python
import itertools

def org_app(data, n_bins, range_limits):
    R = range_limits
    m, n = data.shape[:2]
    out = np.zeros((m, n, n_bins), dtype="int64")
    indices = [
        np.arange(m), np.arange(n), [slice(None)]
    ]

    # Iterate over all axes, calculate histogram for each cell
    for idx in itertools.product(*indices):
        out[idx] = np.histogram(
            data[idx],
            bins=n_bins,
            range=(R[0], R[1]),
        )[0]
    return out
```
Timings and verification -
```
In [2]: data = np.random.randn(4, 5, 6)
   ...: out1 = org_app(data, n_bins=200001, range_limits=(-2.5, 2.5))
   ...: out2 = hist_laxis(data, n_bins=200001, range_limits=(-2.5, 2.5))
   ...: print np.allclose(out1, out2)
   ...:
True

In [3]: %timeit org_app(data, n_bins=200001, range_limits=(-2.5, 2.5))
10 loops, best of 3: 39.3 ms per loop

In [4]: %timeit hist_laxis(data, n_bins=200001, range_limits=(-2.5, 2.5))
100 loops, best of 3: 3.17 ms per loop
```
Since, in the loopy solution, we are looping through the first two axes, let's increase their lengths, as that would show us how well the vectorized one scales -
```
In [59]: data = np.random.randn(400, 500, 6)

In [60]: %timeit org_app(data, n_bins=21, range_limits=(-2.5, 2.5))
1 loops, best of 3: 9.59 s per loop

In [61]: %timeit hist_laxis(data, n_bins=21, range_limits=(-2.5, 2.5))
10 loops, best of 3: 44.2 ms per loop

In [62]: 9590/44.2  # Speedup number
Out[62]: 216.9683257918552
```