Calculate histograms along axis

Anonymous (unverified), submitted on 2019-12-03 08:57:35

Question:

Is there a way to calculate many histograms along an axis of an nD-array? The method I currently have uses a for loop to iterate over all other axes and calculate a numpy.histogram() for each resulting 1D array:

    import itertools

    import numpy

    data = numpy.random.rand(4, 5, 6)

    # axis=-1; place `200001` and `[slice(None)]` at any other position
    # to process along other axes
    out = numpy.zeros((4, 5, 200001), dtype="int64")
    indices = [
        numpy.arange(4), numpy.arange(5), [slice(None)]
    ]

    # Iterate over all other axes, calculate a histogram for each 1D slice
    for idx in itertools.product(*indices):
        out[idx] = numpy.histogram(
            data[idx],
            bins=2 * 100000 + 1,
            range=(-100000 - 0.5, 100000 + 0.5),
        )[0]

    out.shape  # (4, 5, 200001)

Needless to say, this is very slow, but I couldn't find a way to solve this using numpy.histogram, numpy.histogram2d or numpy.histogramdd.

Answer 1:

Here's a vectorized approach that makes use of two efficient tools, np.searchsorted and np.bincount. searchsorted gives us the bin index in which each element falls, based on the bin edges, and bincount does the counting for us.
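As a minimal illustration of those two steps on a tiny 2D array (the sample values, the three-bin range and the variable names are made up for this sketch and are not part of the answer):

```python
import numpy as np

# Two rows of data, three bins spanning the range 0 to 3
data2d = np.array([[0.5, 1.5, 1.7, 2.2],
                   [0.1, 0.2, 2.9, 5.0]])   # 5.0 lies outside the range
n_bins = 3
edges = np.linspace(0.0, 3.0, n_bins + 1)    # [0., 1., 2., 3.]

# Bin index of every element: -1 means below range, n_bins means above
idx = np.searchsorted(edges, data2d, side="right") - 1

# Offset each row by n_bins so every row gets its own block of IDs
n_rows = data2d.shape[0]
flat_ids = idx + n_bins * np.arange(n_rows)[:, None]

# Send out-of-range elements to one extra slot that is dropped afterwards
overflow = n_bins * n_rows
flat_ids[(idx < 0) | (idx >= n_bins)] = overflow

counts = np.bincount(flat_ids.ravel(), minlength=overflow + 1)[:-1]
counts = counts.reshape(n_rows, n_bins)
print(counts)
```

Each row of `counts` holds the histogram of the corresponding row of `data2d`; the out-of-range value 5.0 is not counted anywhere.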

Implementation -

    import numpy as np

    def hist_laxis(data, n_bins, range_limits):
        # Set up the bin edges and determine the bin index of each element
        R = range_limits
        N = data.shape[-1]
        bins = np.linspace(R[0], R[1], n_bins + 1)
        data2D = data.reshape(-1, N)
        idx = np.searchsorted(bins, data2D, 'right') - 1

        # Some elements would be off limits, so get a mask for those
        bad_mask = (idx == -1) | (idx == n_bins)

        # We need to use bincount to get bin based counts. To have unique IDs
        # for each row and not get confused by the ones from other rows, we
        # need to offset each row by a scale (using the number of bins).
        scaled_idx = n_bins * np.arange(data2D.shape[0])[:, None] + idx

        # Set the bad ones to be last possible index+1 : n_bins*data2D.shape[0]
        limit = n_bins * data2D.shape[0]
        scaled_idx[bad_mask] = limit

        # Get the counts and reshape to multi-dim
        counts = np.bincount(scaled_idx.ravel(), minlength=limit + 1)[:-1]
        return counts.reshape(data.shape[:-1] + (n_bins,))
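The question also asks about axes other than the last one. One possible way to cover that, sketched here under the assumption that np.moveaxis is acceptable, is to move the requested axis to the end, apply the same searchsorted/bincount steps, and move the bin axis back. The function name `hist_along_axis` and its `axis` parameter are hypothetical and do not appear in the answer.

```python
import numpy as np

def hist_along_axis(data, n_bins, range_limits, axis=-1):
    """Hypothetical helper: histogram counts of every 1D slice along `axis`."""
    data = np.asarray(data)
    # Bring the target axis to the end and flatten all other axes into rows
    moved = np.moveaxis(data, axis, -1)
    lead_shape = moved.shape[:-1]
    flat = moved.reshape(-1, moved.shape[-1])

    edges = np.linspace(range_limits[0], range_limits[1], n_bins + 1)
    idx = np.searchsorted(edges, flat, side="right") - 1
    bad = (idx < 0) | (idx >= n_bins)

    # Give every row its own block of n_bins IDs; bad entries share one
    # extra slot at the end, which is dropped after counting
    overflow = n_bins * flat.shape[0]
    ids = idx + n_bins * np.arange(flat.shape[0])[:, None]
    ids[bad] = overflow

    counts = np.bincount(ids.ravel(), minlength=overflow + 1)[:-1]
    counts = counts.reshape(lead_shape + (n_bins,))
    # Put the bin axis where the histogrammed axis used to be
    return np.moveaxis(counts, -1, axis)

x = np.random.default_rng(0).uniform(-2, 2, size=(3, 7, 4))
print(hist_along_axis(x, 5, (-2.5, 2.5), axis=1).shape)
```

With `axis=1` the result has the bin counts in position 1, so the input of shape (3, 7, 4) gives an output of shape (3, 5, 4).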

Runtime test

Original approach -

    import itertools

    import numpy as np

    def org_app(data, n_bins, range_limits):
        R = range_limits
        m, n = data.shape[:2]
        out = np.zeros((m, n, n_bins), dtype="int64")
        indices = [
            np.arange(m), np.arange(n), [slice(None)]
        ]

        # Iterate over the first two axes, calculate a histogram for each cell
        for idx in itertools.product(*indices):
            out[idx] = np.histogram(
                data[idx],
                bins=n_bins,
                range=(R[0], R[1]),
            )[0]
        return out

Timings and verification -

    In [2]: data = np.random.randn(4, 5, 6)
       ...: out1 = org_app(data, n_bins=200001, range_limits=(-2.5, 2.5))
       ...: out2 = hist_laxis(data, n_bins=200001, range_limits=(-2.5, 2.5))
       ...: print(np.allclose(out1, out2))
       ...:
    True

    In [3]: %timeit org_app(data, n_bins=200001, range_limits=(-2.5, 2.5))
    10 loops, best of 3: 39.3 ms per loop

    In [4]: %timeit hist_laxis(data, n_bins=200001, range_limits=(-2.5, 2.5))
    100 loops, best of 3: 3.17 ms per loop
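One caveat on the verification: np.histogram treats the last bin as closed, so a value exactly equal to the upper range limit is counted, whereas searchsorted with side='right' assigns that value the index n_bins, which the vectorized approach treats as out of range. Random normal data will almost never hit that edge, so the check above still passes. The sketch below shows the difference on made-up values, together with one possible adjustment that is an assumption of this example and not part of the answer.

```python
import numpy as np

n_bins = 2
edges = np.linspace(0.0, 1.0, n_bins + 1)   # [0., 0.5, 1.]
values = np.array([0.25, 1.0])              # 1.0 sits exactly on the upper limit

# np.histogram counts the value on the upper limit in the last bin
ref = np.histogram(values, bins=n_bins, range=(0.0, 1.0))[0]

# searchsorted places it one past the last bin
idx = np.searchsorted(edges, values, side="right") - 1
print(ref, idx)

# Possible adjustment (assumption): fold the exact upper edge into the last bin
idx_fixed = idx.copy()
idx_fixed[values == edges[-1]] = n_bins - 1
counts = np.bincount(idx_fixed, minlength=n_bins)
print(counts)
```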

The loop-based solution iterates over the first two axes, so let's increase their lengths to show how much the vectorized version gains -

    In [59]: data = np.random.randn(400, 500, 6)

    In [60]: %timeit org_app(data, n_bins=21, range_limits=(-2.5, 2.5))
    1 loops, best of 3: 9.59 s per loop

    In [61]: %timeit hist_laxis(data, n_bins=21, range_limits=(-2.5, 2.5))
    10 loops, best of 3: 44.2 ms per loop

    In [62]: 9590/44.2          # Speedup number
    Out[62]: 216.9683257918552

