Methods to vectorise histogram in SIMD?

杀马特。学长 韩版系。学妹 提交于 2019-11-27 14:23:14

Histogramming is almost impossible to vectorize, unfortunately.

You can probably optimise the scalar code somewhat however - a common trick is to use two histograms and then combine them at the end. This allows you to overlap loads/increments/stores and thereby bury some of the serial dependencies and associated latencies. Pseudo code:

init histogram 1 to all 0s
init histogram 2 to all 0s
loop
  get input value 1
  get input value 2
  load count for value 1 from histogram 1
  load count for value 2 from histogram 2
  increment count for histogram 1
  increment count for histogram 2
  store count for value 1 to histogram 1
  store count for value 2 to histogram 2
until done
combine histogram 1 and histogram 2

ermig1979 has a Simd project which shows how he has done histograms using a similar approach to what @Paul-R has mentioned but also with SSE2 and AVX2 variants:

Project: https://github.com/ermig1979/Simd

Base file: https://github.com/ermig1979/Simd/blob/master/src/Simd/SimdBaseHistogram.cpp

An AVX2 implementation can be seen here: https://github.com/ermig1979/Simd/blob/master/src/Simd/SimdAvx2Histogram.cpp

A scalar solution can be seen below to illustrate the basic principle of creating multiple histograms that are summed at the end:

void Histogram(const uint8_t * src, size_t width, size_t height, size_t stride, 
    uint32_t * histogram)
{
    uint32_t histograms[4][HISTOGRAM_SIZE];
    memset(histograms, 0, sizeof(uint32_t)*HISTOGRAM_SIZE*4);
    size_t alignedWidth = Simd::AlignLo(width, 4);
    for(size_t row = 0; row < height; ++row)
    {
        size_t col = 0;
        for(; col < alignedWidth; col += 4)
        {
            ++histograms[0][src[col + 0]];
            ++histograms[1][src[col + 1]];
            ++histograms[2][src[col + 2]];
            ++histograms[3][src[col + 3]];
        }
        for(; col < width; ++col)
            ++histograms[0][src[col + 0]];

        src += stride;
    }

    for(size_t i = 0; i < HISTOGRAM_SIZE; ++i)
        histogram[i] = histograms[0][i] + histograms[1][i] + 
            histograms[2][i] + histograms[3][i];
}
Amir ElAttar

@Paul-R, there exists some paper at this link which discusses how to vectorize histogram functions:

SIMD Vectorization of Histogram Functions

Some image processing algorithm working on histograms (e.g. equalization, histogram matching) can be made work with known percentiles -- and for an approximation one can effectively parallelize the search to initial ranges (0,25,50,75,100%) consuming 4 accumulators.

Each item in the input stream must be compared in parallel to all slots, incrementing the frequency. After a certain number of rounds (e.g. n*255 rounds guaranteeing no overflows on uint8_t data type, then accumulating those to uint16_t) the min/max limits in each slot are recalculated based on linear interpolation. And it's of course possible to re-run the sequence based on an estimation how much the new data has changed the estimates of the percentiles.

The algorithm would be variant to evaluation order, which could be mitigated by random sampling and multiple passes.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!