How to optimize histogram statistics with NEON intrinsics?
I want to optimize histogram statistics code with NEON intrinsics, but I haven't succeeded. Here is the C code:

    #define NUM (7*1024*1024)
    uint8 src_data[NUM];
    uint32 histogram_result[256] = {0};
    for (int i = 0; i < NUM; i++) {
        histogram_result[src_data[i]]++;
    }

Histogram statistics are more like serial processing, so they are difficult to optimize with NEON intrinsics. Does anyone know how to optimize this? Thanks in advance.

You can't vectorise the stores directly, but you can pipeline them, and you can vectorise the address calculation on 32-bit platforms (and to a lesser extent on 64-bit platforms). The first