Here is a C++ code:
#define ARR_SIZE_TEST ( 8 * 1024 * 1024 )
void cpp_tst_add( unsigned* x, unsigned* y )
{
for ( register int i = 0; i < ARR_SIZE_T
The NEON pipeline on Cortex-A8 is in-order executing, and has limited hit-under-miss (no renaming), so you're limited by memory latency (as you're using more than L1/L2 cache size). Your code has immediate dependencies on the values loaded from memory, so it'll stall constantly waiting for memory. This would explain why the NEON code is slightly (by a tiny amount) slower than non-NEON.
You need to unroll the assembly loops and increase the distance between load and use, e.g:
vld1.32 {q0}, [%[x]]!
vld1.32 {q1}, [%[y]]!
vld1.32 {q2}, [%[x]]!
vld1.32 {q3}, [%[y]]!
vadd.i32 q0 ,q0, q1
vadd.i32 q2 ,q2, q3
...
There's plenty of neon registers so you can unroll it a lot. Integer code will suffer the same issue, to a lesser extent because A8 integer has better hit-under-miss instead of stalling. The bottleneck is going to be memory bandwidth/latency for benchmarks so large compared to L1/L2 cache. You might also want to run the benchmark at smaller sizes (4KB..256KB) to see effects when data is cached entirely in L1 and/or L2.