Why is ARM NEON not faster than plain C++?

隐瞒了意图╮ 2020-12-22 18:21

Here is the C++ code:

#define ARR_SIZE_TEST ( 8 * 1024 * 1024 )

void cpp_tst_add( unsigned* x, unsigned* y )
{
    for ( register int i = 0; i < ARR_SIZE_T

5 answers
  •  -上瘾入骨i
    2020-12-22 19:00

    The NEON pipeline on Cortex-A8 executes in order and has limited hit-under-miss (no renaming), so you're limited by memory latency, since your data is larger than the L1/L2 cache. Your code uses each value immediately after loading it from memory, so it stalls constantly while waiting for memory. This would explain why the NEON code is slightly (by a tiny amount) slower than the non-NEON code.

    You need to unroll the assembly loops and increase the distance between each load and its use, e.g.:

    vld1.32   {q0}, [%[x]]!
    vld1.32   {q1}, [%[y]]!
    vld1.32   {q2}, [%[x]]!
    vld1.32   {q3}, [%[y]]!
    vadd.i32  q0, q0, q1
    vadd.i32  q2, q2, q3
    ...
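
The same reordering can be sketched with NEON intrinsics instead of hand-written assembly. This is only an illustration under assumptions: the helper name `add_unrolled` is hypothetical, and it assumes the result is written back to `x`, which the truncated question does not confirm. The compiler is free to reschedule intrinsics, so the load/use distance written here is a hint rather than a guarantee. A scalar fallback is used when NEON is not available.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from the question): x[i] += y[i] for n elements.
// Assumption: the sum is stored back into x.
void add_unrolled(std::uint32_t* x, const std::uint32_t* y, std::size_t n)
{
    std::size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Unrolled by two q-registers (8 elements) per iteration:
    // issue all four loads first, then the adds, then the stores,
    // so no add directly follows the load it depends on.
    for (; i + 8 <= n; i += 8) {
        uint32x4_t x0 = vld1q_u32(x + i);
        uint32x4_t y0 = vld1q_u32(y + i);
        uint32x4_t x1 = vld1q_u32(x + i + 4);
        uint32x4_t y1 = vld1q_u32(y + i + 4);
        x0 = vaddq_u32(x0, y0);
        x1 = vaddq_u32(x1, y1);
        vst1q_u32(x + i, x0);
        vst1q_u32(x + i + 4, x1);
    }
#endif
    // Scalar tail (and the whole loop when NEON is not available).
    for (; i < n; ++i)
        x[i] += y[i];
}
```

Calling `add_unrolled(x, y, ARR_SIZE_TEST)` would then replace the original loop. Unrolling further (four or eight q-registers per iteration) follows the same pattern.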
    

    There are plenty of NEON registers, so you can unroll it a lot. Integer code will suffer from the same issue, but to a lesser extent, because the A8 integer pipeline has better hit-under-miss rather than stalling. For benchmarks this large compared to the L1/L2 cache, the bottleneck is going to be memory bandwidth/latency. You might also want to run the benchmark at smaller sizes (4KB..256KB) to see the effect when the data is cached entirely in L1 and/or L2.
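To try the smaller working sets suggested above, a timing helper along these lines could be used. It is a rough sketch, not a rigorous benchmark: `Sample` and `time_adds` are hypothetical names, the fill values 1 and 2 are arbitrary, and reading one element back is only a light guard against the compiler discarding the work.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: average cost plus one value for sanity checking.
struct Sample {
    double ns_per_element;  // average nanoseconds per element-add
    std::uint32_t check;    // one element of x after all passes
};

// Hypothetical helper: run `reps` passes of x[i] += y[i] over two arrays
// of `bytes_per_array` bytes each (assumed to be at least 4) and time them.
Sample time_adds(std::size_t bytes_per_array, int reps)
{
    const std::size_t n = bytes_per_array / sizeof(std::uint32_t);
    std::vector<std::uint32_t> x(n, 1u), y(n, 2u);

    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        for (std::size_t i = 0; i < n; ++i)
            x[i] += y[i];
    const auto t1 = std::chrono::steady_clock::now();

    const double ns =
        std::chrono::duration<double, std::nano>(t1 - t0).count();
    // Reading x[n / 2] keeps the result observable.
    return { ns / (static_cast<double>(n) * reps), x[n / 2] };
}
```

Calling `time_adds` for sizes from 4KB to 256KB and then for the full 8M-element case, with `reps` scaled so that the total number of adds stays roughly constant, should show whether the time per element rises once the arrays no longer fit in L1 and L2.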
