Why is ARM NEON not faster than plain C++?

隐瞒了意图╮ 2020-12-22 18:21

Here is the C++ code:

#define ARR_SIZE_TEST ( 8 * 1024 * 1024 )

void cpp_tst_add( unsigned* x, unsigned* y )
{
    for ( register int i = 0; i < ARR_SIZE_T

5 Answers
  •  既然无缘
    2020-12-22 18:49

    Your C++ code isn't optimized either.

    #define ARR_SIZE_TEST ( 8 * 1024 * 1024 )
    
    void cpp_tst_add( unsigned* x, unsigned* y )
    {
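        unsigned int i = ARR_SIZE_TEST;   // count down so the loop test compares against zero
        do
        {
            *x++ += *y++;                 // add one element of y into x, then advance both pointers
        } while (--i);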
    }
    

    This version takes two fewer cycles per iteration.

    Besides, your benchmark results don't surprise me at all.

    32-bit:

    This function is too simple for NEON. There aren't enough arithmetic operations to leave any room for optimization.

    It is so simple that both the C++ and the NEON versions hit pipeline hazards on almost every iteration, with no real chance of benefiting from the dual-issue capability.

    While the NEON version benefits from processing 4 integers at once, it also suffers much more from every hazard. That's all.
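
    To make "processing 4 integers at once" concrete, here is a minimal sketch using NEON intrinsics. The name add_u32 and its signature are made up for illustration and are not the code from the question. Each iteration is two loads, one add and one store, so there is very little arithmetic to hide memory latency behind. On targets without NEON the block falls back to the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from the question): x[i] += y[i] for n elements.
void add_u32(std::uint32_t* x, const std::uint32_t* y, std::size_t n)
{
    assert(x != nullptr && y != nullptr);
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // Four 32-bit lanes per iteration: two loads, one add, one store.
    for (; i + 4 <= n; i += 4) {
        uint32x4_t vx = vld1q_u32(x + i);
        uint32x4_t vy = vld1q_u32(y + i);
        vst1q_u32(x + i, vaddq_u32(vx, vy));
    }
#endif
    // Scalar tail; this is also the whole loop on targets without NEON.
    for (; i < n; ++i) {
        x[i] += y[i];
    }
}
```

    The vector loop still does one load pair and one store per add, which is why the wider registers alone do not change the picture much for this function.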

    8-bit:

    The ARM integer core is VERY slow at reading individual bytes from memory. So while NEON shows the same characteristics as in the 32-bit case, the ARM code lags far behind.

    16-bit: The same applies here, except that ARM's 16-bit reads aren't THAT bad.

    float: The C++ version compiles to VFP code. The Cortex-A8 doesn't have a full VFP, only VFP Lite, which doesn't pipeline anything, and that hurts badly.

    It's not that NEON behaves strangely when processing 32-bit data. It's just that the ARM code happens to meet its ideal conditions there. Your function is too simple to be a useful benchmark. Try something more complex, such as YUV-to-RGB conversion.
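
    For a sense of how much more arithmetic such a kernel carries per element, here is a scalar reference sketch of one common form of the conversion. It assumes BT.601 video-range input, packed Y,U,V triplets and the widely used 8-bit fixed-point coefficients. The answer does not say which variant or layout it benchmarked, so the function name and data layout are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Clamp an intermediate result into the 0..255 range of one colour channel.
static inline std::uint8_t clamp_u8(int v)
{
    return static_cast<std::uint8_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Hypothetical helper: converts n packed Y,U,V triplets to packed R,G,B.
// Assumes BT.601 video range and 8-bit fixed-point coefficients.
// Relies on >> of a negative int being an arithmetic shift, which holds on
// GCC and Clang and is guaranteed by the language from C++20 on.
void yuv444_to_rgb888(const std::uint8_t* yuv, std::uint8_t* rgb, std::size_t n)
{
    assert(yuv != nullptr && rgb != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        const int c = yuv[3 * i + 0] - 16;   // luma offset
        const int d = yuv[3 * i + 1] - 128;  // U centred on zero
        const int e = yuv[3 * i + 2] - 128;  // V centred on zero
        rgb[3 * i + 0] = clamp_u8((298 * c + 409 * e + 128) >> 8);
        rgb[3 * i + 1] = clamp_u8((298 * c - 100 * d - 208 * e + 128) >> 8);
        rgb[3 * i + 2] = clamp_u8((298 * c + 516 * d + 128) >> 8);
    }
}
```

    Each pixel needs several multiplies, adds, a shift and a clamp per channel, which is the kind of work that gives a SIMD unit something to do between loads and stores.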

    FYI, my fully optimized NEON version runs roughly 20 times as fast as my fully optimized C version and 8 times as fast as my fully optimized ARM assembly version. I hope that gives you some idea of how powerful NEON can be.

    Last but not least, the ARM instruction PLD (preload data) is NEON's best friend. Placed properly, it brings a performance boost of at least 40%.
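
    From C++ the usual way to ask for a preload is the GCC/Clang builtin __builtin_prefetch, which compilers targeting ARM generally lower to PLD. The sketch below is only an illustration: the function name, the prefetch distance and the 64-byte line size are assumptions, and the right distance has to be found by measurement on the actual device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed prefetch distance in elements; tune by measurement.
constexpr std::size_t kPrefetchAhead = 64;

// Hypothetical helper: x[i] += y[i], with read-ahead hints.
void add_u32_prefetch(std::uint32_t* x, const std::uint32_t* y, std::size_t n)
{
    assert(x != nullptr && y != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        // One hint per 16 elements (64 bytes, an assumed cache-line size),
        // and only while the prefetched address is still inside the arrays.
        if ((i & 15u) == 0 && i + kPrefetchAhead < n) {
            __builtin_prefetch(y + i + kPrefetchAhead, 0);  // will be read
            __builtin_prefetch(x + i + kPrefetchAhead, 1);  // will be written
        }
#endif
        x[i] += y[i];
    }
}
```

    The hints do not change the result, only when the data arrives in cache, so the function can be checked against the plain loop and then timed with different distances.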
