SSE _mm_movemask_epi8 equivalent method for ARM NEON

你的背包 2020-12-10 00:05

I decided to continue optimising FAST corner detection and got stuck on the _mm_movemask_epi8 SSE instruction. How can I rewrite it for ARM NEON with uint8x16_t?

4 Answers
  •  醉酒成梦
    2020-12-10 00:21

    Note that I haven't tested any of this, but something like this might work:

    X := the vector that you want to create the mask from
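    A := 0x808080808080...  (i.e. 0x80 in every byte)
    B := 0x00FFFEFDFCFB...  (i.e. 0,-1,-2,-3,... as signed per-byte shift counts, starting from X[7]; a negative count shifts right)
    
    X = vand_u8(X, A);  // Keep bit 7 (the msb) of each byte in X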
    X = vshl_u8(X, B);  // X[7]>>=0; X[6]>>=1; X[5]>>=2; ...
    // Each byte of X now contains its msb shifted 7-N bits to the right, where N
    // is the byte index.
    // Do 3 pairwise adds in order to pack all these into X[0]
    X = vpadd_u8(X, X); 
    X = vpadd_u8(X, X); 
    X = vpadd_u8(X, X);
    // X[0] should now contain the mask. Clear the remaining bytes if necessary
    

    For a 128-bit vector this would need to be done twice, once for each 64-bit half, since vpadd only works on 64-bit vectors.
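
The steps above could be written out in C roughly as follows. This is only a sketch and has not been checked on real ARM hardware: the function names `neon_movemask_u8`, `movemask_u8` and `movemask_u8_scalar` are made up for illustration, and the scalar function is an added fallback and reference for targets without NEON. The NEON path feeds both 64-bit halves into the first pairwise add, so lane 0 ends up with the mask of the low half and lane 1 with the mask of the high half.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: bit i of the result is the top bit of bytes[i]. */
uint16_t movemask_u8_scalar(const uint8_t bytes[16])
{
    uint16_t mask = 0;
    for (int i = 0; i < 16; ++i) {
        mask |= (uint16_t)(((bytes[i] >> 7) & 1u) << i);
    }
    return mask;
}

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper, not part of any library. */
static inline uint16_t neon_movemask_u8(uint8x16_t x)
{
    /* Signed per-lane shift counts; a negative count makes vshlq_u8
       shift right. Lane N of each 64-bit half moves its msb to bit N. */
    static const int8_t shifts[16] = { -7, -6, -5, -4, -3, -2, -1, 0,
                                       -7, -6, -5, -4, -3, -2, -1, 0 };

    uint8x16_t msb  = vandq_u8(x, vdupq_n_u8(0x80));   /* keep bit 7 only */
    uint8x16_t bits = vshlq_u8(msb, vld1q_s8(shifts)); /* msb -> bit N    */

    /* Three pairwise adds. Every lane holds a distinct bit within its
       half, so adding behaves like OR. After the last add, lane 0 is
       the low-half mask and lane 1 is the high-half mask. */
    uint8x8_t sum = vpadd_u8(vget_low_u8(bits), vget_high_u8(bits));
    sum = vpadd_u8(sum, sum);
    sum = vpadd_u8(sum, sum);

    return (uint16_t)(vget_lane_u8(sum, 0) | (vget_lane_u8(sum, 1) << 8));
}
#endif

/* Builds the 16-bit mask from 16 bytes in memory, using NEON when the
   compiler targets it and the scalar reference otherwise. */
uint16_t movemask_u8(const uint8_t bytes[16])
{
    assert(bytes != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return neon_movemask_u8(vld1q_u8(bytes));
#else
    return movemask_u8_scalar(bytes);
#endif
}
```

Calling `movemask_u8` on 16 bytes where only bytes 0, 9 and 15 have their top bit set should give 0x8201, the same value `_mm_movemask_epi8` would return for that input.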
