Fastest de-interleave operation in C?

一个人的身影 2021-01-02 00:30

I have a pointer to an array of bytes mixed that contains the interleaved bytes of two distinct arrays array1 and array2. Say mi

6 Answers
  •  夕颜 (original poster)
    2021-01-02 00:55

    Off the top of my head, I don't know of a library function for de-interleaving two-channel byte data. However, it's worth filing a bug report with Apple to request such a function.

    In the meantime, it's pretty easy to vectorize such a function using NEON or SSE intrinsics. Specifically, on ARM you will want to use vld1q_u8 to load two consecutive 16-byte vectors from the source array, vuzpq_u8 to de-interleave them, and vst1q_u8 to store the resulting vectors. The rough sketch below is something I haven't tested or even tried to build, but it should illustrate the general idea. More sophisticated implementations are definitely possible: in particular, NEON can load/store two 16B registers in a single instruction, which the compiler may not do with this code, and some amount of pipelining and/or unrolling may be beneficial depending on how long your buffers are.

    #if defined __ARM_NEON__
    #   include <arm_neon.h>
    #endif
    #include <stdint.h>
    #include <stddef.h>
    
    void deinterleave(uint8_t *mixed, uint8_t *array1, uint8_t *array2, size_t mixedLength) {
    #if defined __ARM_NEON__
        size_t vectors = mixedLength / 32;
        mixedLength %= 32;
        while (vectors --> 0) {
            const uint8x16_t src0 = vld1q_u8(mixed);
            const uint8x16_t src1 = vld1q_u8(mixed + 16);
            const uint8x16x2_t dst = vuzpq_u8(src0, src1);
            vst1q_u8(array1, dst.val[0]);
            vst1q_u8(array2, dst.val[1]);
            mixed += 32;
            array1 += 16;
            array2 += 16;
        }
    #endif
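        /* Scalar loop for the leftover bytes (or for everything when NEON is unavailable). */
        for (size_t i=0; i<mixedLength/2; ++i) {
            array1[i] = mixed[2*i];
            array2[i] = mixed[2*i + 1];
        }
    }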
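
The answer names SSE as the other option but only sketches NEON. A possible SSE2 counterpart might look like the following. This is an illustrative sketch, not part of the original answer: the function name deinterleave_sse2 is hypothetical, and the mask-and-pack approach is one of several ways to do this on x86 (SSSE3's byte shuffle is another). It assumes a little-endian x86 target, so the even-indexed bytes land in the low half of each 16-bit lane.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#   include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical name; same parameter order as deinterleave() in the answer. */
void deinterleave_sse2(const uint8_t *mixed, uint8_t *array1, uint8_t *array2,
                       size_t mixedLength) {
#if defined(__SSE2__)
    const __m128i lowByte = _mm_set1_epi16(0x00FF);
    while (mixedLength >= 32) {
        /* Two unaligned 16-byte loads cover 32 interleaved bytes. */
        const __m128i src0 = _mm_loadu_si128((const __m128i *)mixed);
        const __m128i src1 = _mm_loadu_si128((const __m128i *)(mixed + 16));
        /* Even-indexed bytes: keep the low byte of each 16-bit lane, then
           pack the 16 lanes of src0 and src1 down to 16 bytes. */
        const __m128i even = _mm_packus_epi16(_mm_and_si128(src0, lowByte),
                                              _mm_and_si128(src1, lowByte));
        /* Odd-indexed bytes: shift the high byte down first, then pack. */
        const __m128i odd = _mm_packus_epi16(_mm_srli_epi16(src0, 8),
                                             _mm_srli_epi16(src1, 8));
        _mm_storeu_si128((__m128i *)array1, even);
        _mm_storeu_si128((__m128i *)array2, odd);
        mixed += 32;
        array1 += 16;
        array2 += 16;
        mixedLength -= 32;
    }
#endif
    /* Scalar loop for the leftover bytes (or everything without SSE2). */
    for (size_t i = 0; i < mixedLength / 2; ++i) {
        array1[i] = mixed[2 * i];
        array2[i] = mixed[2 * i + 1];
    }
}
```

The pack step cannot saturate here because every 16-bit lane has already been reduced to the range 0..255. Whether this beats the compiler's own auto-vectorization of the scalar loop is something to measure on your target rather than assume.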

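One way to act on the remark about loading two 16-byte registers in a single instruction is NEON's structure load vld2q_u8, which reads 32 bytes and de-interleaves them as part of the load. Note that this swaps in a different technique from the vld1q_u8 + vuzpq_u8 pair used in the answer's sketch. The function name deinterleave_vld2 is hypothetical, and the code has not been benchmarked, so treat it as a starting point only.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#   include <arm_neon.h>
#endif

/* Hypothetical name; same parameter order as deinterleave() in the answer. */
void deinterleave_vld2(const uint8_t *mixed, uint8_t *array1, uint8_t *array2,
                       size_t mixedLength) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    while (mixedLength >= 32) {
        /* vld2q_u8 loads 32 bytes and splits them on the way in:
           val[0] receives the even-indexed bytes, val[1] the odd-indexed. */
        const uint8x16x2_t dst = vld2q_u8(mixed);
        vst1q_u8(array1, dst.val[0]);
        vst1q_u8(array2, dst.val[1]);
        mixed += 32;
        array1 += 16;
        array2 += 16;
        mixedLength -= 32;
    }
#endif
    /* Scalar loop for the leftover bytes (or everything without NEON). */
    for (size_t i = 0; i < mixedLength / 2; ++i) {
        array1[i] = mixed[2 * i];
        array2[i] = mixed[2 * i + 1];
    }
}
```

On targets without NEON the preprocessor guard drops the vector loop and the scalar loop handles the whole buffer, so the function gives the same results everywhere.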