Fast method to copy memory with translation - ARGB to BGR

Front-end · Unresolved · 11 answers · 1888 views

野趣味 asked on 2020-12-07 10:47

Overview

I have an image buffer that I need to convert to another format. The original image buffer has four channels, 8 bits per channel: Alpha, Red, Green, and Blue. The destination buffer has three channels, 8 bits per channel: Blue, Green, and Red (the alpha channel is dropped).
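For reference, a naive per-pixel loop for this conversion might look like the following sketch (assuming each source pixel is laid out in memory as A, R, G, B, which matches the byte offsets used in the answers below; the function name is illustrative only):

    #include <stddef.h>
    #include <stdint.h>

    void convert_scalar(const uint8_t *src, size_t pixels, uint8_t *dst) {
        for (size_t i = 0; i < pixels; i++) {
            dst[3 * i + 0] = src[4 * i + 3]; // Blue
            dst[3 * i + 1] = src[4 * i + 2]; // Green
            dst[3 * i + 2] = src[4 * i + 1]; // Red
            // src[4 * i + 0] is Alpha and is discarded
        }
    }

The answers below try to beat this loop by processing many pixels at a time with SIMD.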

11 Answers
  •  既然无缘
    2020-12-07 11:29

    Combining just a poseur's and Jitamaro's answers, if you assume that the inputs and outputs are 16-byte aligned and if you process pixels 4 at a time, you can use a combination of shuffles, masks, ands, and ors to store out using aligned stores. The main idea is to generate four intermediate data sets, then or them together with masks to select the relevant pixel values and write out 3 16-byte sets of pixel data. Note that I did not compile this or try to run it at all.

    EDIT2: More detail about the underlying code structure:

    With SSE2, you get better performance from 16-byte aligned reads and writes of 16 bytes at a time. Since your 3-byte output pixels only land back on a 16-byte boundary once every 16 pixels, we batch up 16 pixels at a time and translate them using a combination of shuffles, masks, and ors over 16 input pixels.

    From LSB to MSB, the inputs look like this, ignoring the specific components:

    s[0]: 0000 0000 0000 0000
    s[1]: 1111 1111 1111 1111
    s[2]: 2222 2222 2222 2222
    s[3]: 3333 3333 3333 3333
    

    and the outputs look like this:

    d[0]: 000 000 000 000 111 1
    d[1]:  11 111 111 222 222 22
    d[2]:   2 222 333 333 333 333
    

    So to generate those outputs, you need to do the following (I will specify the actual transformations later):

    d[0]= combine_0(f_0_low(s[0]), f_0_high(s[1]))
    d[1]= combine_1(f_1_low(s[1]), f_1_high(s[2]))
    d[2]= combine_2(f_2_low(s[2]), f_2_high(s[3]))
    

    Now, what should combine_x look like? If we assume that d is merely s compacted together, we can concatenate two s's with a mask and an or:

    combine_x(left, right)= (left & mask(x)) | (right & ~mask(x))
    

    where 1 means select the left pixel and 0 means select the right pixel:

    mask(0)= 111 111 111 111 000 0
    mask(1)=  11 111 111 000 000 00
    mask(2)=   1 111 000 000 000 000
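
    In SSE2, combine_x maps directly onto three intrinsics. A minimal sketch (here mask is a byte mask with 0xFF wherever the left input should win; note that _mm_andnot_si128 negates its first argument):

    #include <emmintrin.h>

    static inline __m128i combine(__m128i left, __m128i right, __m128i mask) {
        // (left & mask) | (right & ~mask)
        return _mm_or_si128(_mm_and_si128(left, mask),
                            _mm_andnot_si128(mask, right));
    }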

    But the actual transformations (f_x_low, f_x_high) are not that simple. Since we are reversing and removing bytes from the source pixels, the actual transformation is (showing only the first destination for brevity):

    d[0]= 
        s[0][0].Blue s[0][0].Green s[0][0].Red 
        s[0][1].Blue s[0][1].Green s[0][1].Red 
        s[0][2].Blue s[0][2].Green s[0][2].Red 
        s[0][3].Blue s[0][3].Green s[0][3].Red
        s[1][0].Blue s[1][0].Green s[1][0].Red
        s[1][1].Blue
    

    If you translate the above into byte offsets from source to dest, you get:

    d[0]=
        &s[0]+3  &s[0]+2  &s[0]+1
        &s[0]+7  &s[0]+6  &s[0]+5
        &s[0]+11 &s[0]+10 &s[0]+9
        &s[0]+15 &s[0]+14 &s[0]+13
        &s[1]+3  &s[1]+2  &s[1]+1
        &s[1]+7

    (If you take a look at all the s[0] offsets, they match just a poseur's shuffle mask in reverse order.)

    Now, we can generate a shuffle mask to map each source byte to a destination byte (X means we don't care what that value is):

    f_0_low=  3 2 1  7 6 5  11 10 9  15 14 13  X X X  X
    f_0_high= X X X  X X X   X  X X   X  X  X  3 2 1  7
    
    f_1_low=    6 5  11 10 9  15 14 13  X X X   X X X  X  X
    f_1_high=   X X   X  X X   X  X  X  3 2 1   7 6 5  11 10
    
    f_2_low=      9  15 14 13  X  X  X  X X X   X  X  X  X  X  X
    f_2_high=     X   X  X  X  3  2  1  7 6 5   11 10 9  15 14 13
    
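    As a concrete example, here is f_0_low written as a pshufb control vector (a sketch; this is the same constant as shuf0 in the code below). _mm_set_epi8 takes its arguments from most-significant byte to least-significant byte, and any index with its high bit set (e.g. -128) zeroes that output byte, which is how the "X" positions are handled:

    #include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8

    static __m128i apply_f_0_low(__m128i s0) {
        __m128i f_0_low = _mm_set_epi8(
            -128, -128, -128, -128,                   // the four X (don't-care) bytes
            13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3); // byte 0 <- src 3, byte 1 <- src 2, ...
        return _mm_shuffle_epi8(s0, f_0_low);         // pshufb
    }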

    We can further optimize this by looking at the masks we use for each source pixel. If you take a look at the shuffle masks that we use for s[1]:

    f_0_high=  X  X  X  X  X  X  X  X  X  X  X  X  3  2  1  7
    f_1_low=   6  5 11 10  9 15 14 13  X  X  X  X  X  X  X  X
    

    Since the two shuffle masks don't overlap, we can combine them and simply mask off the irrelevant pixels in combine_x, which we already did! The following code performs all these optimizations (plus it assumes that the source and destination addresses are 16-byte aligned). Also, the masks are written out in code in MSB->LSB order, in case you get confused about the ordering.

    EDIT: changed the store to _mm_stream_si128 since you are likely doing a lot of writes and we don't want to necessarily flush the cache. Plus it should be aligned anyway so you get free perf!

    #include <assert.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <tmmintrin.h> // SSSE3 (_mm_shuffle_epi8) plus SSE2 intrinsics
    
    // needs:
    // orig and dest are 16-byte aligned
    // imagesize (in pixels) is a multiple of 16
    // dest has 4 trailing scratch bytes
    void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
        assert((uintptr_t)orig % 16 == 0);
        assert((uintptr_t)dest % 16 == 0);
        assert(imagesize % 16 == 0);
    
        __m128i shuf0 = _mm_set_epi8(
            -128, -128, -128, -128, // top 4 bytes are not used
            13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3); // bottom 12 go to the first pixel
    
        __m128i shuf1 = _mm_set_epi8(
            7, 1, 2, 3,             // top 4 bytes go to the first pixel
            -128, -128, -128, -128, // unused
            13, 14, 15, 9, 10, 11, 5, 6); // bottom 8 go to second pixel

        __m128i shuf2 = _mm_set_epi8(
            10, 11, 5, 6, 7, 1, 2, 3, // top 8 go to second pixel
            -128, -128, -128, -128,   // unused
            13, 14, 15, 9);           // bottom 4 go to third pixel
    
        __m128i shuf3 = _mm_set_epi8(
            13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3, // top 12 go to third pixel
            -128, -128, -128, -128); // unused
    
        __m128i mask0 = _mm_set_epi32(0, -1, -1, -1);
        __m128i mask1 = _mm_set_epi32(0,  0, -1, -1);
        __m128i mask2 = _mm_set_epi32(0,  0,  0, -1);
    
        uint8_t *end = orig + imagesize * 4;
        for (; orig != end; orig += 64, dest += 48) {
            __m128i a= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), shuf0);
            __m128i b= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 1), shuf1);
            __m128i c= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 2), shuf2);
            __m128i d= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 3), shuf3);
    
            // each store is (left & mask) | (right & ~mask); note that
            // _mm_andnot_si128 negates its *first* argument, so the mask goes first
            _mm_stream_si128((__m128i *)dest, _mm_or_si128(_mm_and_si128(a, mask0), _mm_andnot_si128(mask0, b)));
            _mm_stream_si128((__m128i *)dest + 1, _mm_or_si128(_mm_and_si128(b, mask1), _mm_andnot_si128(mask1, c)));
            _mm_stream_si128((__m128i *)dest + 2, _mm_or_si128(_mm_and_si128(c, mask2), _mm_andnot_si128(mask2, d)));
        }
    }
    
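    A minimal usage sketch under the stated assumptions (the image dimensions and variable names are illustrative only). Because the stores above are non-temporal (_mm_stream_si128), it is worth issuing _mm_sfence before the converted data is read elsewhere:

    #include <xmmintrin.h> // _mm_malloc, _mm_free, _mm_sfence

    int main(void) {
        size_t pixels = 1024 * 768;                     // assumed size, multiple of 16
        uint8_t *argb = _mm_malloc(pixels * 4, 16);     // 16-byte aligned ARGB source
        uint8_t *bgr  = _mm_malloc(pixels * 3 + 4, 16); // BGR destination (+4 scratch)
        // ... fill argb ...
        convert(argb, pixels, bgr);
        _mm_sfence(); // order the streaming stores before bgr is consumed
        // ... use bgr ...
        _mm_free(argb);
        _mm_free(bgr);
        return 0;
    }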
