Fast search and replace some nibble in int [c; microoptimisation]

廉价感情. Submitted on 2019-12-03 16:00:57

This seemed like a fun question, so I wrote a solution without looking at the other answers. On my system it runs about 4.9x as fast as the original code, and about 25% faster than DigitalRoss's solution.

#include <stdint.h>

static inline uint32_t nibble_replace_2(uint32_t x)
{
    uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
    uint32_t y = (~(ONES * SEARCH)) ^ x;
    y &= y >> 2;
    y &= y >> 1;
    y &= ONES;
    y *= 15; /* This is faster than y |= y << 1; y |= y << 2; */
    return x ^ (((SEARCH ^ REPLACE) * ONES) & y);
}

I would explain how it works, but... I think explaining it spoils the fun.
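For readers who would rather have the spoiler anyway, here is a rough annotated sketch of the same steps. The names nibble_replace_naive, nibble_replace_traced and check_traced are made up for this illustration and are not part of the answer's code; the comments are one reading of why each step works.

```c
#include <assert.h>
#include <stdint.h>

/* Reference version (hypothetical helper): visit the eight nibbles one by one. */
static uint32_t nibble_replace_naive(uint32_t x, uint32_t search, uint32_t replace)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 4) {
        uint32_t nib = (x >> shift) & 0xFu;
        if (nib == search)
            nib = replace;
        out |= nib << shift;
    }
    return out;
}

/* Same arithmetic as nibble_replace_2, with each step annotated. */
static uint32_t nibble_replace_traced(uint32_t x)
{
    const uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;

    /* ONES * SEARCH repeats SEARCH in every nibble. XOR with its complement
       turns a nibble equal to SEARCH into 0xF; any other nibble keeps at
       least one zero bit. */
    uint32_t y = (~(ONES * SEARCH)) ^ x;

    /* AND-fold the four bits of each nibble into its lowest bit. Bits that
       leak in from the neighbouring nibble only land in bits 1..3, which
       the mask with ONES discards. */
    y &= y >> 2;
    y &= y >> 1;
    y &= ONES;

    /* 0x1 -> 0xF in each matching nibble; 1 * 15 cannot carry out. */
    y *= 15;

    /* XOR with SEARCH ^ REPLACE flips SEARCH into REPLACE where y is set. */
    return x ^ (((SEARCH ^ REPLACE) * ONES) & y);
}

/* Compare both versions on n scrambled inputs (hypothetical self-check). */
static void check_traced(uint32_t n)
{
    for (uint32_t i = 0; i < n; ++i) {
        uint32_t v = i * 2654435761u; /* wraps modulo 2^32 */
        assert(nibble_replace_traced(v) == nibble_replace_naive(v, 0x5, 0xE));
    }
}
```

As a worked case, 0x12345678 contains a single 0x5 nibble, and both functions return 0x1234E678 for it.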

Note on SIMD: This kind of stuff is very, very easy to vectorize. You don't even have to know how to use SSE or MMX. Here is how I vectorized it:

static void nibble_replace_n(uint32_t *restrict p, uint32_t n)
{
    uint32_t i;
    for (i = 0; i < n; ++i) {
        uint32_t x = p[i];
        uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
        uint32_t y = (~(ONES * SEARCH)) ^ x;
        y &= y >> 2;
        y &= y >> 1;
        y &= ONES;
        y *= 15;
        p[i] = x ^ (((SEARCH ^ REPLACE) * ONES) & y);
    }
}

With GCC, this function is automatically converted to SSE code at -O3, assuming proper use of the -march flag. Passing -ftree-vectorizer-verbose=2 asks GCC to print which loops were vectorized (newer GCC releases deprecate this flag in favor of -fopt-info-vec), e.g.:

$ gcc -std=gnu99 -march=native -O3 -ftree-vectorizer-verbose=2 -Wall -Wextra -o opt opt.c
opt.c:66: note: LOOP VECTORIZED.

Automatic vectorization gave me an extra speed gain of about 64%, and I didn't even have to reach for the processor manual.

Edit: I noticed an additional 48% speedup by changing the types in the auto-vectorized version from uint32_t to uint16_t. This brings the total speedup to about 12x over the original. Changing to uint8_t causes vectorization to fail. I suspect there's some significant extra speed to be found with hand assembly, if it's that important.
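The uint16_t change is not shown above, so the following is only a guess at what it might look like. The function name nibble_replace_n16, the size_t length parameter and the check_n16 helper are assumptions for illustration. It also assumes the data is already stored as uint16_t; reinterpreting an existing uint32_t buffer is a separate question that this sketch does not address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-bit variant: same steps, four nibbles per element. */
static void nibble_replace_n16(uint16_t *restrict p, size_t n)
{
    const uint16_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x1111;
    for (size_t i = 0; i < n; ++i) {
        uint16_t x = p[i];
        /* The operands promote to int, so ~ sets the upper 16 bits;
           converting back to uint16_t drops them again. */
        uint16_t y = (uint16_t)(~(ONES * SEARCH) ^ x);
        y &= y >> 2;
        y &= y >> 1;
        y &= ONES;
        y *= 15; /* at most 0x1111 * 15 = 0xFFFF, still fits in 16 bits */
        p[i] = (uint16_t)(x ^ (((SEARCH ^ REPLACE) * ONES) & y));
    }
}

/* Self-check: run every 16-bit value through the function and compare
   with a plain per-nibble loop. */
static void check_n16(void)
{
    for (uint32_t v = 0; v <= 0xFFFFu; ++v) {
        uint16_t buf[1] = { (uint16_t)v };
        uint16_t expect = 0;
        for (int shift = 0; shift < 16; shift += 4) {
            uint16_t nib = (v >> shift) & 0xFu;
            expect |= (uint16_t)((nib == 0x5 ? 0xE : nib) << shift);
        }
        nibble_replace_n16(buf, 1);
        assert(buf[0] == expect);
    }
}
```

Whether this form vectorizes, and how much it gains, depends on the compiler version and target, so it is worth checking the vectorizer report rather than assuming the figures above carry over.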

Edit 2: Changed *= 7 to *= 15; this invalidates the earlier speed measurements.

Edit 3: Here's a change that is obvious in retrospect:

static inline uint32_t nibble_replace_2(uint32_t x)
{
    uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
    uint32_t y = (~(ONES * SEARCH)) ^ x;
    y &= y >> 2;
    y &= y >> 1;
    y &= ONES;
    return x ^ (y * (SEARCH ^ REPLACE));
}
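
The single multiply in Edit 3 works because y holds 0x1 in each matching nibble and SEARCH ^ REPLACE is at most 0xF, so each product stays inside its own nibble. As a sketch of how that might generalise, here is a version that takes the two nibble values as parameters. The names nibble_replace_any and check_all_pairs are hypothetical, and the function assumes both arguments are in the range 0..15.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical generalisation: search and replace must each be 0..15. */
static inline uint32_t nibble_replace_any(uint32_t x, uint32_t search, uint32_t replace)
{
    const uint32_t ONES = 0x11111111;
    uint32_t y = (~(ONES * search)) ^ x;
    y &= y >> 2;
    y &= y >> 1;
    y &= ONES; /* 0x1 in each nibble that equals search */
    /* search ^ replace <= 0xF, so no product carries into the next nibble. */
    return x ^ (y * (search ^ replace));
}

/* Compare with a per-nibble loop for every search/replace pair on input x. */
static void check_all_pairs(uint32_t x)
{
    for (uint32_t s = 0; s < 16; ++s) {
        for (uint32_t r = 0; r < 16; ++r) {
            uint32_t expect = 0;
            for (int shift = 0; shift < 32; shift += 4) {
                uint32_t nib = (x >> shift) & 0xFu;
                expect |= (nib == s ? r : nib) << shift;
            }
            assert(nibble_replace_any(x, s, r) == expect);
        }
    }
}
```

With compile-time constants, as in the original, the compiler can fold ONES * search and search ^ replace; passing them at run time adds a multiply and an XOR per call, which may or may not matter for a given workload.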