'memcpy'-like function that supports offsets by individual bits?

Submitted by 独自空忆成欢 on 2020-01-02 05:28:13

Question


I was thinking about solving this, but it's looking to be quite a task. If I take this on by myself, I'll likely write it several different ways and pick the best, so I thought I'd ask this question to see if there's a good library that solves this already, or if anyone has thoughts/advice.

void OffsetMemCpy(u8* pDest, u8* pSrc, u8 srcBitOffset, size_t size)
{
    // Or something along these lines. srcBitOffset is 0-7, so the pSrc buffer 
    // needs to be up to one byte longer than it would need to be in memcpy.
    // Maybe explicitly providing the end of the buffer is best.
    // Also note that pSrc has NO alignment assumptions at all.
}
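
For concreteness, a hypothetical call site might look like this (illustrative only; u8 is assumed here to be a typedef for uint8_t):

#include <stdint.h>

typedef uint8_t u8;

// Copy 12 bytes' worth of data that begins 3 bits into 'src'.
// Since srcBitOffset > 0, 'src' must provide 12 + 1 = 13 readable bytes.
u8 src[13] = { /* packed bit stream */ 0 };
u8 dst[12];
OffsetMemCpy(dst, src, 3, 12);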

My application is time-critical, so I want to nail this with minimal overhead; that is the source of the difficulty/complexity. In my case the blocks are likely to be quite small, perhaps 4-12 bytes, so large-scale memcpy machinery (e.g. prefetching) isn't that important. The best result would be the one that benchmarks fastest for a constant 'size' input, between 4 and 12, for randomly unaligned src buffers.

  • Memory should be moved in word sized blocks whenever possible
  • Alignment of these word sized blocks is important. pSrc is unaligned, so we may need to read a few bytes off the front until it is aligned.

Does anyone have, or know of, an existing implementation of something like this? Or does anyone want to take a stab at writing it, making it as clean and efficient as possible?

Edit: It seems people are voting to close this as "too broad". A few narrowing details: AMD64 is the preferred architecture, so let's assume that (which implies little-endian, etc.). An implementation should fit comfortably within the size of an answer, so I don't think this is too broad. I'm asking for answers that each give a single implementation, even though there are a few possible approaches.


Answer 1:


I would start with a simple implementation such as this:

#include <stdint.h>   // uint8_t
#include <stddef.h>   // size_t
#include <limits.h>   // CHAR_BIT

inline void OffsetMemCpy(uint8_t* pDest, const uint8_t* pSrc, const uint8_t srcBitOffset, const size_t size)
{
    if (srcBitOffset == 0)
    {
        for (size_t i = 0; i < size; ++i)
        {
            pDest[i] = pSrc[i];
        }
    }
    else if (size > 0)
    {
        uint8_t v0 = pSrc[0];
        for (size_t i = 0; i < size; ++i)
        {
            uint8_t v1 = pSrc[i + 1];
            pDest[i] = (v0 << srcBitOffset) | (v1 >> (CHAR_BIT - srcBitOffset));
            v0 = v1;            
        }
    }
}

(warning: untested code!).
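
Since the code is untested, a quick sanity check along these lines might be useful. The test below is illustrative only; note that the loop above counts the bit offset from the most significant bit of each source byte:

#include <assert.h>

void test_OffsetMemCpy(void)
{
    // 0xAB 0xCD 0xEF viewed as a bit stream; skip the top 4 bits of the first byte.
    const uint8_t src[3] = { 0xAB, 0xCD, 0xEF };
    uint8_t dst[2] = { 0, 0 };
    OffsetMemCpy(dst, src, 4, 2);
    assert(dst[0] == 0xBC);   // low nibble of 0xAB followed by high nibble of 0xCD
    assert(dst[1] == 0xDE);   // low nibble of 0xCD followed by high nibble of 0xEF
}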

Once this is working, profile it in your application - you may find it's plenty fast enough for your needs, and thereby avoid the pitfalls of premature optimisation. If not, you have a useful baseline reference implementation for further optimisation work.

Be aware that for small copies the overhead of testing for alignment, doing word-sized copies, etc. may well outweigh any benefits, so a simple byte-by-byte loop such as the one above may well be close to optimal.

Note also that optimisations may well be architecture-dependent - micro-optimisations which give a benefit on one CPU may well be counter-productive on another.




Answer 2:


I think the trivial byte-by-byte solution (see @PaulR's answer) is the best approach for small blocks, unless you can satisfy the following additional constraints:

  1. Input buffer is allocated with some padding, i.e. accessing some bytes after the last one does not crash.
  2. Output buffer is also allocated with some padding, and it does not matter if a few bytes after the desired result location are overwritten. If it does matter, then you'll need to do extra work to preserve those after-the-end bytes.
  3. Input and output ranges involved do not overlap (including a few more padding bytes after the end), just like in memcpy.

If you can satisfy them, then it is possible to increase the granularity of the algorithm. It is easy to change @PaulR's answer to use uint64_t words instead of uint8_t bytes everywhere; as a result, it would run faster.
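
To make this concrete, here is a minimal sketch of such a word-at-a-time variant (illustrative only; not necessarily the exact "Paul R x64" code used in the benchmark below). It assumes a little-endian target and counts srcBitOffset from the least significant bit of each byte, which matches the SSE code further down (the byte loop above counts from the most significant bit instead). It also relies on constraint 1, because the word loop may read up to 8 bytes past pSrc + size:

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <limits.h>

void OffsetMemCpy_u64(uint8_t* pDest, const uint8_t* pSrc, const uint8_t srcBitOffset, const size_t size)
{
    if (srcBitOffset == 0)
    {
        memcpy(pDest, pSrc, size);
        return;
    }
    size_t i = 0;
    // Whole 64-bit words: each output word combines the current input word
    // with the low bits of the next one. Unaligned access goes through memcpy.
    for (; i + sizeof(uint64_t) <= size; i += sizeof(uint64_t))
    {
        uint64_t lo, hi;
        memcpy(&lo, pSrc + i, sizeof(lo));
        memcpy(&hi, pSrc + i + sizeof(uint64_t), sizeof(hi));  // may read past pSrc + size
        const uint64_t out = (lo >> srcBitOffset) | (hi << (64 - srcBitOffset));
        memcpy(pDest + i, &out, sizeof(out));
    }
    // Remaining 0-7 bytes, byte by byte, with the same shift direction.
    for (; i < size; ++i)
    {
        pDest[i] = (uint8_t)((pSrc[i] >> srcBitOffset) | (pSrc[i + 1] << (CHAR_BIT - srcBitOffset)));
    }
}

For 4-12 byte copies the word loop runs at most once, so a tuned version might instead finish with one overlapping unaligned 64-bit load/store (when size >= 8) rather than a byte loop.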

We can use SSE to increase the word size further. Since SSE has no way to shift a whole register by a number of bits, we instead shift two 64-bit integers and then glue the results together. The gluing is done with _mm_shuffle_epi8 from SSSE3, which allows shuffling the bytes of an XMM register in an arbitrary way. For the shifting we use _mm_srl_epi64, because that is the only way to shift 64-bit integers by a non-immediate number of bits. I have added the restrict keyword from C (as a macro) to the pointer arguments, because the algorithm will not work if they alias anyway.

Here is the code:

#include <stdint.h>
#include <stddef.h>
#include <tmmintrin.h>   // SSSE3 intrinsics (_mm_shuffle_epi8)

#ifdef _MSC_VER
#define RESTRICT __restrict
#else
#define RESTRICT __restrict__
#endif

void OffsetMemCpy_stgatilov(uint8_t *RESTRICT pDest, const uint8_t *RESTRICT pSrc, const uint8_t srcBitOffset, const size_t size) {
    __m128i bits = (sizeof(size_t) == 8 ? _mm_cvtsi64_si128(srcBitOffset) : _mm_cvtsi32_si128(srcBitOffset));
    const uint8_t *pEnd = pSrc + size;
    while (pSrc < pEnd) {
        __m128i input = _mm_loadu_si128((__m128i*)pSrc);
        // Duplicate byte 7 so that each 64-bit lane holds 8 consecutive source bytes.
        __m128i reg = _mm_shuffle_epi8(input, _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11, 12, 13, 14));
        __m128i shifted = _mm_srl_epi64(reg, bits);
        // Glue the two shifted lanes back into 14 contiguous result bytes.
        __m128i comp = _mm_shuffle_epi8(shifted, _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, -1, -1));
        _mm_storeu_si128((__m128i*)pDest, comp);
        pSrc += 14;  pDest += 14;
    }
}

It processes 14 bytes per iteration. Each iteration is rather simple, and there is also some code before the loop. Here is the assembly code of the whole function body as generated by MSVC 2013 x64:

    movzx   eax, r8b
    movd    xmm3, rax
    lea rax, QWORD PTR [rdx+r9]
    cmp rdx, rax
    jae SHORT $LN1@OffsetMemC
    movdqa  xmm1, XMMWORD PTR __xmm@0e0d0c0b0a0908070706050403020100
    movdqa  xmm2, XMMWORD PTR __xmm@ffff0e0d0c0b0a090806050403020100
    sub rcx, rdx
    npad    11
$LL2@OffsetMemC:
    movdqu  xmm0, XMMWORD PTR [rdx]
    add rdx, 14
    pshufb  xmm0, xmm1
    psrlq   xmm0, xmm3
    pshufb  xmm0, xmm2
    movdqu  XMMWORD PTR [rcx+rdx-14], xmm0
    cmp rdx, rax
    jb  SHORT $LL2@OffsetMemC
$LN1@OffsetMemC:
    ret 0

IACA says the whole function takes 4.5 cycles throughput and 13 cycles latency on Ivy Bridge, assuming the loop is executed once and no cache/branch/decoding issues occur. In the benchmark, however, 7.5 cycles are spent per call on average.

Here are brief results of the throughput benchmark on a 3.4 GHz Ivy Bridge (see more results in the code):

(billions of calls per second)
size = 4:
  0.132  (Paul R)
  0.248  (Paul R x64)
  0.45  (stgatilov)
size = 8:
  0.0782  (Paul R)
  0.249  (Paul R x64)
  0.45  (stgatilov)
size = 12:
  0.0559  (Paul R)
  0.191  (Paul R x64)
  0.453  (stgatilov)

Note, however, that real-world performance can differ drastically from these benchmark results.

The full code, with benchmarking and more verbose results, is here.



Source: https://stackoverflow.com/questions/32043911/memcpy-like-function-that-supports-offsets-by-individual-bits
