Efficiently packing 10-bit data on unaligned byte boundaries

Submitted by *爱你&永不变心* on 2021-02-10 22:23:06

Question


I'm trying to do some bit packing on multiples that don't align to byte boundaries. Here's specifically what I'm trying to do.

I have a 512-bit array (8 64-bit integers) of data. Inside that array is 10-bit data aligned to 2-byte boundaries, i.e. each 10-bit value occupies the low bits of a 16-bit slot. What I need to do is strip those 512 bits down to the 320 bits of just the 10-bit data (5 64-bit integers).

I can think of the manual way to do it, where I go through each 2-byte section of the 512-bit array, mask out the 10 bits, OR them together taking byte boundaries into account, and create the output 64-bit integers. Something like this:

void pack512to320bits(uint64 (&array512bits)[8], uint64 (&array320bits)[5])
{
    static const uint64 maskFor10bits = 0x3FF;
    array320bits[0] = (array512bits[0] & maskFor10bits) | ((array512bits[0] & (maskFor10bits << 16)) >> 6) |
                  ((array512bits[0] & (maskFor10bits << 32)) >> 12) | ((array512bits[0] & (maskFor10bits << 48)) >> 18) |
                  ((array512bits[1] & maskFor10bits) << 40) | ((array512bits[1] & (maskFor10bits << 16)) << 34) |
                  ((array512bits[1] & (0xFULL << 32)) << 28);
    array320bits[1] = 0;
    array320bits[2] = 0;
    array320bits[3] = 0;
    array320bits[4] = 0;
}

I know this will work but it seems error prone and doesn't easily expand to larger byte sequences.

Alternatively I could go through the input array, strip out all the 10-bit values into a vector, then concatenate them at the end, again making sure I align to byte boundaries. Something like this:

void pack512to320bits(uint64 (&array512bits)[8], uint64 (&array320bits)[5])
{
    static const uint64 maskFor10bits = 0x3FF;
    std::vector<uint16> maskedPixelBytes(8 * 4);

    for (unsigned int qword = 0; qword < 8; ++qword)
    {
        for (unsigned int pixelBytes = 0; pixelBytes < 4; ++pixelBytes)
        {
            // shift the slot down first so the 10-bit value fits in a uint16
            maskedPixelBytes[qword * 4 + pixelBytes] =
                static_cast<uint16>((array512bits[qword] >> (16 * pixelBytes)) & maskFor10bits);
        }
    }
    array320bits[0] = uint64(maskedPixelBytes[0]) | (uint64(maskedPixelBytes[1]) << 10) | (uint64(maskedPixelBytes[2]) << 20) | (uint64(maskedPixelBytes[3]) << 30) |
                  (uint64(maskedPixelBytes[4]) << 40) | (uint64(maskedPixelBytes[5]) << 50) | (uint64(maskedPixelBytes[6]) << 60);
    array320bits[1] = (uint64(maskedPixelBytes[6]) >> 4) | (uint64(maskedPixelBytes[7]) << 6) ...


    array320bits[2] = 0;
    array320bits[3] = 0;
    array320bits[4] = 0;
}

This way is a little easier to debug/read but is inefficient and again can't be expanded to larger byte sequences. I'm wondering if there is an easier/algorithmic way to do this sort of bit packing.


Answer 1:


What you want can be done, but it depends on certain conditions and what you consider efficient.

First, if the two arrays will always be one 512-bit and one 320-bit array (that is, if the arguments will always be uint64 (&array512bits)[8] and uint64 (&array320bits)[5]), then it's actually orders of magnitude more efficient to hard-code the packing.

If you wanted to handle larger byte sequences, though, you could write an algorithm that accounts for the packing and shifts the bits off accordingly while iterating through the uint64 values of the larger bit array. Going this route, however, introduces branches in the assembly that add computational time (e.g. if (total_shifted < bit_size), etc.). Even with optimizations on, the generated assembly would still be more complex than the manual shifts, and the code would also need to verify that the array sizes fit into each other appropriately, adding more compute time (or at least code complexity).

As an example, consider this manual shift code:

static void pack512to320_manual(uint64 (&a512)[8], uint64 (&a320)[5])
{
    a320[0] = (
        (a512[0] & 0x00000000000003FF)         | // 10 -> 10
        ((a512[0] & 0x0000000003FF0000) >> 6)  | // 10 -> 20
        ((a512[0] & 0x000003FF00000000) >> 12) | // 10 -> 30
        ((a512[0] & 0x03FF000000000000) >> 18) | // 10 -> 40
        ((a512[1] & 0x00000000000003FF) << 40) | // 10 -> 50
        ((a512[1] & 0x0000000003FF0000) << 34) | // 10 -> 60
        ((a512[1] & 0x0000000F00000000) << 28)); // 4  -> 64

    a320[1] = (
        ((a512[1] & 0x000003F000000000) >> 36) | // 6  -> 6
        ((a512[1] & 0x03FF000000000000) >> 42) | // 10 -> 16
        ((a512[2] & 0x00000000000003FF) << 16) | // 10 -> 26
        ((a512[2] & 0x0000000003FF0000) << 10) | // 10 -> 36
        ((a512[2] & 0x000003FF00000000) << 4)  | // 10 -> 46
        ((a512[2] & 0x03FF000000000000) >> 2)  | // 10 -> 56
        ((a512[3] & 0x00000000000000FF) << 56)); // 8  -> 64

    a320[2] = (
        ((a512[3] & 0x0000000000000300) >> 8)  | // 2  -> 2
        ((a512[3] & 0x0000000003FF0000) >> 14) | // 10 -> 12
        ((a512[3] & 0x000003FF00000000) >> 20) | // 10 -> 22
        ((a512[3] & 0x03FF000000000000) >> 26) | // 10 -> 32
        ((a512[4] & 0x00000000000003FF) << 32) | // 10 -> 42
        ((a512[4] & 0x0000000003FF0000) << 26) | // 10 -> 52
        ((a512[4] & 0x000003FF00000000) << 20) | // 10 -> 62
        ((a512[4] & 0x0003000000000000) << 14)); // 2  -> 64

    a320[3] = (
        ((a512[4] & 0x03FC000000000000) >> 50) | // 8  -> 8
        ((a512[5] & 0x00000000000003FF) << 8)  | // 10 -> 18
        ((a512[5] & 0x0000000003FF0000) << 2)  | // 10 -> 28
        ((a512[5] & 0x000003FF00000000) >> 4)  | // 10 -> 38
        ((a512[5] & 0x03FF000000000000) >> 10) | // 10 -> 48
        ((a512[6] & 0x00000000000003FF) << 48) | // 10 -> 58
        ((a512[6] & 0x00000000003F0000) << 42)); // 6  -> 64

    a320[4] = (
        ((a512[6] & 0x0000000003C00000) >> 22) | // 4  -> 4
        ((a512[6] & 0x000003FF00000000) >> 28) | // 10 -> 14
        ((a512[6] & 0x03FF000000000000) >> 34) | // 10 -> 24
        ((a512[7] & 0x00000000000003FF) << 24) | // 10 -> 34
        ((a512[7] & 0x0000000003FF0000) << 18) | // 10 -> 44
        ((a512[7] & 0x000003FF00000000) << 12) | // 10 -> 54
        ((a512[7] & 0x03FF000000000000) << 6));  // 10 -> 64
}

This code will only accept arrays of uint64 that fit into each other with the 10-bit boundary taken into account, and it shifts accordingly so that the 512-bit array is packed into the 320-bit array. Doing something like uint64* a512p = a512; pack512to320_manual(a512p, a320); will fail at compile time, since a512p is not a uint64 (&)[8] (i.e. type safety). Note that this code is written out in full to show the bit-shifting sequences, but you could use #defines or an enum to avoid "magic numbers" and make the code potentially clearer.

If you wanted to expand this to take larger byte sequences into account, you could do something like the following:

static const std::size_t bit_size = sizeof(uint64) * 8; // 64; also defined in the full test code below

template < std::size_t X, std::size_t Y >
static void pack512to320_loop(const uint64 (&array512bits)[X], uint64 (&array320bits)[Y])
{
    const uint64* start = array512bits;
    const uint64* end = array512bits + (X-1);
    uint64 tmp = *start;
    uint64 tmask = 0;
    int i = 0, tot = 0, stot = 0, rem = 0, z = 0;
    bool excess = false;
    for (z = 0; z < (int)Y; ++z) { array320bits[z] = 0; } // output must start zeroed (|= below)
    while (start <= end) {
        while (stot < bit_size) {
            array320bits[i] |= ((tmp & 0x00000000000003FF) << tot);
            tot += 10; // increase shift left by 10 bits
            tmp = tmp >> 16; // shift off 2 bytes
            stot += 16; // increase shifted total
            if ((excess = ((tot + 10) >= bit_size))) { break; }
        }
        if (stot == bit_size) {
            tmp = *(++start); // get next value
            stot = 0;
        }
        if (excess) {
            rem = (bit_size - tot); // remainder bits to shift off
            tot = 0;
            // create the mask
            tmask = 0;
            for (z = 0; z < rem; ++z) { tmask |= (1 << z); }
            // get the last bits
            array320bits[i++] |= ((tmp & tmask) << (bit_size - rem));
            // shift off and adjust
            tmp = tmp >> rem;
            rem = (10 - rem);
            // new mask
            tmask = 0;
            for (z = 0; z < rem; ++z) { tmask |= (1 << z); }
            array320bits[i] = (tmp & tmask);

            tot += rem; // increase shift left by remainder bits
            tmp = tmp >> (rem + 6); // shift off the rest of this 2-byte group
            stot += 16;
            excess = false;
        }
    }
}

This code also takes the byte boundaries into account and packs the 512-bit array into the 320-bit array. It does not, however, do any error checking to ensure the sizes properly match, so if X % 8 != 0 or Y % 5 != 0 (where X and Y > 0), you could get invalid results! It is also much slower than the manual version because of the looping, temporaries and shifting involved, and a first-time reader would likely need more time to decipher the full intent and context of the loop code than of the bit-shifting version.

If you want something in-between the two, you could use the manual packing function and iterate over the larger byte arrays in groups of 8 and 5 to ensure the bytes align properly; something similar to the following:

template < std::size_t X, std::size_t Y >
static void pack512to320_manual_loop(const uint64 (&array512bits)[X], uint64 (&array320bits)[Y])
{
    if (((X == 0) || (X % 8 != 0)) || ((Y == 0) || (Y % 5 != 0)) || (X / 8 != Y / 5)) {
        // handle invalid sizes how you need here
        std::cerr << "Invalid sizes!" << std::endl;
        return;
    }
    uint64* a320 = array320bits;
    const uint64* end = array512bits + (X-1);
    for (const uint64* a512 = array512bits; a512 < end; a512 += 8) {
        *a320 = (
            (a512[0] & 0x00000000000003FF)         | // 10 -> 10
            ((a512[0] & 0x0000000003FF0000) >> 6)  | // 10 -> 20
            ((a512[0] & 0x000003FF00000000) >> 12) | // 10 -> 30
            ((a512[0] & 0x03FF000000000000) >> 18) | // 10 -> 40
            ((a512[1] & 0x00000000000003FF) << 40) | // 10 -> 50
            ((a512[1] & 0x0000000003FF0000) << 34) | // 10 -> 60
            ((a512[1] & 0x0000000F00000000) << 28)); // 4  -> 64
        ++a320;

        *a320 = (
            ((a512[1] & 0x000003F000000000) >> 36) | // 6  -> 6
            ((a512[1] & 0x03FF000000000000) >> 42) | // 10 -> 16
            ((a512[2] & 0x00000000000003FF) << 16) | // 10 -> 26
            ((a512[2] & 0x0000000003FF0000) << 10) | // 10 -> 36
            ((a512[2] & 0x000003FF00000000) << 4)  | // 10 -> 46
            ((a512[2] & 0x03FF000000000000) >> 2)  | // 10 -> 56
            ((a512[3] & 0x00000000000000FF) << 56)); // 8  -> 64
        ++a320;

        *a320 = (
            ((a512[3] & 0x0000000000000300) >> 8)  | // 2  -> 2
            ((a512[3] & 0x0000000003FF0000) >> 14) | // 10 -> 12
            ((a512[3] & 0x000003FF00000000) >> 20) | // 10 -> 22
            ((a512[3] & 0x03FF000000000000) >> 26) | // 10 -> 32
            ((a512[4] & 0x00000000000003FF) << 32) | // 10 -> 42
            ((a512[4] & 0x0000000003FF0000) << 26) | // 10 -> 52
            ((a512[4] & 0x000003FF00000000) << 20) | // 10 -> 62
            ((a512[4] & 0x0003000000000000) << 14)); // 2  -> 64
        ++a320;

        *a320 = (
            ((a512[4] & 0x03FC000000000000) >> 50) | // 8  -> 8
            ((a512[5] & 0x00000000000003FF) << 8)  | // 10 -> 18
            ((a512[5] & 0x0000000003FF0000) << 2)  | // 10 -> 28
            ((a512[5] & 0x000003FF00000000) >> 4)  | // 10 -> 38
            ((a512[5] & 0x03FF000000000000) >> 10) | // 10 -> 48
            ((a512[6] & 0x00000000000003FF) << 48) | // 10 -> 58
            ((a512[6] & 0x00000000003F0000) << 42)); // 6  -> 64
        ++a320;

        *a320 = (
            ((a512[6] & 0x0000000003C00000) >> 22) | // 4  -> 4
            ((a512[6] & 0x000003FF00000000) >> 28) | // 10 -> 14
            ((a512[6] & 0x03FF000000000000) >> 34) | // 10 -> 24
            ((a512[7] & 0x00000000000003FF) << 24) | // 10 -> 34
            ((a512[7] & 0x0000000003FF0000) << 18) | // 10 -> 44
            ((a512[7] & 0x000003FF00000000) << 12) | // 10 -> 54
            ((a512[7] & 0x03FF000000000000) << 6));  // 10 -> 64
        ++a320;
    }
}

This is similar to the manual packing function and adds only a trivial amount of time for the checks, but it can handle larger arrays that pack into each other cleanly (again, written out in full to show the sequence).

Timing the examples above, compiled with g++ 4.2.1 using -O3 on an i7 @ 2.2 GHz, yielded these average times:

pack512to320_loop: 0.135 us

pack512to320_manual: 0.0017 us

pack512to320_manual_loop: 0.0020 us

And here is the test code used to test the input/output and general timing:

#include <iostream>
#include <ctime>
#if defined(_MSC_VER)
    #include <cstdint>
    #include <windows.h>
    #define timesruct LARGE_INTEGER
    #define dotick(v) QueryPerformanceCounter(&v)
    timesruct freq;
#else
    #define timesruct struct timespec
    #define dotick(v) clock_gettime(CLOCK_MONOTONIC, &v)
#endif

typedef unsigned long long uint64; // 64-bit unsigned alias used throughout

static const std::size_t bit_size = sizeof(uint64) * 8;

template < std::size_t X, std::size_t Y >
static void pack512to320_loop(const uint64 (&array512bits)[X], uint64 (&array320bits)[Y])
{
    const uint64* start = array512bits;
    const uint64* end = array512bits + (X-1);
    uint64 tmp = *start;
    uint64 tmask = 0;
    int i = 0, tot = 0, stot = 0, rem = 0, z = 0;
    bool excess = false;
    // this line is only here for validity's sake;
    // it was commented out during testing for performance
    for (z = 0; z < Y; ++z) { array320bits[z] = 0; }
    while (start <= end) {
        while (stot < bit_size) {
            array320bits[i] |= ((tmp & 0x00000000000003FF) << tot);
            tot += 10; // increase shift left by 10 bits
            tmp = tmp >> 16; // shift off 2 bytes
            stot += 16; // increase shifted total
            if ((excess = ((tot + 10) >= bit_size))) { break; }
        }
        if (stot == bit_size) {
            tmp = *(++start); // get next value
            stot = 0;
        }
        if (excess) {
            rem = (bit_size - tot); // remainder bits to shift off
            tot = 0;
            // create the mask
            tmask = 0;
            for (z = 0; z < rem; ++z) { tmask |= (1 << z); }
            // get the last bits
            array320bits[i++] |= ((tmp & tmask) << (bit_size - rem));
            // shift off and adjust
            tmp = tmp >> rem;
            rem = (10 - rem);
            // new mask
            tmask = 0;
            for (z = 0; z < rem; ++z) { tmask |= (1 << z); }
            array320bits[i] = (tmp & tmask);

            tot += rem; // increase shift left by remainder bits
            tmp = tmp >> (rem + 6); // shift off the rest of this 2-byte group
            stot += 16;
            excess = false;
        }
    }
}

template < std::size_t X, std::size_t Y >
static void pack512to320_manual_loop(const uint64 (&array512bits)[X], uint64 (&array320bits)[Y])
{
    if (((X == 0) || (X % 8 != 0)) || ((Y == 0) || (Y % 5 != 0)) || (X / 8 != Y / 5)) {
        // handle invalid sizes how you need here
        std::cerr << "Invalid sizes!" << std::endl;
        return;
    }
    uint64* a320 = array320bits;
    const uint64* end = array512bits + (X-1);
    for (const uint64* a512 = array512bits; a512 < end; a512 += 8) {
        *a320 = (
            (a512[0] & 0x00000000000003FF)         | // 10 -> 10
            ((a512[0] & 0x0000000003FF0000) >> 6)  | // 10 -> 20
            ((a512[0] & 0x000003FF00000000) >> 12) | // 10 -> 30
            ((a512[0] & 0x03FF000000000000) >> 18) | // 10 -> 40
            ((a512[1] & 0x00000000000003FF) << 40) | // 10 -> 50
            ((a512[1] & 0x0000000003FF0000) << 34) | // 10 -> 60
            ((a512[1] & 0x0000000F00000000) << 28)); // 4  -> 64
        ++a320;

        *a320 = (
            ((a512[1] & 0x000003F000000000) >> 36) | // 6  -> 6
            ((a512[1] & 0x03FF000000000000) >> 42) | // 10 -> 16
            ((a512[2] & 0x00000000000003FF) << 16) | // 10 -> 26
            ((a512[2] & 0x0000000003FF0000) << 10) | // 10 -> 36
            ((a512[2] & 0x000003FF00000000) << 4)  | // 10 -> 46
            ((a512[2] & 0x03FF000000000000) >> 2)  | // 10 -> 56
            ((a512[3] & 0x00000000000000FF) << 56)); // 8  -> 64
        ++a320;

        *a320 = (
            ((a512[3] & 0x0000000000000300) >> 8)  | // 2  -> 2
            ((a512[3] & 0x0000000003FF0000) >> 14) | // 10 -> 12
            ((a512[3] & 0x000003FF00000000) >> 20) | // 10 -> 22
            ((a512[3] & 0x03FF000000000000) >> 26) | // 10 -> 32
            ((a512[4] & 0x00000000000003FF) << 32) | // 10 -> 42
            ((a512[4] & 0x0000000003FF0000) << 26) | // 10 -> 52
            ((a512[4] & 0x000003FF00000000) << 20) | // 10 -> 62
            ((a512[4] & 0x0003000000000000) << 14)); // 2  -> 64
        ++a320;

        *a320 = (
            ((a512[4] & 0x03FC000000000000) >> 50) | // 8  -> 8
            ((a512[5] & 0x00000000000003FF) << 8)  | // 10 -> 18
            ((a512[5] & 0x0000000003FF0000) << 2)  | // 10 -> 28
            ((a512[5] & 0x000003FF00000000) >> 4)  | // 10 -> 38
            ((a512[5] & 0x03FF000000000000) >> 10) | // 10 -> 48
            ((a512[6] & 0x00000000000003FF) << 48) | // 10 -> 58
            ((a512[6] & 0x00000000003F0000) << 42)); // 6  -> 64
        ++a320;

        *a320 = (
            ((a512[6] & 0x0000000003C00000) >> 22) | // 4  -> 4
            ((a512[6] & 0x000003FF00000000) >> 28) | // 10 -> 14
            ((a512[6] & 0x03FF000000000000) >> 34) | // 10 -> 24
            ((a512[7] & 0x00000000000003FF) << 24) | // 10 -> 34
            ((a512[7] & 0x0000000003FF0000) << 18) | // 10 -> 44
            ((a512[7] & 0x000003FF00000000) << 12) | // 10 -> 54
            ((a512[7] & 0x03FF000000000000) << 6));  // 10 -> 64
        ++a320;
    }
}

static void pack512to320_manual(uint64 (&a512)[8], uint64 (&a320)[5])
{
    a320[0] = (
        (a512[0] & 0x00000000000003FF)         | // 10 -> 10
        ((a512[0] & 0x0000000003FF0000) >> 6)  | // 10 -> 20
        ((a512[0] & 0x000003FF00000000) >> 12) | // 10 -> 30
        ((a512[0] & 0x03FF000000000000) >> 18) | // 10 -> 40
        ((a512[1] & 0x00000000000003FF) << 40) | // 10 -> 50
        ((a512[1] & 0x0000000003FF0000) << 34) | // 10 -> 60
        ((a512[1] & 0x0000000F00000000) << 28)); // 4  -> 64

    a320[1] = (
        ((a512[1] & 0x000003F000000000) >> 36) | // 6  -> 6
        ((a512[1] & 0x03FF000000000000) >> 42) | // 10 -> 16
        ((a512[2] & 0x00000000000003FF) << 16) | // 10 -> 26
        ((a512[2] & 0x0000000003FF0000) << 10) | // 10 -> 36
        ((a512[2] & 0x000003FF00000000) << 4)  | // 10 -> 46
        ((a512[2] & 0x03FF000000000000) >> 2)  | // 10 -> 56
        ((a512[3] & 0x00000000000000FF) << 56)); // 8  -> 64

    a320[2] = (
        ((a512[3] & 0x0000000000000300) >> 8)  | // 2  -> 2
        ((a512[3] & 0x0000000003FF0000) >> 14) | // 10 -> 12
        ((a512[3] & 0x000003FF00000000) >> 20) | // 10 -> 22
        ((a512[3] & 0x03FF000000000000) >> 26) | // 10 -> 32
        ((a512[4] & 0x00000000000003FF) << 32) | // 10 -> 42
        ((a512[4] & 0x0000000003FF0000) << 26) | // 10 -> 52
        ((a512[4] & 0x000003FF00000000) << 20) | // 10 -> 62
        ((a512[4] & 0x0003000000000000) << 14)); // 2  -> 64

    a320[3] = (
        ((a512[4] & 0x03FC000000000000) >> 50) | // 8  -> 8
        ((a512[5] & 0x00000000000003FF) << 8)  | // 10 -> 18
        ((a512[5] & 0x0000000003FF0000) << 2)  | // 10 -> 28
        ((a512[5] & 0x000003FF00000000) >> 4)  | // 10 -> 38
        ((a512[5] & 0x03FF000000000000) >> 10) | // 10 -> 48
        ((a512[6] & 0x00000000000003FF) << 48) | // 10 -> 58
        ((a512[6] & 0x00000000003F0000) << 42)); // 6  -> 64

    a320[4] = (
        ((a512[6] & 0x0000000003C00000) >> 22) | // 4  -> 4
        ((a512[6] & 0x000003FF00000000) >> 28) | // 10 -> 14
        ((a512[6] & 0x03FF000000000000) >> 34) | // 10 -> 24
        ((a512[7] & 0x00000000000003FF) << 24) | // 10 -> 34
        ((a512[7] & 0x0000000003FF0000) << 18) | // 10 -> 44
        ((a512[7] & 0x000003FF00000000) << 12) | // 10 -> 54
        ((a512[7] & 0x03FF000000000000) << 6));  // 10 -> 64
}

template < std::size_t N >
static void printit(uint64 (&arr)[N])
{
    for (std::size_t i = 0; i < N; ++i) {
        std::cout << "arr[" << i << "] = " << arr[i] << std::endl;
    }
}

static double elapsed_us(timesruct init, timesruct end)
{
    #if defined(_MSC_VER)
        if (freq.LowPart == 0) { QueryPerformanceFrequency(&freq); }
        return (static_cast<double>(((end.QuadPart - init.QuadPart) * 1000000)) / static_cast<double>(freq.QuadPart));
    #else
        return ((end.tv_sec - init.tv_sec) * 1000000) + (static_cast<double>((end.tv_nsec - init.tv_nsec)) / 1000);
    #endif
}

int main(int argc, char* argv[])
{
    uint64 val = 0x039F039F039F039F;
    uint64 a512[] = { val, val, val, val, val, val, val, val };
    uint64 a320[] = { 0, 0, 0, 0, 0 };
    int max_cnt = 1000000;
    timesruct init, end;
    std::cout << std::hex;

    dotick(init);
    for (int i = 0; i < max_cnt; ++i) {
        pack512to320_loop(a512, a320);
    }
    dotick(end);
    printit(a320);
    // rough estimate of timing / divide by iterations
    std::cout << "avg. us = " << (elapsed_us(init, end) / max_cnt) << " us" << std::endl;

    dotick(init);
    for (int i = 0; i < max_cnt; ++i) {
        pack512to320_manual(a512, a320);
    }
    dotick(end);
    printit(a320);
    // rough estimate of timing / divide by iterations
    std::cout << "avg. us = " << (elapsed_us(init, end) / max_cnt) << " us" << std::endl;

    dotick(init);
    for (int i = 0; i < max_cnt; ++i) {
        pack512to320_manual_loop(a512, a320);
    }
    dotick(end);
    printit(a320);
    // rough estimate of timing / divide by iterations
    std::cout << "avg. us = " << (elapsed_us(init, end) / max_cnt) << " us" << std::endl;

    return 0;
}

Again, this is just generic test code and your results might vary.

Hope that helps.



Source: https://stackoverflow.com/questions/34775546/efficiently-packing-10-bit-data-on-unaligned-byte-boundries
