Fastest Method to Split a 32 Bit number into Bytes in C++

Question

I am writing a piece of code designed to do some data compression on CLSID structures. I'm storing them as a compressed stream of 128 bit integers. However, the code in question has to be able to place invalid CLSIDs into the stream. In order to do this, I have left them as one big string. On disk, it would look something like this:

+--------------------------+-----------------+------------------------+
|                          |                 |                        |
| Length of Invalid String | Invalid String  | Compressed Data Stream |
|                          |                 |                        |
+--------------------------+-----------------+------------------------+

To encode the length of the string, I need to output the 32 bit integer that is the length of the string one byte at a time. Here's my current code:

std::vector<BYTE> compressedBytes;
DWORD invalidLength = (DWORD) invalidClsids.length();
compressedBytes.push_back((BYTE) ( invalidLength        & 0x000000FF));
compressedBytes.push_back((BYTE) ((invalidLength >>= 8) & 0x000000FF));
compressedBytes.push_back((BYTE) ((invalidLength >>= 8) & 0x000000FF));
compressedBytes.push_back((BYTE)  (invalidLength >>= 8));

This code won't be called often, but there will need to be a similar structure in the decoding stage called many thousands of times. I'm curious if this is the most efficient method or if someone can come up with one better?

Thanks all!

Billy3

EDIT: After looking over some of the answers, I created this mini test program to see which was the fastest:

// temp.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <windows.h>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <string>
#include <vector>

void testAssignedShifts();
void testRawShifts();
void testUnion();

int _tmain(int argc, _TCHAR* argv[])
{
    std::clock_t startTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testAssignedShifts();
    }
    std::clock_t assignedShiftsFinishedTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testRawShifts();
    }
    std::clock_t rawShiftsFinishedTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testUnion();
    }
    std::clock_t unionFinishedTime = std::clock();
    std::printf(
        "Execution time for assigned shifts: %08u clocks\n"
        "Execution time for raw shifts:      %08u clocks\n"
        "Execution time for union:           %08u clocks\n\n",
        assignedShiftsFinishedTime - startTime,
        rawShiftsFinishedTime - assignedShiftsFinishedTime,
        unionFinishedTime - rawShiftsFinishedTime);
    startTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testAssignedShifts();
    }
    assignedShiftsFinishedTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testRawShifts();
    }
    rawShiftsFinishedTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testUnion();
    }
    unionFinishedTime = std::clock();
    std::printf(
        "Execution time for assigned shifts: %08u clocks\n"
        "Execution time for raw shifts:      %08u clocks\n"
        "Execution time for union:           %08u clocks\n\n"
        "Finished. Terminate!\n\n",
        assignedShiftsFinishedTime - startTime,
        rawShiftsFinishedTime - assignedShiftsFinishedTime,
        unionFinishedTime - rawShiftsFinishedTime);

    system("pause");
    return 0;
}

void testAssignedShifts()
{
    std::string invalidClsids("This is a test string");
    std::vector<BYTE> compressedBytes;
    DWORD invalidLength = (DWORD) invalidClsids.length();
    compressedBytes.push_back((BYTE)  invalidLength);
    compressedBytes.push_back((BYTE) (invalidLength >>= 8));
    compressedBytes.push_back((BYTE) (invalidLength >>= 8));
    compressedBytes.push_back((BYTE) (invalidLength >>= 8));
}
void testRawShifts()
{
    std::string invalidClsids("This is a test string");
    std::vector<BYTE> compressedBytes;
    DWORD invalidLength = (DWORD) invalidClsids.length();
    compressedBytes.push_back((BYTE) invalidLength);
    compressedBytes.push_back((BYTE) (invalidLength >>  8));
    compressedBytes.push_back((BYTE) (invalidLength >>  16));
    compressedBytes.push_back((BYTE) (invalidLength >>  24));
}

typedef union _choice
{
    DWORD dwordVal;
    BYTE bytes[4];
} choice;

void testUnion()
{
    std::string invalidClsids("This is a test string");
    std::vector<BYTE> compressedBytes;
    choice invalidLength;
    invalidLength.dwordVal = (DWORD) invalidClsids.length();
    compressedBytes.push_back(invalidLength.bytes[0]);
    compressedBytes.push_back(invalidLength.bytes[1]);
    compressedBytes.push_back(invalidLength.bytes[2]);
    compressedBytes.push_back(invalidLength.bytes[3]);
}

Running this a few times results in:

Execution time for assigned shifts: 00012484 clocks
Execution time for raw shifts:      00012578 clocks
Execution time for union:           00013172 clocks

Execution time for assigned shifts: 00012594 clocks
Execution time for raw shifts:      00013140 clocks
Execution time for union:           00012782 clocks

Execution time for assigned shifts: 00012500 clocks
Execution time for raw shifts:      00012515 clocks
Execution time for union:           00012531 clocks

Execution time for assigned shifts: 00012391 clocks
Execution time for raw shifts:      00012469 clocks
Execution time for union:           00012500 clocks

Execution time for assigned shifts: 00012500 clocks
Execution time for raw shifts:      00012562 clocks
Execution time for union:           00012422 clocks

Execution time for assigned shifts: 00012484 clocks
Execution time for raw shifts:      00012407 clocks
Execution time for union:           00012468 clocks

Looks to be about a tie between assigned shifts and union. Since I'm going to need the value later, union it is! Thanks!

Billy3

Answer 1:

Just use a union:

assert(sizeof (DWORD) == sizeof (BYTE[4]));   // Sanity check

union either {
    DWORD dw;
    struct {
         BYTE b[4];
    } bytes;
};

either invalidLength;
invalidLength.dw = (DWORD) invalidClsids.length();
compressedBytes.push_back(invalidLength.bytes.b[0]);
compressedBytes.push_back(invalidLength.bytes.b[1]);
compressedBytes.push_back(invalidLength.bytes.b[2]);
compressedBytes.push_back(invalidLength.bytes.b[3]);

NOTE: Unlike the bit-shifting approach in the original question, this code produces endian-dependent output. This matters only if output from a program running on one computer will be read on a computer with different endianness -- but as there seems to be no measurable speed increase from using this method, you might as well use the more portable bit-shifting approach, just in case.

Answer 2:

This is probably as optimized as you'll get. Bit-twiddling operations are some of the fastest available on the processor.

It may be faster to shift the original value by 8, 16 and 24 instead of applying >>= 8 three times, since that cuts out an assignment per byte.

Also, I don't think you need the &. Since you're casting to a BYTE (which should be an 8-bit unsigned char), the value gets truncated to its low 8 bits anyway. (Correct me if I'm wrong.)
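To back up the point about the mask, here is a minimal sketch. It assumes std::uint32_t and std::uint8_t as stand-ins for DWORD and BYTE, and the helper names byteAt and byteAtMasked are made up for illustration. Converting to an unsigned 8-bit type keeps only the low 8 bits, so both helpers should return the same byte.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns byte number `index` (0 = least significant)
// of `value`. The conversion to std::uint8_t keeps only the low 8 bits,
// so no explicit mask is needed.
inline std::uint8_t byteAt(std::uint32_t value, unsigned index)
{
    assert(index < 4); // shifting a 32-bit value by 32 or more is undefined
    return static_cast<std::uint8_t>(value >> (8 * index));
}

// Same extraction with the explicit mask from the question, for comparison.
inline std::uint8_t byteAtMasked(std::uint32_t value, unsigned index)
{
    assert(index < 4);
    return static_cast<std::uint8_t>((value >> (8 * index)) & 0xFFu);
}
```

An optimizing compiler will most likely emit the same instructions for both, so dropping the mask is about readability more than speed.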

All in all, though, these are really minor changes. Profile it to see if it actually makes a difference :P

Answer 3:

You should measure rather than guess at any potential improvement but my first thought is that it may be faster to do a union as follows:

typedef union {
    DWORD d;
    struct {
        BYTE b0;
        BYTE b1;
        BYTE b2;
        BYTE b3;
    } b;
} DWB;

std::vector<BYTE> compBytes;
DWB invLen;
invLen.d = (DWORD) invalidClsids.length();
compBytes.push_back(invLen.b.b3);
compBytes.push_back(invLen.b.b2);
compBytes.push_back(invLen.b.b1);
compBytes.push_back(invLen.b.b0);

That may be the right order for the pushbacks but check just in case - it depends on the endian-ness of the CPU.

Answer 4:

A really quick way is to just treat a DWORD* (a single-element array) as a BYTE* (a 4-element array). The code is also a lot more readable.

Warning: I haven't compiled this

Warning: This makes your code dependent on byte ordering

std::vector<BYTE> compressedBytes;
DWORD invalidLength = (DWORD) invalidClsids.length();
BYTE* lengthParts = reinterpret_cast<BYTE*>(&invalidLength); // a DWORD* does not convert to BYTE* implicitly
static const int kLengthPartsLength = sizeof(DWORD) / sizeof(BYTE);
for(int i = 0; i < kLengthPartsLength; ++i)
    compressedBytes.push_back(lengthParts[i]);

Answer 5:

compressedBytes.push_back(either.bytes.b[0]);
compressedBytes.push_back(either.bytes.b[1]);
compressedBytes.push_back(either.bytes.b[2]);
compressedBytes.push_back(either.bytes.b[3]);

There is an even smarter and faster way! Let's see what this code is doing and how we can improve it.

This code is serializing the integer, one byte at a time. For each byte it's calling push_back, which is checking the free space in the internal vector buffer. If we have no room for another byte, memory reallocation will happen (hint, slow!). Granted, the reallocation will not happen frequently (reallocations typically happen by doubling the existing buffer). Then, the new byte is copied and the internal size is increased by one.
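As a rough illustration of the cost described above, the four per-byte push_back calls can be folded into one range insert. This is a sketch rather than the approach this answer goes on to recommend: it assumes std::uint32_t and std::uint8_t in place of DWORD and BYTE, and appendUint32LE is a made-up name.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical helper: appends `value` to `out` as four bytes,
// least significant byte first, regardless of the host CPU's byte order.
inline void appendUint32LE(std::vector<std::uint8_t>& out, std::uint32_t value)
{
    const std::array<std::uint8_t, 4> bytes = {{
        static_cast<std::uint8_t>(value),
        static_cast<std::uint8_t>(value >> 8),
        static_cast<std::uint8_t>(value >> 16),
        static_cast<std::uint8_t>(value >> 24)
    }};
    // One range insert: the vector can grow at most once for all four
    // bytes instead of checking its capacity on every push_back.
    out.insert(out.end(), bytes.begin(), bytes.end());
}
```

Calling out.reserve() with an estimate of the final stream size before the first append should avoid most reallocations altogether.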

The standard requires vector<> to keep its internal buffer contiguous. vector<> also provides operator[] (), which returns a reference to an element, so &v[0] yields a pointer into that buffer.

So, here is the best code you can come up with:

std::string invalidClsids("This is a test string");
std::vector<BYTE> compressedBytes;
DWORD invalidLength = (DWORD) invalidClsids.length();
compressedBytes.resize(sizeof(DWORD)); // You probably want to make this much larger, to avoid resizing later.
// compressedBytes is as large as the length we want to serialize.
BYTE* p = &compressedBytes[0]; // This is valid code and designed by the standard for such cases. p points to a buffer that is at least as large as a DWORD.
*((DWORD*)p) = invalidLength;  // Copy all bytes in one go!

The above cast can be done in one go with the &compressedBytes[0] statement, but it won't be faster. This is more readable.

NOTE! Serializing this way (or even with the UNION method) is endian-dependent. That is, on an Intel/AMD processor the least significant byte will come first, while on a big-endian machine (PowerPC, Motorola...) the most significant byte will come first. If you want to be neutral, you must use a math method (shifts).
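Since the question says the decoding side is the hot path, here is a sketch of the endian-neutral (shift-based) way to read the length back. It assumes the four bytes were written least significant byte first, as in the question's code. std::uint32_t and std::uint8_t stand in for DWORD and BYTE, and readUint32LE is a made-up name.

```cpp
#include <cstdint>

// Hypothetical decoder: rebuilds a 32-bit value from four bytes stored
// least significant byte first. The result does not depend on the byte
// order of the CPU running this code.
// The caller must guarantee that `p` points at four readable bytes.
inline std::uint32_t readUint32LE(const std::uint8_t* p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

Each byte is widened to 32 bits before it is shifted, so the shifts by 16 and 24 cannot overflow a narrower intermediate type.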

Answer 6:

Do you have to do it one byte at a time? Is there a way you could just memcpy() the whole 32 bits into the stream in one fell swoop? If you have the address of the buffer you're writing to the stream, can you just copy into that?
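A minimal sketch of the memcpy() idea, assuming the stream is a std::vector of bytes as in the question. std::uint32_t and std::uint8_t stand in for DWORD and BYTE, and both helper names are made up. Like the union and pointer-cast approaches, the byte order of the output follows the host CPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: appends the four bytes of `value` exactly as they
// sit in memory, so the output order follows the host CPU's byte order.
inline void appendUint32Native(std::vector<std::uint8_t>& out, std::uint32_t value)
{
    const std::size_t offset = out.size();
    out.resize(offset + sizeof(value));                       // make room first
    std::memcpy(out.data() + offset, &value, sizeof(value));  // copy all four bytes at once
}

// Matching reader: copies four bytes back out of the stream.
// The caller must guarantee that `p` points at four readable bytes.
inline std::uint32_t readUint32Native(const std::uint8_t* p)
{
    std::uint32_t value;
    std::memcpy(&value, p, sizeof(value));
    return value;
}
```

Unlike dereferencing a casted DWORD*, memcpy() has no alignment requirement, so this also works when the length sits at an odd offset in the stream.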

Answer 7:

Perhaps it's possible to take a pointer to the 32-bit variable, convert it to a char pointer and read a char, then add 1 to the pointer and read the next char. Just a theory :) I don't know if it works.
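For what it's worth, the theory can be sketched as below. It assumes std::uint32_t in place of DWORD, and appendBytesViaPointer is a made-up name. Reading an object through an unsigned char pointer is permitted by the language, but the bytes come out in memory order, so the result depends on the CPU's endianness.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: walks over `value` one byte at a time through an
// unsigned char pointer and appends each byte in memory order.
inline void appendBytesViaPointer(std::vector<unsigned char>& out, std::uint32_t value)
{
    // Inspecting any object through unsigned char* is allowed by the language.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&value);
    for (const unsigned char* end = p + sizeof(value); p != end; ++p)
        out.push_back(*p); // read one byte, then advance the pointer by one
}
```

On a little-endian machine, the value 0x01020304 would be appended as 04 03 02 01, and on a big-endian machine as 01 02 03 04.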

Source: https://stackoverflow.com/questions/741212/fastest-method-to-split-a-32-bit-number-into-bytes-in-c

Tags

c++

byte