Bitwise operation in C language (0x80, 0xFF, << )

Submitted by ╄→尐↘猪︶ㄣ on 2020-05-15 10:45:12

Question


I have a problem understanding this code. What I know is that we have passed code into an assembler that has converted it into "byte code". Now I have a virtual machine that is supposed to read this code, and this function is supposed to read the first bytecode instruction. I don't understand what is happening in this code. I guess we are trying to read the bytecode, but I don't understand how it is done.

static int32_t  bytecode_to_int32(const uint8_t *bytecode, size_t size)
{
    int32_t result;
    t_bool  sign;
    int     i;

    result = 0;
    sign = (t_bool)(bytecode[0] & 0x80);
    i = 0;
    while (size)
    {
        if (sign)
            result += ((bytecode[size - 1] ^ 0xFF) << (i++ * 8));
        else
            result += bytecode[size - 1] << (i++ * 8);
        size--;
    }
    if (sign)
        result = ~(result);
    return (result);
}

Answer 1:


This code is somewhat badly written, with lots of operations crammed onto single lines, and it contains various potential bugs. It looks brittle.

  • bytecode[0] & 0x80 simply reads the MSB sign bit of the first byte, assuming 2's complement or similar, and converts it to a boolean.
  • The loop iterates backwards over the array, from the last byte (the least significant, given big-endian input) to the first (the most significant).
  • If the sign was negative, the code will perform an XOR of the data byte with 0xFF, basically inverting all bits in the data byte. The result of the XOR is an int.
  • The data byte (or the result of the above XOR) is then bit shifted i * 8 bits to the left. The data is always implicitly promoted to int, so if the shifted value ends up larger than INT_MAX (for example a byte with its top bit set shifted left by 24 bits), there's a fat undefined behavior bug here. It would be much safer practice to cast to uint32_t before the shift, carry out the shift, then convert to a signed type afterwards - see the sketch after this list.
  • The resulting int is converted to int32_t - these could be the same type or different types depending on system.
  • i is incremented by 1, size is decremented by 1.
  • If sign was negative, the int32_t is finally inverted with ~. That toggles the data bits back to their original values, and all the zero bits that were shifted in above them are turned into ones. Whether this is intentional or not, I cannot tell, but for negative inputs it effectively sign-extends the value: for example, the 2-byte input 0xFF 0x85 (-123) ends up as 0xFFFFFF85, which is -123 as a 32-bit 2's complement number.
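Here is a minimal sketch of the safer practice mentioned in the shifting bullet above (my own illustration, not part of the question or answer): widen each byte to uint32_t before shifting, and only convert to a signed type at the very end.

#include <stdint.h>
#include <stddef.h>

/* Assemble up to 4 big-endian bytes without signed-overflow UB.
   Note: this sketch does not sign-extend for size < 4 - it only
   demonstrates the unsigned-shift technique. */
static int32_t bytes_to_int32_safe(const uint8_t *bytes, size_t size)
{
    uint32_t acc = 0;

    for (size_t i = 0; i < size; i++)
        acc = (acc << 8) | (uint32_t)bytes[i];  /* unsigned shift: well defined */

    return (int32_t)acc;  /* conversion of values > INT32_MAX is implementation-defined */
}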

My take is that the bytecode[size - 1] ^ 0xFF (which is equivalent to ~ on the byte) was meant to toggle the data bits, so that they toggle back to their original values when ~ is applied at the end. A programmer has to document such tricks with comments if they are anything close to competent.
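To see the trick in action, here is a small trace of my own (not from the question) for the 2-byte big-endian input 0xFF 0x85, which is -123 in 16-bit 2's complement:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* bytes {0xFF, 0x85}, size = 2, sign bit of the first byte is set */
    uint32_t lo = 0x85u ^ 0xFFu;         /* 0x7A - data bits toggled          */
    uint32_t hi = (0xFFu ^ 0xFFu) << 8;  /* 0x00                              */
    int32_t result = (int32_t)(hi + lo); /* 0x0000007A, as the loop builds it */

    result = ~result;                    /* 0xFFFFFF85 = -123 in 2's complement */
    printf("%d\n", result);              /* prints -123 */
    return 0;
}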


Anyway, don't use this code. If the intention was merely to swap the byte order (endianness) of a 4-byte integer, then this code must be rewritten from scratch.

That's properly done as:

#include <stdint.h>  /* int32_t, uint8_t */

static int32_t big32_to_little32 (const uint8_t* bytes)
{
  uint32_t result = (uint32_t)bytes[0] << 24 | 
                    (uint32_t)bytes[1] << 16 | 
                    (uint32_t)bytes[2] <<  8 | 
                    (uint32_t)bytes[3] <<  0 ; 

  return (int32_t)result;
}

Anything more complicated than the above is highly questionable code. We need not worry about signs being a special case; the above code preserves the original signedness format.
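A quick usage example of my own (assuming big32_to_little32 above is in the same file) showing that both positive and negative big-endian inputs come out right:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint8_t positive[4] = { 0x00, 0x00, 0x01, 0x2C };  /* 300 */
    const uint8_t negative[4] = { 0xFF, 0xFF, 0xFF, 0x85 };  /* -123 in 2's complement */

    printf("%d\n", big32_to_little32(positive));  /* 300  */
    printf("%d\n", big32_to_little32(negative));  /* -123 */
    return 0;
}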




Answer 2:


So A ^ 0xFF toggles the bits in A: if you have 10101100 XORed with 11111111, it becomes 01010011. I am not sure why they didn't use ~ here. The ^ is the XOR operator, so you are XORing with 0xFF.

The << is a left bit shift (a shift "up"). In other words, A << 1 is equivalent to multiplying A by 2.

The >> shifts down, i.e. to the right, which is equivalent to dividing by 2.

The ~ inverts all the bits of its operand.
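A tiny illustration of those four operators (my own example, using the 10101100 bit pattern from above):

#include <stdio.h>

int main(void)
{
    unsigned int a = 0xACu;              /* 1010 1100 */

    printf("%02X\n", a ^ 0xFFu);         /* 53: ^ toggles every bit              */
    printf("%02X\n", (a << 1) & 0xFFu);  /* 58: << 1 multiplies by 2 (low byte)  */
    printf("%02X\n", a >> 1);            /* 56: >> 1 divides by 2                */
    printf("%02X\n", ~a & 0xFFu);        /* 53: ~ inverts all bits (masked here) */
    return 0;
}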

Note that it's better to initialise variables at declaration; it costs no additional processing whatsoever to do it that way.

sign = (t_bool)(bytecode[0] & 0x80); - the sign of the number is stored in the 8th bit (position 7, counting from 0), which is where the 0x80 comes from. So it literally checks whether the sign bit is set in the first byte of bytecode and, if so, stores that in the sign variable.

Essentially, if it's unsigned (the sign bit is not set), it copies the bytes from bytecode into result one byte at a time.

If the data is negative, it flips the bits, copies the bytes, and then, when it's done copying, flips the bits back.

Personally, with this kind of thing I prefer to get the data, put it in htons() format (network byte order), and then memcpy it to an allocated array, so it is stored in an endian-agnostic way; when I retrieve the data I use ntohs() to convert it back to the format used by the computer. htons() and ntohs() are standard POSIX functions and are used in networking and platform-agnostic data formatting / storage / communication all the time.
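A sketch of that store/retrieve pattern for a 4-byte value (my own example; note that the 32-bit variants are htonl()/ntohl(), while htons()/ntohs() handle 16-bit values):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */

int main(void)
{
    int32_t original = -4567;
    uint8_t storage[4];

    /* store: convert to network (big-endian) byte order, then copy the raw bytes */
    uint32_t net = htonl((uint32_t)original);
    memcpy(storage, &net, sizeof net);

    /* retrieve: copy the raw bytes back and convert to host byte order */
    uint32_t back;
    memcpy(&back, storage, sizeof back);
    printf("%d\n", (int32_t)ntohl(back));  /* -4567 */
    return 0;
}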




Answer 3:


This function is a very naive version of a function that converts from big endian to little endian.

The parameter size is not needed, as it only works with 4-byte data.

It can be achieved much more easily by union type punning (and it allows compilers to optimize it - in this case down to a single simple instruction):

#include <stdio.h>
#include <stdint.h>

#define SWAP(a,b,t)    do{t c = (a); (a) = (b); (b) = c;}while(0)

int32_t my_bytecode_to_int32(const uint8_t *bytecode)
{
    union 
    {
        int32_t i32;
        uint8_t b8[4];
    }i32;

    /* store the big-endian input bytes in reverse order -
       this yields the correct value on a little-endian host */
    i32.b8[3] = *bytecode++;
    i32.b8[2] = *bytecode++;
    i32.b8[1] = *bytecode++;
    i32.b8[0] = *bytecode++;

    return i32.i32;
}

int main()
{
    union {
        int32_t i32;
        uint8_t b8[4];
    }i32;


    i32.i32 = -4567;
    SWAP(i32.b8[0], i32.b8[3], uint8_t);
    SWAP(i32.b8[1], i32.b8[2], uint8_t);

    printf("%d\n", bytecode_to_int32(i32.b8, 4));


    i32.i32 = -34;
    SWAP(i32.b8[0], i32.b8[3], uint8_t);
    SWAP(i32.b8[1], i32.b8[2], uint8_t);

    printf("%d\n", my_bytecode_to_int32(i32.b8));
}

https://godbolt.org/z/rb6Na5
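For comparison (my own sketch, not part of the answer), the same reversal can be written with memcpy instead of a union; compilers optimize this to a byte swap as well, and like the union version it assumes a little-endian host:

#include <stdint.h>
#include <string.h>

int32_t my_bytecode_to_int32_memcpy(const uint8_t *bytecode)
{
    uint8_t reversed[4] = { bytecode[3], bytecode[2], bytecode[1], bytecode[0] };
    int32_t value;

    memcpy(&value, reversed, sizeof value);  /* reinterpret the 4 bytes as int32_t */
    return value;
}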




Answer 4:


If the purpose of the code is to sign-extend a 1-, 2-, 3-, or 4-byte sequence in network/big-endian byte order to a signed 32-bit int value, it's doing things the hard way and reimplementing the wheel along the way.

This can be broken down into a three-step process: convert the proper number of bytes to a 32-bit integer value, sign-extend that value out to 32 bits, then convert the 32-bit value from big-endian to the host's byte order.

The "wheel" being reimplemented in this case is the the POSIX-standard ntohl() function that converts a 32-bit unsigned integer value in big-endian/network byte order to the local host's native byte order.

The first step I'd do is to convert 1, 2, 3, or 4 bytes into a uint32_t:

#include <stdint.h>
#include <limits.h>
#include <arpa/inet.h>
#include <errno.h>

// convert the `size` number of bytes starting at the `bytecode` address
// to a uint32_t value
static uint32_t bytecode_to_uint32( const uint8_t *bytecode, size_t size )
{
    uint32_t result = 0;

    switch ( size )
    {
    case 4:
        result = ( uint32_t ) bytecode[ size - 4 ] << 24;
        // fall through
    case 3:
        result += ( uint32_t ) bytecode[ size - 3 ] << 16;
        // fall through
    case 2:
        result += ( uint32_t ) bytecode[ size - 2 ] << 8;
        // fall through
    case 1:
        result += bytecode[ size - 1 ];
        break;
    default:
        // error handling here
        break;
    }

    return( result );
}

Then, sign-extend it (borrowing from this answer):

static uint32_t sign_extend_uint32( uint32_t in, size_t size )
{
    if ( size == 4 )
    {
        return( in );
    }

    // being pedantic here - the existence of `[u]int32_t` pretty
    // much ensures 8 bits/byte
    size_t bits = size * CHAR_BIT;

    uint32_t m = 1U << ( bits - 1 );

    uint32_t result = ( in ^ m ) - m;
    return ( result );
}

Put it all together:

static int32_t  bytecode_to_int32( const uint8_t *bytecode, size_t size )
{
    uint32_t result = bytecode_to_uint32( bytecode, size );

    result = sign_extend_uint32( result, size );

    // set endianness from network/big-endian to
    // whatever this host's endianness is
    result = ntohl( result );

    // converting uint32_t here to signed int32_t
    // can be subject to implementation-defined
    // behavior
    return( result );
}

Note that the conversion from uint32_t to int32_t implicitly performed by the return statement in the above code can result in implementation-defined behavior, as there are uint32_t values that cannot be mapped to int32_t values. See this answer.
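One way to avoid relying on that implementation-defined conversion is to handle values above INT32_MAX explicitly; this is a sketch of a common idiom, not part of the answer:

#include <stdint.h>

/* Convert a uint32_t holding a 2's complement bit pattern to int32_t
   without an out-of-range (implementation-defined) conversion. */
static int32_t uint32_to_int32(uint32_t u)
{
    if (u <= (uint32_t)INT32_MAX)
        return (int32_t)u;                  /* value fits as-is */

    /* u represents a negative number: subtract 2^32 in two well-defined steps */
    return (int32_t)(u - (uint32_t)INT32_MAX - 1u) + INT32_MIN;
}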

Any decent compiler should inline those small functions and optimize them well.

I personally think this also needs much better error handling/input validation.



Source: https://stackoverflow.com/questions/60092227/bitwise-operation-in-c-language-0x80-0xff
