Reading from an unaligned uint8_t recast as a uint32_t array - not getting all values

Posted by 一世执手 on 2019-12-02 03:22:41

If you want the 32-bit value that starts at byte 2 (that is, bytes 2 through 5), you're going to have to combine multiple aligned loads to get what you want. On a little-endian layout:

uint32_t *ptr = ...;
uint32_t value = (ptr[0] >> 16) | (ptr[1] << 16);

Technically, this is also the portable way to do things in C in general, but we're all spoiled because you don't have to do the extra work on x86, ARM, Power, or other common architectures.
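
The two-load combination above generalizes to any byte offset. Here is a minimal host-side C++ sketch, assuming a little-endian byte layout and C++14 or later. The name load_u32_at_byte is hypothetical and not part of the answer. When the offset is misaligned it reads the following word too, so the buffer must contain that word:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original answer): fetch the 32-bit
// little-endian value that starts at byte_offset, using only aligned
// 32-bit loads from `words`.
// Assumption: when byte_offset is not a multiple of 4, words[index + 1]
// must be readable.
constexpr std::uint32_t load_u32_at_byte(const std::uint32_t* words,
                                         std::size_t byte_offset)
{
    const std::size_t index = byte_offset / 4;
    const unsigned shift = static_cast<unsigned>(byte_offset % 4) * 8;
    const std::uint32_t low = words[index];
    // An aligned offset needs one load only; returning early also avoids
    // shifting a 32-bit value by 32, which is undefined behavior.
    if (shift == 0) { return low; }
    const std::uint32_t high = words[index + 1];
    return (low >> shift) | (high << (32 - shift));
}
```

With byte_offset equal to 2 this reduces to exactly (ptr[0] >> 16) | (ptr[1] << 16).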

tera

While unaligned accesses are not allowed in CUDA, the prmt PTX instruction has a handy mode to emulate the effect of unaligned reads within registers. This can be exposed with a bit of inline PTX assembly. If you can tolerate a read past the end of the array, the code becomes quite simple:

// WARNING! Reads past ptr!
__device__ uint32_t read_unaligned(void* ptr)
{
    uint32_t result;
    asm("{\n\t"
        "   .reg .b64    aligned_ptr;\n\t"
        "   .reg .b32    low, high, alignment;\n\t"
        "   and.b64      aligned_ptr, %1, 0xfffffffffffffffc;\n\t"
        "   ld.u32       low, [aligned_ptr];\n\t"
        "   ld.u32       high, [aligned_ptr+4];\n\t"
        "   cvt.u32.u64  alignment, %1;\n\t"
        "   prmt.b32.f4e %0, low, high, alignment;\n\t"
        "}"
        : "=r"(result) : "l"(ptr));
    return result;
}

To ensure the access past the end of the array remains harmless, round up the number of allocated bytes to a multiple of 4, and add another 4 bytes.
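
The padding rule can be written as a small size computation. This is only a sketch. The name padded_allocation_size is hypothetical, and it assumes the start of the allocation is itself 4-byte aligned (which is the case for memory returned by cudaMalloc):

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes to allocate so that read_unaligned
// may touch the aligned word after the one holding the last data byte.
// Rounds num_bytes up to a multiple of 4, then adds one more 4-byte word.
constexpr std::size_t padded_allocation_size(std::size_t num_bytes)
{
    return ((num_bytes + 3) / 4) * 4 + 4;
}
```

For example, a 10-byte array would be allocated as 16 bytes: 12 after rounding up, plus 4.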

The read_unaligned device function above has the same effect as the following code on a little-endian host that tolerates unaligned accesses:

__host__ uint32_t read_unaligned_host(void* ptr)
{
    return *(uint32_t*)ptr;
}
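
For readers unfamiliar with prmt, the f4e ("forward 4 extract") mode used in the device code can be modeled on the host. The function below is an emulation written for illustration, not the instruction itself, and the name prmt_f4e_emulated is hypothetical. It treats the two source registers as one 8-byte value and extracts 4 consecutive bytes starting at the byte index given by the two lowest bits of the control operand:

```cpp
#include <cstdint>

// Host-side model of prmt.b32.f4e (illustrative emulation only).
// `low` supplies bytes 0..3 and `high` supplies bytes 4..7 of an 8-byte
// window. Only the two least significant bits of `control` are used,
// which is why the device code can pass the low 32 bits of the address.
constexpr std::uint32_t prmt_f4e_emulated(std::uint32_t low,
                                          std::uint32_t high,
                                          std::uint32_t control)
{
    const std::uint64_t window =
        (static_cast<std::uint64_t>(high) << 32) | low;
    return static_cast<std::uint32_t>(window >> (8 * (control & 3u)));
}
```

With low = 0x33221100 and high = 0x77665544, a control value whose two lowest bits equal 2 yields 0x55443322, the same result as (ptr[0] >> 16) | (ptr[1] << 16).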

As @DietrichEpp suggests, you'll have to make two loads; and as @tera suggests, you can combine these two loads generically and cheaply using the prmt PTX instruction, even when the misalignment is not known in advance (i.e. when the initial address of uint8Array is arbitrary).

I'll offer a solution based on @tera's which will let you do:

value = read_unaligned((const uint32_t*) &uint8Array[offset]);

safely and (relatively) efficiently. It contains only one inline PTX assembly instruction, and it also provides an "unsafe" variant if you need it:

#include <cstdint>
#include <cuda_runtime_api.h>

__device__ __forceinline__ uint32_t prmt_forward_4_extract(
    uint32_t first_word,
    uint32_t second_word, 
    uint32_t control_bits)
{
    uint32_t result;
    asm("prmt.b32.f4e %0, %1, %2, %3;"
        : "=r"(result)
        : "r"(first_word), "r"(second_word), "r"(control_bits) );
    return result;
}

/*
 * This unsafe, faster variant may read past the 32-bit naturally-aligned
 * word containing the last relevant byte
 */
__device__ inline uint32_t read_unaligned_unsafe(const uint32_t* __restrict__ ptr)
{
    /*
     *  Clear the bottom 2 bits of the address, making the result aligned 
     *  for the purposes of reading a 32-bit (= 4-byte) value
     */
    auto aligned_ptr  = (uint32_t*) ((uint64_t) ptr & ~((uint64_t) 0x3));
    auto first_value  = *aligned_ptr;
    auto second_value = *(aligned_ptr + 1);

    auto lower_word_of_ptr = (uint32_t)((uint64_t)(ptr));

    return prmt_forward_4_extract(first_value, second_value, lower_word_of_ptr);
}

__device__ inline uint32_t read_unaligned(const uint32_t* __restrict__ ptr)
{
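    /* Parenthesize the mask: == binds tighter than &, so without the extra parentheses the test is always false */
    auto ptr_is_already_aligned = (((uint64_t)(ptr) & 0x3) == 0);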
    if (ptr_is_already_aligned) { return *ptr; }
    return read_unaligned_unsafe(ptr);
}