How to pack bits (efficiently) in CUDA?

孤独总比滥情好 2020-12-20 01:34

I have an array of bytes where each byte is either 0 or 1. Now I want to pack these values into bits, so that 8 original bytes occupy 1 target byte, with original byte 0 goi

2 Answers
  • 2020-12-20 01:53

    For two bits per thread, with the output declared as uint2 *pOutput:

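    int lane = tid % warpSize;
    uint2 target;
    // Lane L fetches the value held by lane L/2 (or L/2 + 16 for the upper word)
    // and tests bit 0 on even lanes, bit 1 on odd lanes.
    target.x = __ballot(__shfl(packing[tid], lane / 2)                & ((lane & 1) + 1));
    target.y = __ballot(__shfl(packing[tid], lane / 2 + warpSize / 2) & ((lane & 1) + 1));
    pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = target;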
    

    You'll have to benchmark whether this is still faster than your conventional solution.
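
    Since CUDA 9 the unsuffixed __ballot and __shfl intrinsics are deprecated in favor of __ballot_sync and __shfl_sync, which take an explicit lane mask. The following is a minimal sketch of the same two-bits-per-thread idea using those intrinsics. The names packTwoBitsKernel and packTwoBitsHost, the global-memory input `in` (standing in for packing[tid]), the bounds check, and the block size of 256 are assumptions made for illustration and are not part of the answer above.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Assumption: each input element carries two payload bits (values 0..3).
// One warp therefore produces 64 output bits, stored as one uint2.
__global__ void packTwoBitsKernel(const uint8_t* in, uint2* out, int n)
{
    const unsigned int full = 0xffffffffu;      // all 32 lanes participate
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % warpSize;

    // Out-of-range lanes contribute 0 but still execute the warp intrinsics.
    int value = (gid < n) ? (in[gid] & 3) : 0;
    int mask  = (lane & 1) + 1;                 // even lane -> bit 0, odd lane -> bit 1

    uint2 target;
    // Lower word: lanes 0..15 supply the values; upper word: lanes 16..31.
    target.x = __ballot_sync(full, __shfl_sync(full, value, lane / 2) & mask);
    target.y = __ballot_sync(full, __shfl_sync(full, value, lane / 2 + warpSize / 2) & mask);

    if (lane == 0 && gid < n) {
        out[gid / warpSize] = target;           // one store per warp
    }
}

// Hypothetical host helper: copies the input to the device, launches the
// kernel, and copies ceil(n / 32) uint2 words back. Returns true on success.
bool packTwoBitsHost(const uint8_t* hostIn, uint2* hostOut, int n)
{
    if (n <= 0) return false;
    const int words = (n + 31) / 32;
    uint8_t* dIn = nullptr;
    uint2* dOut = nullptr;
    if (cudaMalloc(&dIn, n) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, words * sizeof(uint2)) != cudaSuccess) {
        cudaFree(dIn);
        return false;
    }
    cudaMemcpy(dIn, hostIn, n, cudaMemcpyHostToDevice);

    const int block = 256;                      // must be a multiple of the warp size
    const int grid  = (n + block - 1) / block;
    packTwoBitsKernel<<<grid, block>>>(dIn, dOut, n);

    cudaError_t err = cudaMemcpy(hostOut, dOut, words * sizeof(uint2),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess;
}
```

    In this sketch the value held by lane k ends up in bits 2k and 2k+1 of target.x for k < 16, and in bits 2(k-16) and 2(k-16)+1 of target.y otherwise. Whether that ordering matches the layout you need is something to check against your own format.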

  • 2020-12-20 02:13

    The __ballot() warp-voting function comes in quite handy for this. Assuming that you can redefine pOutput to be of type uint32_t, and that your block size is a multiple of the warp size (32):

    unsigned int target = __ballot(packing[tid]);
    if (tid % warpSize == 0) {
        pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = target;
    }
    

    Strictly speaking, the if conditional isn't even necessary, as all threads of the warp will write the same data to the same address. So a highly optimized version would just be

    pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = __ballot(packing[tid]);
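
    For toolkits from CUDA 9 onward, where __ballot is deprecated in favor of __ballot_sync, the same approach might look like the sketch below. It is written as a complete kernel plus a host helper so it can be tried in isolation. The names packBitsKernel and packBitsHost, the global-memory input `in` (standing in for packing[tid]), the bounds check, and the block size of 256 are assumptions for illustration only.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Packs n input bytes (each 0 or 1) into ceil(n / 32) 32-bit words.
// Bit i of each word is the vote of lane i, as returned by __ballot_sync.
__global__ void packBitsKernel(const uint8_t* in, uint32_t* out, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Out-of-range lanes vote 0 so that every lane of the warp reaches the ballot.
    int bit = (gid < n) ? in[gid] : 0;
    unsigned int word = __ballot_sync(0xffffffffu, bit);

    // One store per warp; dropping the lane test would also work, since all
    // lanes hold the same word, but then the bounds check must cover every lane.
    if ((threadIdx.x % warpSize) == 0 && gid < n) {
        out[gid / warpSize] = word;
    }
}

// Hypothetical host helper: copies the input to the device, launches the
// kernel, and copies the packed words back. Returns true on success.
bool packBitsHost(const uint8_t* hostIn, uint32_t* hostOut, int n)
{
    if (n <= 0) return false;
    const int words = (n + 31) / 32;
    uint8_t* dIn = nullptr;
    uint32_t* dOut = nullptr;
    if (cudaMalloc(&dIn, n) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, words * sizeof(uint32_t)) != cudaSuccess) {
        cudaFree(dIn);
        return false;
    }
    cudaMemcpy(dIn, hostIn, n, cudaMemcpyHostToDevice);

    const int block = 256;                      // a multiple of the warp size, as the answer requires
    const int grid  = (n + block - 1) / block;
    packBitsKernel<<<grid, block>>>(dIn, dOut, n);

    cudaError_t err = cudaMemcpy(hostOut, dOut, words * sizeof(uint32_t),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess;
}
```

    The full mask 0xffffffff is only valid when all 32 lanes of the warp exist and reach the call, which is why the sketch keeps the block size a multiple of 32 and lets out-of-range threads vote 0 instead of returning early.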
    