When is padding for shared memory really required?

深忆病人 2020-12-30 17:47

I am confused by two documents from NVIDIA. "CUDA Best Practices" describes that shared memory is organized in banks, and in general in 32-bit mode each 4 bytes is a bank (t

2 answers
  •  暖寄归人
    2020-12-30 18:24

    You might be interested in this webinar from the NVIDIA CUDA webinar page. Shared memory, including banks, is also described on slides 35-45 of this webinar.

    In general, shared memory bank conflicts can occur any time two different threads attempt to access (from the same kernel instruction) locations within shared memory for which the lower 4 bits (pre-cc2.0 devices) or 5 bits (cc2.0 and newer devices) of the 32-bit word index (the byte address divided by 4) are the same. When a bank conflict does occur, the shared memory system serializes the accesses to locations that are in the same bank, thus reducing performance. Padding attempts to avoid this for some access patterns. Note that for cc2.0 and newer, if all the bits are the same (i.e. the same location), this does not cause a bank conflict.

    Pictorially, we can look at it like this:

    __shared__ int A[2048];
    int my;
    my = A[0]; // A[0] is in bank 0
    my = A[1]; // A[1] is in bank 1
    my = A[2]; // A[2] is in bank 2
    ...
    my = A[31]; // A[31] is in bank 31 (cc2.0 or newer device)
    my = A[32]; // A[32] is in bank 0
    my = A[33]; // A[33] is in bank 1
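
The mapping in the listing above can be written down as plain arithmetic. The following is a minimal host-side C++ sketch, not device code; it assumes 32 banks of 4 bytes each (cc2.0 or newer, 32-bit mode) and that the array starts at the beginning of bank 0. The names `bank_of_word` and `bank_of_byte_address` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumed model: 32 banks, each one 32-bit word (4 bytes) wide.
constexpr std::size_t kNumBanks = 32;
constexpr std::size_t kBankWidthBytes = 4;

// Bank holding element `word_index` of a shared int array that
// is assumed to start at bank 0.
std::size_t bank_of_word(std::size_t word_index) {
    return word_index % kNumBanks;
}

// Same mapping, starting from a byte offset into shared memory.
std::size_t bank_of_byte_address(std::size_t byte_address) {
    assert(byte_address % kBankWidthBytes == 0);  // word-aligned access only
    return (byte_address / kBankWidthBytes) % kNumBanks;
}
```

Under these assumptions, `bank_of_word(31)` is 31 and `bank_of_word(32)` wraps back to 0, matching the comments in the listing.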
    

    Now, if we access shared memory across threads in a warp, we may hit bank conflicts:

    my = A[threadIdx.x];    // no bank conflicts or serialization - handled in one trans.
    my = A[threadIdx.x*2];  // 2-way bank conflicts - will cause 2 level serialization
    my = A[threadIdx.x*32]; // 32-way bank conflicts - will cause 32 level serialization
    

    Let's take a closer look at the 2-way bank conflict above. Since we are multiplying threadIdx.x by 2, thread 0 accesses location 0 in bank 0 but thread 16 accesses location 32 which is also in bank 0, thus creating a bank conflict. For the 32-way example above, all the addresses correspond to bank 0. Thus 32 transactions to shared memory must occur to satisfy this request, as they are all serialized.
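
To check the serialization levels quoted above without a GPU, here is a small host-side C++ model. It is only a sketch: it assumes a 32-thread warp, 32 banks of one word each, and the access pattern `A[threadIdx.x * stride]`. The function name `conflict_degree` is hypothetical. Threads that read the very same word are counted once, which reflects the note that identical locations do not conflict on cc2.0 and newer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Worst-case number of distinct words that land in one bank when
// thread t of a warp reads word index t * stride.
// 1 means conflict-free, n means n-way serialization (in this model).
std::size_t conflict_degree(std::size_t stride,
                            std::size_t num_banks = 32,
                            std::size_t warp_size = 32) {
    assert(num_banks > 0);
    // For every bank, collect the distinct word indices requested from it.
    std::vector<std::set<std::size_t>> words_in_bank(num_banks);
    for (std::size_t t = 0; t < warp_size; ++t) {
        const std::size_t word = t * stride;
        words_in_bank[word % num_banks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : words_in_bank) {
        worst = std::max<std::size_t>(worst, words.size());
    }
    return worst;
}
```

In this model, strides of 1, 2 and 32 give degrees of 1, 2 and 32 respectively, the same figures as in the comments on the three access patterns.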

    So to answer the question, if I knew that my access patterns would be like this for example:

    my = A[threadIdx.x*32]; 
    

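    Then I might want to pad my data storage with one dummy/pad location after every 32 data elements, so that A[32] is a pad location, as are A[65], A[98], etc. Each data element then sits at its original index plus one for every 32 elements that precede it, and I could fetch the same data like this: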

    my = A[threadIdx.x*33]; 
    

    And get my data with no bank conflicts.
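
The index arithmetic behind the padded access can be sketched on the host as follows. This is an illustrative C++ model, assuming 32 banks of one word each and one unused pad word inserted after every 32 data words; `padded_index` and `padded_column_is_conflict_free` are hypothetical helper names, not part of any CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

constexpr std::size_t kNumBanks = 32;  // assumed: cc2.0 or newer, 32-bit mode

// Maps a logical element index to its position in a padded array that
// has one unused word after every 32 data words.
std::size_t padded_index(std::size_t logical_index) {
    return logical_index + logical_index / kNumBanks;
}

// True if 32 threads, each reading logical element t * 32, touch
// 32 different banks once the padding is applied.
bool padded_column_is_conflict_free() {
    std::set<std::size_t> banks;
    for (std::size_t t = 0; t < 32; ++t) {
        const std::size_t physical = padded_index(t * 32);
        assert(physical == t * 33);  // same as indexing with threadIdx.x * 33
        banks.insert(physical % kNumBanks);
    }
    return banks.size() == 32;
}
```

Note that the padded array has to be declared larger than the unpadded one: with this layout, every 32 data words occupy 33 words of shared memory.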

    Hope this helps.
