OpenCL global memory fetches

予麋鹿 2021-01-03 06:10

I am thinking about reworking my GPU OpenCL kernel to speed things up. The problem is there is a lot of global memory that is not coalesced and fetches are really bringing d

3 answers
  •  孤独总比滥情好
    2021-01-03 06:49

    You can call clGetDeviceInfo with CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE to find out the cacheline size for a device. The value varies by device: the discussion here assumes 16 bytes (four floats), but many GPUs report 64 or 128 bytes, so query it rather than hard-coding it.

    Small reads can be troublesome, but if you are reading from the same cacheline, you should be fine. The short answer: you need to keep your 'small chunks' close together in memory to keep it fast.
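To make "same cacheline" concrete, here is a minimal host-side sketch in plain C (not kernel code). The helper names `cacheline_of` and `same_cacheline` are hypothetical, and the sketch assumes the buffer starts on a cacheline boundary. The line size is passed in as a parameter; in practice it would be whatever the device reports for CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the cacheline containing the byte at `byte_offset`.
   Assumption: the buffer itself starts on a cacheline boundary. */
size_t cacheline_of(size_t byte_offset, size_t line_bytes)
{
    assert(line_bytes > 0);
    return byte_offset / line_bytes;
}

/* Returns 1 if two float elements of the same buffer share a cacheline,
   0 otherwise. `elem_a` and `elem_b` are element indices, not byte offsets. */
int same_cacheline(size_t elem_a, size_t elem_b, size_t line_bytes)
{
    return cacheline_of(elem_a * sizeof(float), line_bytes)
        == cacheline_of(elem_b * sizeof(float), line_bytes);
}
```

With a 16-byte line and 4-byte floats, elements 0 and 3 share a line, while elements 3 and 4 do not. Small reads that land in the same line can be served by one fetch.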

    I have two functions below to demonstrate two ways to access the memory: vectorAddFoo and vectorAddBar. A third function, copySomeMemory(...), applies to your question specifically. In both vector functions, each work item adds a portion of the vectors, but the memory access patterns differ. vectorAddFoo gives each work item a contiguous block of elements: it starts at its calculated position in the arrays and moves forward through its workload. vectorAddBar has each work item start at its gid and skip gSize (= global size) elements before fetching and adding the next element.

    vectorAddBar will execute faster because the reads and writes of adjacent work items fall into the same cacheline. Every 4 float reads fall on the same cacheline and take only one action from the memory controller. After reading a[] and b[] in this manner, all four work items can do their addition and queue their writes to c[].

    vectorAddFoo guarantees that the reads and writes of adjacent work items are not in the same cacheline (except for very short vectors, roughly totalElements < 5). Every read from a work item requires its own action from the memory controller. Unless the GPU caches the following 3 floats in every case, this results in 4x the memory accesses.
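The difference between the two patterns can be checked on the host with a small simulation. This is a sketch in plain C, not kernel code: `foo_index`, `bar_index` and `lines_touched` are hypothetical helpers, the index formulas follow the descriptions of vectorAddFoo and vectorAddBar given above, and the buffers are assumed to start on a cacheline boundary. It counts how many distinct cachelines a group of adjacent work items touches on one loop iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Element read by work item `gid` on iteration `i` when each work item
   owns one contiguous block (the vectorAddFoo pattern). */
size_t foo_index(size_t gid, size_t i, size_t elems_per_item)
{
    return gid * elems_per_item + i;
}

/* Element read by work item `gid` on iteration `i` when work items
   interleave and stride by the global size (the vectorAddBar pattern). */
size_t bar_index(size_t gid, size_t i, size_t global_size)
{
    return i * global_size + gid;
}

/* Number of distinct cachelines touched on iteration `i` by `group`
   adjacent work items starting at `first_gid`. Both index formulas grow
   with gid, so counting changes of line number counts distinct lines. */
size_t lines_touched(int use_bar, size_t first_gid, size_t group, size_t i,
                     size_t elems_per_item, size_t global_size,
                     size_t floats_per_line)
{
    assert(group > 0 && floats_per_line > 0);
    size_t count = 0;
    size_t prev_line = 0;
    for (size_t k = 0; k < group; ++k) {
        size_t gid = first_gid + k;
        size_t idx = use_bar ? bar_index(gid, i, global_size)
                             : foo_index(gid, i, elems_per_item);
        size_t line = idx / floats_per_line;
        if (k == 0 || line != prev_line) {
            ++count;
            prev_line = line;
        }
    }
    return count;
}
```

For example, with 1024 elements, a global size of 64 (16 elements per work item) and 4 floats per cacheline, `lines_touched(0, 0, 4, 0, 16, 64, 4)` gives 4 for the block pattern and `lines_touched(1, 0, 4, 0, 16, 64, 4)` gives 1 for the strided pattern, which is the 4x difference described above.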

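    __kernel void
    vectorAddFoo(__global const float * a,
                 __global const float * b,
                 __global       float * c,
                 const int totalElements)
    {
      int gid = get_global_id(0);
      // Each work item owns one contiguous block of the vectors.
      int elementsPerWorkItem = totalElements / get_global_size(0);
      int start = elementsPerWorkItem * gid;

      // Assumed loop body, following the description above:
      // walk the block element by element.
      for (int i = 0; i < elementsPerWorkItem; i++) {
        c[start + i] = a[start + i] + b[start + i];
      }
    }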
