OpenCL global memory fetches

予麋鹿 2021-01-03 06:10

I am thinking about reworking my GPU OpenCL kernel to speed things up. The problem is there is a lot of global memory that is not coalesced and fetches are really bringing d

3 answers

孤独总比滥情好 (OP)

2021-01-03 06:49
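You can call clGetDeviceInfo with CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE to find out the global memory cacheline size for a device. On many devices today, this value is typically 16 bytes. The discussion below assumes that value, which holds four floats; query your own device, since other sizes (64 or 128 bytes, for instance) are also reported.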

Small reads can be troublesome, but if you are reading from the same cacheline, you should be fine. The short answer: you need to keep your 'small chunks' close together in memory to keep it fast.

I have two functions below to demonstrate two ways to access the memory: vectorAddFoo and vectorAddBar. The third function, copySomeMemory(...), applies to your question specifically. In both vector functions each work item adds a portion of the two vectors, but the memory access patterns differ. vectorAddFoo has each work item process a contiguous block of vector elements, starting at its calculated position in the arrays and moving forward through its workload. vectorAddBar has each work item start at its gid and skip gSize (= global size) elements before fetching and adding the next element.

vectorAddBar will execute faster because the reads and writes of neighboring work items fall into the same cacheline in memory. With a 16-byte cacheline, every four consecutive float reads land on the same cacheline and take only one action from the memory controller. After reading a[] and b[] in this manner, all four work items can do their addition and queue their write to c[].

vectorAddFoo practically guarantees that the reads and writes of neighboring work items are not in the same cacheline (except for very short vectors, roughly totalElements < 5). Every read from a work item requires its own action from the memory controller. Unless the GPU caches the following 3 floats in every case, this results in 4x the memory accesses.
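
To make the 4x estimate concrete, here is a minimal plain C sketch (not OpenCL, and not part of the original answer) that counts how many distinct cachelines a small group of adjacent work items touches in one loop iteration under each pattern. The function names, the 64-item cap, and the assumption that the array starts on a cacheline boundary are all illustrative; real hardware may coalesce differently.

```c
#include <assert.h>
#include <stddef.h>

/* Cacheline index holding float element `index`, assuming the
   array starts on a cacheline boundary (an assumption of this sketch). */
size_t cacheline_of(size_t index, size_t line_bytes)
{
    assert(line_bytes >= sizeof(float));
    return (index * sizeof(float)) / line_bytes;
}

/* Count distinct cachelines touched when work items 0..group-1 each
   read one element during loop iteration `iter`.
   blocked != 0: item g reads g * per_item + iter     (vectorAddFoo pattern)
   blocked == 0: item g reads g + iter * global_size  (vectorAddBar pattern) */
size_t lines_touched(int blocked, size_t group, size_t iter,
                     size_t per_item, size_t global_size,
                     size_t line_bytes)
{
    size_t seen[64];
    size_t count = 0;
    assert(group <= 64);
    for (size_t g = 0; g < group; ++g) {
        size_t idx = blocked ? g * per_item + iter
                             : g + iter * global_size;
        size_t line = cacheline_of(idx, line_bytes);
        int duplicate = 0;
        for (size_t k = 0; k < count; ++k) {
            if (seen[k] == line) { duplicate = 1; break; }
        }
        if (!duplicate) seen[count++] = line;
    }
    return count;
}
```

With a 16-byte line, 4 work items, 256 elements per work item, and a global size of 1024, `lines_touched(1, 4, 0, 256, 1024, 16)` gives 4 while `lines_touched(0, 4, 0, 256, 1024, 16)` gives 1, which matches the ratio described above.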
```
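__kernel void
vectorAddFoo(__global const float * a,
             __global const float * b,
             __global       float * c,
             const int totalElements)
{
  int gid = get_global_id(0);
  // Each work item handles one contiguous block of elements.
  int elementsPerWorkItem = totalElements / get_global_size(0);
  int start = elementsPerWorkItem * gid;

  // The source is cut off after "for(int i=0;i"; the loop bound and body
  // below are reconstructed from the description above, not the original text.
  for (int i = 0; i < elementsPerWorkItem; i++) {
    c[start + i] = a[start + i] + b[start + i];
  }
}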
```
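
The vectorAddBar kernel described in the answer is not shown in this section. As a rough illustration of its strided pattern only, the following plain C sketch emulates it on the CPU. The names `vector_add_strided_item` and `vector_add_strided` are hypothetical, and `gid` and `global_size` are ordinary parameters here because `get_global_id()` and `get_global_size()` exist only inside a kernel.

```c
#include <assert.h>
#include <stddef.h>

/* CPU stand-in for one work item of the strided pattern:
   start at gid, then step by the global size. */
void vector_add_strided_item(const float *a, const float *b, float *c,
                             size_t total, size_t gid, size_t global_size)
{
    assert(global_size > 0 && gid < global_size);
    for (size_t i = gid; i < total; i += global_size) {
        c[i] = a[i] + b[i];
    }
}

/* Run every emulated work item in turn (a GPU would run them in parallel). */
void vector_add_strided(const float *a, const float *b, float *c,
                        size_t total, size_t global_size)
{
    for (size_t gid = 0; gid < global_size; ++gid) {
        vector_add_strided_item(a, b, c, total, gid, global_size);
    }
}
```

In any given iteration, work items gid and gid + 1 touch neighboring elements, which is what lets their accesses share a cacheline. Because the loop tests `i < total`, this form also covers element counts that are not a multiple of the global size, whereas the block-based version drops the remainder of its integer division.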