问题
I am thinking about reworking my GPU OpenCL kernel to speed things up. The problem is there is a lot of global memory that is not coalesced and fetches are really bringing down the performance. So I am planning to copy as much of the global memory into local but I have to pick what to copy.
Now my question is: Do many fetches of small chunks of memory hurt more than fewer fetches of larger chunks?
回答1:
You can use clGetDeviceInfo to find out what the cacheline size is for a device. (clGetDeviceInfo, CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE) On many devices today, this value is typically 16 bytes.
Small reads can be troublesome, but if you are reading from the same cacheline, you should be fine. The short answer: you need to keep your 'small chunks' close together in memory to keep it fast.
I have two functions below to demonstrate two ways to access the memory -- vectorAddFoo, and vectorAddBar. The third function copySomeMemory(...) applies to your question specifically. Both vector functions have their work items add a portion of the vectors being added, but use different memory access patterns. vectorAddFoo gets each work item to process a block of vector elements, starting at its calculated position in the arrays, and moving forward through its workload. vectorAddBar has work items start at their gid and skip gSize (= global size) elements before fetching and adding the next elements.
vectorAddBar will execute faster because of the reads and writes falling into the same cacheline in memory. Every 4 float reads will fall on the same cacheline, and take only one action from the memory controller to perform. After reading a[] and b[] in this matter, all four work items will be able to do their addition, and queue their write to c[].
vectorAddFoo will guarantee the reads and writes are not in the same cacheline (except for very short vectors ~totalElements<5). Every read from a work item will require an action from the memory controller. Unless the gpu caches the following 3 floats in every case, this will result in 4x the memory access.
__kernel void
vectorAddFoo(__global const float * a,
__global const float * b,
__global float * c,
__global const totalElements)
{
int gid = get_global_id(0);
int elementsPerWorkItem = totalElements/get_global_size(0);
int start = elementsPerWorkItem * gid;
for(int i=0;i<elementsPerWorkItem;i++){
c[start+i] = a[start+i] + b[start+i];
}
}
__kernel void
vectorAddBar(__global const float * a,
__global const float * b,
__global float * c,
__global const totalElements)
{
int gid = get_global_id(0);
int gSize = get_global_size(0);
for(int i=gid;i<totalElements;i+=gSize){
c[i] = a[i] + b[i];
}
}
__kernel void
copySomeMemory(__global const int * src,
__global const count,
__global const position)
{
//copy 16kb of integers to local memory, starting at 'position'
int start = position + get_local_id(0);
int lSize = get_local_size(0);
__local dst[4096];
for(int i=0;i<4096;i+=lSize ){
dst[start+i] = src[start+i];
}
barrier(CLK_GLOBAL_MEM_FENCE);
//use dst here...
}
回答2:
In general, fewer fecthes of larger size will be more efficient. I can't give you specific advice without seeing your code, but make sure to access sequential chunks from the work-items to enable 'streaming'. Do any transpositions or random memory accesses after you bring the data into local memory.
回答3:
I am not able to understand you question properly , but if you have large global access and if those are re-used than use use local memory.
Note:small local work size less data shared so no use, large local work size less parallel threads . So you need to select the best one.
来源:https://stackoverflow.com/questions/9912385/opencl-global-memory-fetches