Arranging memory for OpenCL

Question

I have about 10 numpy arrays of n items. The OpenCL work-item with global id i only looks at the ith element of each array. How should I arrange the memory?

I was thinking of interleaving the arrays on the graphics card, but I'm not sure if this will have any performance gains since I don't understand the workgroup memory access pattern.

Answer 1:

I'm not familiar with numpy; however, if:

the thread with global id i looks at the ith element (as you mentioned),
the data type has proper memory alignment (4, 8, or 16 bytes), and
each thread reads 32, 64, or 128 bits at once,

you should be able to achieve optimal memory throughput because of coalesced memory access. In this case interleaving won't bring any performance gain.

If one of the last two points is not fulfilled and interleaving would let you satisfy it, you could see a performance gain.
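
To make the coalescing argument a bit more concrete, here is a rough host-side C sketch (plain C, not OpenCL kernel code) that counts how many aligned memory segments a group of neighbouring threads touches. The function name `segments_touched`, the group size of 32, and the 128-byte segment size are assumptions chosen for illustration; real devices differ in these numbers and in how they merge requests.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many aligned memory segments are touched when thread k of a
 * group reads elem_size bytes starting at byte address base + k * stride.
 * Fewer segments means the reads are easier to merge into few transfers.
 * Illustrative model only; actual hardware rules vary by device. */
size_t segments_touched(size_t base, size_t stride, size_t elem_size,
                        size_t group_size, size_t segment_size)
{
    assert(elem_size > 0 && segment_size > 0);

    size_t count = 0;
    size_t last_seg = 0;
    int seen = 0; /* becomes 1 once the first segment has been counted */

    for (size_t k = 0; k < group_size; ++k) {
        size_t start = base + k * stride;
        size_t first = start / segment_size;
        size_t last = (start + elem_size - 1) / segment_size;

        /* Addresses never decrease with k, so a segment is new exactly
         * when its index is larger than any index counted so far. */
        for (size_t s = first; s <= last; ++s) {
            if (!seen || s > last_seg) {
                ++count;
                last_seg = s;
                seen = 1;
            }
        }
    }
    return count;
}
```

Under these assumptions, `segments_touched(0, 4, 4, 32, 128)` models 32 threads reading consecutive floats from one array and returns 1, whereas `segments_touched(0, 40, 4, 32, 128)` models the same read from ten interleaved float arrays and returns 10.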

EDIT: Struct of Arrays (SoA) vs. Array of Structs (AoS)

This point comes up quite often in the literature. I'll keep it short:

Why is an SoA preferable to an AoS? Imagine you have 10 arrays of a 32-bit data type. The AoS solution would be as follows:

struct Data
{
   float a0;
   float a1; 
   ...
   float a9;
}; // 10 x 32bit = 320 bit 

struct Data array[512];

What will a memory read look like? Each element is 320 bits (40 bytes), so the data is poorly aligned and, without any changes, the memory transfers cannot be coalesced. However, the code needed to read an element is quite short:

struct Data a = array[i];

With some luck the compiler is smart enough to at least merge some of the read instructions. One option would be explicit memory alignment (padding). This will cost you global memory, which is very limited on GPUs.
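
As a rough illustration of the stride and the padding cost, the following host-side C sketch models the AoS element. The names `DataAoS`, `DataAoSPadded`, and `aos_offset`, as well as the choice to pad to 64 bytes, are assumptions made for this example; the ten fields are modelled as an array `a[10]` instead of individually named members.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_FIELDS 10 /* ten 32-bit values per element */

/* Hypothetical AoS element; a[k] stands in for the k-th field. */
struct DataAoS {
    float a[NUM_FIELDS];
};

/* The same element explicitly padded up to 64 bytes (an assumed target). */
struct DataAoSPadded {
    float a[NUM_FIELDS];
    float pad[6]; /* 6 unused floats = 24 wasted bytes per element */
};

/* Byte offset of field k of element i inside an unpadded AoS buffer. */
size_t aos_offset(size_t i, size_t k)
{
    assert(k < NUM_FIELDS);
    return i * sizeof(struct DataAoS)
         + offsetof(struct DataAoS, a)
         + k * sizeof(float);
}

/* Bytes spent on padding for a buffer of n padded elements. */
size_t padding_bytes(size_t n)
{
    return n * (sizeof(struct DataAoSPadded) - sizeof(struct DataAoS));
}
```

With 4-byte floats, `aos_offset(i + 1, k) - aos_offset(i, k)` is 40, so neighbouring threads reading the same field are 40 bytes apart, and `padding_bytes(512)` is 12288 bytes for the 512-element array used above.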

Now the SoA solution:

struct Data
{
    float a0[512];
    float a1[512]; 
    ...
    float a9[512];
};

struct Data array;

Accessing the memory takes a little more code; however, every access can be combined into a coalesced read, and no extra memory alignment is needed. You can also just forget about the struct and use each array as it is, without any performance issues.
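
A minimal host-side C sketch of the SoA access pattern might look as follows. It is plain C standing in for what a kernel would do per thread; `DataSoA`, `soa_offset`, and `work_item_sum` are hypothetical names, and the ten arrays are modelled as one 2D array `a[10][512]` for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_FIELDS 10
#define N 512

/* Hypothetical SoA container: each row is one contiguous array of N floats. */
struct DataSoA {
    float a[NUM_FIELDS][N];
};

/* Byte offset of field k for thread i inside the SoA buffer. */
size_t soa_offset(size_t i, size_t k)
{
    assert(i < N && k < NUM_FIELDS);
    return (k * N + i) * sizeof(float);
}

/* Work done by the thread with global id i: read its element from each
 * of the ten arrays and, as a placeholder computation, sum the values. */
float work_item_sum(const struct DataSoA *d, size_t i)
{
    assert(i < N);
    float sum = 0.0f;
    for (size_t k = 0; k < NUM_FIELDS; ++k) {
        sum += d->a[k][i]; /* neighbouring i values are adjacent in memory */
    }
    return sum;
}
```

Here `soa_offset(i + 1, k) - soa_offset(i, k)` equals `sizeof(float)`, so for any given field the threads with consecutive ids read consecutive addresses.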

Another option is to use vector data types (if your numpy arrays allow this). You can use float2 or float4 (or vectors of other simple data types like int, double ...) to exploit combined memory transfers, i.e. every read from a float4 array would be coalesced into a 128-bit memory transfer, maximizing memory throughput.
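
As a sketch of how four of the separate arrays could be packed for float4 access, consider the host-side C code below. `float4_t` is a hypothetical stand-in for a four-float vector type (an actual OpenCL host program would more likely use `cl_float4` from the OpenCL headers), and `pack_float4` is a name made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a four-component float vector: 4 x 32 bit. */
typedef struct {
    float x, y, z, w;
} float4_t;

/* Pack four separate float arrays of length n into one array of float4_t,
 * so that thread i can fetch all four of its values in a single
 * 128-bit read of out[i]. */
void pack_float4(const float *a, const float *b, const float *c,
                 const float *d, float4_t *out, size_t n)
{
    assert(a && b && c && d && out);
    for (size_t i = 0; i < n; ++i) {
        out[i].x = a[i];
        out[i].y = b[i];
        out[i].z = c[i];
        out[i].w = d[i];
    }
}
```

With ten arrays this packing would cover eight of them as two float4 buffers; how to store the remaining two (for example as a float2 buffer or as plain float arrays) is a design choice the answer leaves open.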

Source: https://stackoverflow.com/questions/18585564/arranging-memory-for-opencl

Tags

opencl