Question
I built a kernel for elementwise multiplication of two matrices, but at least with my configuration the OpenCL kernel is only faster when each matrix is larger than 2 GB. So I was wondering whether that is because of my naive kernel (see below), or because of the nature of elementwise operations, i.e. whether elementwise operations simply don't gain from using GPUs.
Thanks for your input!
kernel:
KERNEL_CODE = """
// Elementwise multiplication: C = A .* B.
__kernel void matrixMul(
    __global float* C,
    __global const float* A,
    __global const float* B,
    int width, int height)
{
    // One work-item per matrix element.
    int x = get_global_id(0);  // column
    int y = get_global_id(1);  // row

    // Guard against a global size that was rounded up past the matrix bounds.
    if (x >= width || y >= height)
        return;

    // Row-major layout: the row stride is the row length (width),
    // so the linear index is y * width + x.
    int idx = y * width + x;
    C[idx] = A[idx] * B[idx];
}
"""
P.S. I have read that some experts think CUDA is too different from OpenCL to answer for both in the same question, so feel free to remove it from the title and tags.
Answer 1:
That sort of operation has N FLOPs but 3N memory transactions (two reads and one write per element), so it will be completely memory-bandwidth bound. There is no scope for data re-use, so the upper bound on the speed-up over the reference CPU version is the ratio of GPU to CPU memory bandwidth. That number is rarely more than 10x, and it can get eroded pretty quickly by the cost of moving the data to and from GPU memory.

Generally speaking, this sort of operation is best "fused" with other O(N) operations to improve performance. You would usually never compute just the Hadamard product in a single kernel; you would do it as part of a series of O(N) operations within one kernel. So, no, this is not a great candidate for speed-up, even if the kernel were optimal.
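To make the fusion point concrete, here is a minimal sketch; the fused operation D = A .* B + C and the kernel name are my own illustration, not something from the question:

// Fused elementwise kernel: D = A .* B + C in a single pass.
// One read each of A, B, C and one write of D per element (4N
// transactions), versus 6N if the product and the sum were
// launched as two separate kernels with an intermediate buffer.
__kernel void fusedMulAdd(
    __global float* D,
    __global const float* A,
    __global const float* B,
    __global const float* C,
    int n)                      // total number of elements
{
    int i = get_global_id(0);
    if (i < n)
        D[i] = A[i] * B[i] + C[i];
}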
And your kernel definitely isn't. You are doing 3 IOPs (integer index arithmetic) for every FLOP, which is a huge penalty. You could definitely do things to improve this, but what exactly will depend completely on what sort of hardware this is going to run on.
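One hardware-agnostic improvement, sketched below (the kernel name and the grid-stride loop are my own illustration), is to stop paying for 2D index arithmetic at all: flatten the launch to 1D so the global ID is the linear index.

// 1D variant: the global ID *is* the linear index, so the
// per-element integer arithmetic all but disappears.
__kernel void elemMul1D(
    __global float* C,
    __global const float* A,
    __global const float* B,
    int n)                      // total number of elements
{
    // Grid-stride loop: also covers n when it exceeds the global size.
    for (int i = get_global_id(0); i < n; i += get_global_size(0))
        C[i] = A[i] * B[i];
}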
Answer 2:
Speaking about element-wise operations: this depends on the device. NVidia GPUs, for example, use scalar processors (with scalar instructions), so no vectorization is necessary. ATI, on the contrary, features 5-wide (or 4-wide) VLIW processors, and for these vectorization is crucial. It can sometimes be performed by the compiler rather than by using vector data types directly in code, but it is the first thing to do when optimizing for ATI's GPUs.
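As an illustration of the vector-data-type approach, a sketch using float4 (the kernel name is mine, and it assumes the element count is a multiple of 4; otherwise a scalar tail loop is needed):

// Vectorized variant: each work-item multiplies four elements at
// once via float4, which maps well onto ATI's 4/5-wide VLIW lanes.
// Assumes the total element count is a multiple of 4.
__kernel void elemMul4(
    __global float4* C,
    __global const float4* A,
    __global const float4* B,
    int n4)                     // number of float4 elements (n / 4)
{
    int i = get_global_id(0);
    if (i < n4)
        C[i] = A[i] * B[i];     // component-wise multiply
}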
Nevertheless, as talonmies pointed out, the algorithm above is heavily memory-bandwidth bound, and you can't expect much speedup from using the GPU solely for it.
Answer 3:
The kernel you posted should be at least as fast as a CPU one, but you are not using coalesced memory accesses at all!
This is killing your performance.
However, as @talonmies stated, this is not a good case for a GPU: you are losing all your time in the memory copies.
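For contrast, here is a sketch of mine (not from the thread) of what coalesced versus uncoalesced indexing looks like for a row-major width x height matrix:

// Two ways to index a row-major width x height matrix.
__kernel void copyCoalesced(
    __global float* dst,
    __global const float* src,
    int width, int height)
{
    int x = get_global_id(0);   // fastest-varying work-item ID
    int y = get_global_id(1);
    if (x >= width || y >= height)
        return;

    // Coalesced: neighbouring work-items (consecutive x) read
    // consecutive addresses, which the hardware merges into a
    // single wide memory transaction.
    dst[y * width + x] = src[y * width + x];

    // Uncoalesced (shown for contrast, do NOT do this): consecutive
    // work-items would be `height` floats apart, turning one wide
    // transaction into many narrow ones.
    // dst[x * height + y] = src[x * height + y];
}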
Source: https://stackoverflow.com/questions/6045473/elementwise-operations-in-opencl-cuda