Matrix-vector multiplication in CUDA: benchmarking & performance

天大地大妈咪最大 提交于 2019-11-28 20:47:39

First, let me write down the full working Matrix-Vector multiplication kernel employing shared memory:

template<typename T>
__global__ void matvec_kernel(const T * __restrict__ dA, const T * __restrict__ dx, T * __restrict__ dy, const unsigned int nRows, const unsigned int nCols)
{
    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    __shared__ T x_shared[BLOCK_SIZE];

    T y_val = 0.0;

    #pragma unroll
    for (unsigned int m = 0; m < ((nCols + BLOCK_SIZE - 1)/ BLOCK_SIZE); ++m)
    {
        if ((m * BLOCK_SIZE + threadIdx.x) <  nCols) x_shared[threadIdx.x] = dx[threadIdx.x + m * BLOCK_SIZE];
        else                                         x_shared[threadIdx.x] = 0.f;
        __syncthreads();

        #pragma unroll
        for (unsigned int e = 0; e < BLOCK_SIZE; ++e) {
            // --- Column-major ordering - faster
            y_val += dA[tid + (e + BLOCK_SIZE * m) * nRows] * x_shared[e];
            // --- Row-major ordering - slower
            //y_val += dA[tid * nCols + (e + BLOCK_SIZE * m)] * x_shared[e];
        }

        __syncthreads();
    }

    if (tid < nRows) dy[tid] = y_val;

}

Unless differently specified, all the tests will be done on a GT540M card.

A first parameter to be optimized is the BLOCK_SIZE. Changing the BLOCK_SIZE changes the algorithm performance, as witnessed by the following graph:

The following graphs compares row-major ordering vs. column-major ordering. The latter is faster:

Another optimization you may wish to try is using more Instruction Level Parallelism (ILP) by this modified kernel employing ILP = 2

template<typename T>
__global__ void matvec_kernel_ILP2(const T * __restrict__ dA, const T * __restrict__ dx, T * __restrict__ dy, const unsigned int nRows, const unsigned int nCols)
{
    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    __shared__ T x_shared[BLOCK_SIZE];

    T y_val1 = 0.0;
    T y_val2 = 0.0;

    #pragma unroll
    for (unsigned int m = 0; m < ((nCols + BLOCK_SIZE - 1)/ BLOCK_SIZE); ++m)
    {
        if ((m * BLOCK_SIZE + threadIdx.x) <  nCols) x_shared[threadIdx.x] = dx[threadIdx.x + m * BLOCK_SIZE];
        else                                         x_shared[threadIdx.x] = 0.f;
        __syncthreads();

        #pragma unroll
        for (unsigned int e = 0; e < BLOCK_SIZE; ++e) {
            y_val1 += dA[tid + (e + BLOCK_SIZE * m) * nRows] * x_shared[e];
            y_val2 += dA[tid + gridDim.x * BLOCK_SIZE + (e + BLOCK_SIZE * m) * nRows] * x_shared[e];
        }

        __syncthreads();
    }

    if (tid < nRows) dy[tid] = y_val1;
    if ((tid + gridDim.x * BLOCK_SIZE) < nRows) dy[tid + gridDim.x * BLOCK_SIZE] = y_val2;

}

This kernel should be called with half of the threads, as

dim3 dim_grid((nRows/2 + BLOCK_SIZE -1)/ BLOCK_SIZE);
dim3 dim_block(BLOCK_SIZE);
matvec_kernel_ILP2<T> <<<dim_grid, dim_block>>>(dA, dx, dy, nRows, nx);

Finally, since you are using a device with compute capability 3.2, you can try using shuffle operations. I'm providing here the kernel using shuffle operations instead of shared memory. In this case, you should set BLOCK_SIZE = 32:

template<typename T>
__global__ void matvec_kernel_shfl(const T * __restrict__ dA, const T * __restrict__ dx, T * __restrict__ dy, const unsigned int nRows, const unsigned int nCols)
{
    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    T x_shfl_src, x_shfl_dest;

    T y_val = 0.0;

    #pragma unroll
    for (unsigned int m = 0; m < ((nCols + BLOCK_SIZE - 1)/ BLOCK_SIZE); ++m)
    {
        if ((m * BLOCK_SIZE + threadIdx.x) <  nCols) x_shfl_src = dx[threadIdx.x + m * BLOCK_SIZE];
        else                                         x_shfl_src = 0.f;
        __syncthreads();

//        #pragma unroll
        for (int e = 0; e < 32; ++e) {
            // --- Column-major ordering - faster
            x_shfl_dest = __shfl(x_shfl_src, e);
            y_val += dA[tid + (e + BLOCK_SIZE * m) * nRows] * x_shfl_dest;
            // --- Row-major ordering - slower
            //y_val += dA[tid * nCols + (e + BLOCK_SIZE * m)] * x_shared[e];
        }

        __syncthreads();
    }

    if (tid < nRows) dy[tid] = y_val;

}

Shuffle operations improve the performance over shared memory for BLOCK_SIZE = 32 on a Kepler K20c as shown by the graph below:

Looking at your code I think that the way you traverse the elements of A may be the problem:

for (unsigned int e = 0; e < n_star; ++e) {
  y_val += Asub[row + e * nRows] * x_shared[e];
}

So, when nRows becomes large, you actually read from the global memory (that is where A is stored) with a large stride. In particular this happens in every block: threads inside the same block will read from the global memory in a non-consecutive fashion. This can be improved if you consider storing from the beginning the values of A row-by-row (i.e., using row-major order). This is just a guess and I would have written a comment, but it requires a higher score on Stackoverflow...

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!