How to normalize matrix columns in CUDA with max performance?

Asked by 清歌不尽, 2020-12-09 06:28 · unresolved · 3 answers · 450 views

How to effectively normalize matrix columns in CUDA?

My matrix is stored in column-major, and the typical size is 2000x200.

The operation can be represented as B(i,j) = exp(A(i,j)) / Σ_k exp(A(k,j)): each entry is exponentiated, and every column is then scaled so that its entries sum to one.

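For reference, here is a minimal CPU implementation of that column normalisation in plain C++ (the function name is illustrative), useful as a ground truth to check a GPU version against:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Column-wise exp-normalisation: B(i,j) = exp(A(i,j)) / sum_k exp(A(k,j)).
// A is m x n, stored column-major with leading dimension m, as in the question.
std::vector<double> exp_normalize_columns(const std::vector<double>& A,
                                          std::size_t m, std::size_t n)
{
    std::vector<double> B(m * n);
    for (std::size_t j = 0; j < n; ++j) {
        // sum of exponentials of column j
        double colsum = 0.0;
        for (std::size_t i = 0; i < m; ++i)
            colsum += std::exp(A[j * m + i]);
        // scale the exponentiated column so it sums to one
        for (std::size_t i = 0; i < m; ++i)
            B[j * m + i] = std::exp(A[j * m + i]) / colsum;
    }
    return B;
}
```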
3 answers
  •  被撕碎了的回忆
    2020-12-09 07:09

    You should be able to fuse the first for_each operation with the cublasSgemv call into a single reduce_by_key call. If you define/redefine functors as:

    struct Accessor : public thrust::unary_function<int, int>
    {
        int lda;
        __host__ __device__ Accessor(int _lda) : lda(_lda) {};
        __host__ __device__ int operator()(const int& idx) const
        {
            // map a flat column-major index to the column it belongs to
            return idx / lda;
        }
    };
    
    struct Exp : public thrust::unary_function<double, double>
    {
        __host__ __device__ double operator()(const double& x) const
        {
            return exp(x);
        }
    };
    
    struct Inv : public thrust::unary_function<double, double>
    {
        __host__ __device__ double operator()(const double& x) const
        {
            return double(1.0) / x;
        }
    };
    

    You can then calculate the normalised output as

    Accessor columns(m);
    thrust::reduce_by_key(
            thrust::make_transform_iterator(thrust::make_counting_iterator(int(0)), columns),
            thrust::make_transform_iterator(thrust::make_counting_iterator(int(m*n)), columns),
            thrust::make_transform_iterator(A.begin(), Exp()),
            thrust::make_discard_iterator(),
            sum.begin());
    
    // thrust::for_each discards the functor's return value, so use transform
    // to overwrite each column sum with its reciprocal in place:
    thrust::transform(sum.begin(), sum.end(), sum.begin(), Inv());
    
    // Note: A has not been exponentiated in place at this point; if the desired
    // output is exp(A) with each column scaled to sum to one, first run
    // thrust::transform(A.begin(), A.end(), A.begin(), Exp());
    cublasDdgmm(hd, CUBLAS_SIDE_RIGHT, m, n, pA, m, pSum, 1, pA, m);
    

    [disclaimer: all code written in browser and is untested, use at own risk]

    Apart from reducing the number of kernel calls, using fancy iterators eliminates the need for the large matrix of ones, which should reduce both the memory footprint and the total number of memory transactions required for the summation and exponentiation operations.
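To see why the counting-iterator/`Accessor` combination works: flat index `idx` of an m × n column-major matrix lies in column `idx / m`, so the transformed counting sequence is m copies of 0, then m copies of 1, and so on — exactly the pre-sorted keys a segmented reduction needs. A plain C++ stand-in for the fused `reduce_by_key` (names illustrative, CPU only) makes the idea concrete:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// CPU stand-in for the fused reduce_by_key: for each flat index idx the key
// is idx / m (its column) and the value is exp(A[idx]); runs of equal keys
// are summed, yielding one exp-sum per column without materialising exp(A).
std::vector<double> column_exp_sums(const std::vector<double>& A,
                                    std::size_t m, std::size_t n)
{
    std::vector<double> sums(n, 0.0);
    for (std::size_t idx = 0; idx < m * n; ++idx)
        sums[idx / m] += std::exp(A[idx]); // key = idx / m, value = exp(A[idx])
    return sums;
}
```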
