How to effectively normalize matrix columns in CUDA?
My matrix is stored in column-major order, and the typical size is 2000x200.
The operation can be represented as A(i,j) ← exp(A(i,j)) / Σ_k exp(A(k,j)), i.e. every entry is exponentiated and then divided by the sum of the exponentials in its column.
You should be able to fuse the first for_each operation and the cublasSgemv call into a single thrust::reduce_by_key call. If you define/redefine the functors as:
struct Accessor : public thrust::unary_function<int, int>
{
    int lda;
    __host__ __device__ Accessor(int _lda) : lda(_lda) {}
    __host__ __device__ int operator()(const int& idx) const
    {
        // map a linear index in a column-major matrix to its column number
        return idx / lda;
    }
};
struct Exp : public thrust::unary_function<double, double>
{
    __host__ __device__ double operator()(const double& x) const
    {
        return exp(x);
    }
};
struct Inv : public thrust::unary_function<double, double>
{
    __host__ __device__ double operator()(const double& x) const
    {
        return double(1.0) / x;
    }
};
You can then calculate the normalised output as
Accessor columns(m);
thrust::reduce_by_key(
    thrust::make_transform_iterator(thrust::make_counting_iterator(int(0)), columns),
    thrust::make_transform_iterator(thrust::make_counting_iterator(int(m*n)), columns),
    thrust::make_transform_iterator(A.begin(), Exp()),
    thrust::make_discard_iterator(),
    sum.begin());

// for_each discards the functor's return value, so use transform to
// overwrite sum with its elementwise reciprocal in place
thrust::transform(sum.begin(), sum.end(), sum.begin(), Inv());

cublasDdgmm(hd, CUBLAS_SIDE_RIGHT, m, n, pA, m, pSum, 1, pA, m);
[disclaimer: all code written in browser and is untested, use at own risk]
Apart from reducing the number of kernel calls, using fancy iterators eliminates the need for the large unit matrix, which should reduce both the memory footprint and the total number of memory transactions required for the summation and exponentiation operations.