How can I accelerate a sparse matrix by dense vector product, currently implemented via scipy.sparse.csc_matrix.dot, using CUDA?

Question

My ultimate goal is to accelerate a matrix-vector product in Python, potentially by using a CUDA-enabled GPU. The matrix A is about 15k x 15k and sparse (density ~ 0.05), the vector x is dense with 15k elements, and I am computing Ax. I have to perform this computation many times, so I want it to be as fast as possible.

My current non-GPU “optimization” is to represent A as a scipy.sparse.csc_matrix object and simply compute A.dot(x), but I was hoping to speed this up on a VM with a couple of NVIDIA GPUs attached, using only Python if possible (i.e. without writing the kernel functions by hand). I’ve succeeded in accelerating dense matrix-vector products using the cudamat library, but not the sparse case. There are a handful of suggestions for the sparse case online, such as using pycuda, scikit-cuda, or anaconda’s accelerate package, but there is little information available, so it’s hard to know where to begin.

I don’t need greatly detailed instructions, but if anyone has solved this before and could provide a “big picture” roadmap for the simplest way of doing this, or has an idea of the sort of speed up a sparse GPU-based matrix-vector product would have over scipy’s sparse algorithms, that would be very helpful.

Answer 1:

As pointed out in the comments, NVIDIA ships the cuSPARSE library, which includes functions for products of sparse matrices with dense vectors.

Numba now has Python bindings for the cuSPARSE library via the pyculib package.
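
To make the operation concrete, here is a minimal NumPy sketch of the CSR matrix-vector product that a GPU routine of this kind computes. It is a plain reference loop for illustration only, not cuSPARSE's or pyculib's implementation; the function name `csr_matvec` and the tiny 3x3 matrix are made up for the example.

```python
import numpy as np


def csr_matvec(data, indices, indptr, x):
    """Reference y = A @ x for a matrix A stored in CSR form.

    data    -- nonzero values, stored row by row
    indices -- column index of each nonzero
    indptr  -- indptr[i]:indptr[i + 1] is the slice of row i in data/indices
    """
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows, dtype=np.result_type(data, x))
    for i in range(n_rows):
        start, end = indptr[i], indptr[i + 1]
        # Dot product of the stored entries of row i with the matching x entries.
        y[i] = np.dot(data[start:end], x[indices[start:end]])
    return y


# Hypothetical 3x3 example matrix:
# [[1, 0, 2],
#  [0, 0, 3],
#  [4, 5, 0]]
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
indices = np.array([0, 2, 2, 0, 1])
indptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 2.0, 3.0])

y = csr_matvec(data, indices, indptr, x)
print(y)
```

The three arrays correspond to the `data`, `indices` and `indptr` attributes of a `scipy.sparse.csr_matrix`. Each output row depends only on its own slice of the arrays, which is what makes the operation easy to spread over many GPU threads.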

Answer 2:

Thanks for the suggestions.

I managed to get pyculib’s csrmm operation (matrix multiplication for matrices in compressed sparse row format) to work with the code below, running on 2 NVIDIA K80 GPUs on Google Cloud Platform, but unfortunately wasn’t able to achieve a speedup.

I assume this is because most of the time in the csrmm function is spent transferring data to/from the GPU, as opposed to actually doing the computations. Unfortunately, I couldn’t figure out any straightforward pyculib way to get the arrays onto the GPU in the first place and keep them there over iterations. The code I used is:

import numpy as np
from scipy.sparse import csr_matrix
from pyculib.sparse import Sparse
from time import time


def spmv_cuda(a_sparse, b, sp, count):
    """Compute a_sparse x b."""

    # args to csrmm call
    trans_a = 'N'  # non-transpose, use 'T' for transpose or 'C' for conjugate transpose
    m = a_sparse.shape[0]  # num rows in a
    n = b.shape[1]  # num cols in b, c
    k = a_sparse.shape[1]  # num cols in a
    nnz = len(a_sparse.data)  # num nonzero in a
    alpha = 1  # no scaling
    descr_a = sp.matdescr(  # matrix descriptor
        indexbase=0,  # 0-based indexing
        matrixtype='G',  # 'general': no symmetry or triangular structure
    )
    csr_val_a = a_sparse.data  # csr data
    csr_row_ptr_a = a_sparse.indptr  # csr row pointers
    csr_col_ind_a = a_sparse.indices  # csr col idxs
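    ldb = b.shape[0]  # leading dimension of b (k rows)
    beta = 0  # ignore the previous contents of c
    c = np.empty((m, n), dtype=a_sparse.dtype)
    ldc = c.shape[0]  # leading dimension of c (m rows), not b's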

    # call function
    tic = time()
    for ii in range(count):
        sp.csrmm(
            transA=trans_a,
            m=m,
            n=n,
            k=k,
            nnz=nnz,
            alpha=alpha,
            descrA=descr_a,
            csrValA=csr_val_a,
            csrRowPtrA=csr_row_ptr_a,
            csrColIndA=csr_col_ind_a,
            B=b,
            ldb=ldb,
            beta=beta,
            C=c,
            ldc=ldc)
    toc = time()

    return c, toc - tic

# run benchmark
COUNT = 20
N = 5000
P = 0.1

print('Constructing objects...\n\n')
np.random.seed(0)
a_dense = np.random.rand(N, N).astype(np.float32)
a_dense[np.random.rand(N, N) >= P] = 0
a_sparse = csr_matrix(a_dense)

b = np.random.rand(N, 1).astype(np.float32)

sp = Sparse()

# scipy sparse
tic = time()
for ii in range(COUNT):
    c = a_sparse.dot(b)
toc = time()

print('scipy sparse matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}'.format(c[:5, 0]))

# pyculib sparse

c, t = spmv_cuda(a_sparse, b, sp, COUNT)

print('pyculib sparse matrix multiplication took {} seconds\n'.format(t))
print('c = {}'.format(c[:5, 0]))

which yields the output:

Constructing objects...

scipy sparse matrix multiplication took 0.05158638954162598 seconds

c = [ 122.29484558  127.83656311  128.75004578  130.69120789  124.98323059]

Testing pyculib sparse matrix multiplication...

pyculib sparse matrix multiplication took 0.12598299980163574 seconds

c = [ 122.29483032  127.83659363  128.75003052  130.6912384   124.98326111]

As you can see, pyculib is more than twice as slow, even though the matrix multiplication runs on the GPU, again probably because of the overhead of transferring data to and from the GPU at each iteration.
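
A rough estimate is consistent with that explanation. The sketch below counts the bytes a CSR matrix of the benchmark's size occupies and divides by an assumed host-to-device bandwidth. The 10 GB/s figure is a placeholder assumption, not a measured value for this VM, and the helper names `csr_bytes` and `transfer_seconds` are made up for the example.

```python
def csr_bytes(n_rows, n_cols, density, value_bytes=4, index_bytes=4):
    """Approximate size in bytes of an n_rows x n_cols CSR matrix."""
    nnz = round(n_rows * n_cols * density)
    # values + column indices (one each per nonzero) + row pointers
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


def transfer_seconds(n_bytes, bandwidth_bytes_per_s):
    """Time to move n_bytes at the given sustained bandwidth."""
    return n_bytes / bandwidth_bytes_per_s


N, P, COUNT = 5000, 0.1, 20      # sizes from the pyculib benchmark
ASSUMED_PCIE_BANDWIDTH = 10e9    # assumption: ~10 GB/s host-to-device

matrix_bytes = csr_bytes(N, N, P)
per_call = transfer_seconds(matrix_bytes, ASSUMED_PCIE_BANDWIDTH)

print('CSR matrix size: {:.1f} MB'.format(matrix_bytes / 1e6))
print('copy per call:   {:.2f} ms'.format(per_call * 1e3))
print('copy x {} calls: {:.1f} ms'.format(COUNT, per_call * COUNT * 1e3))
```

Under that assumption, and if the matrix really is copied on every call, the copies alone cost about 2 ms per call, or roughly 40 ms over the 20 calls. That is already close to the 0.05 s scipy needed for the whole loop, before any computation on the GPU.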

An alternative solution I found, however, was to use Andreas Kloeckner’s pycuda library, which yielded a 50x speed up!

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.sparse.packeted import PacketedSpMV
from pycuda.tools import DeviceMemoryPool
from scipy.sparse import csr_matrix
from time import time


def spmv_cuda(a_sparse, b, count):

    dtype = a_sparse.dtype
    m = a_sparse.shape[0]

    print('moving objects to GPU...')

    spmv = PacketedSpMV(a_sparse, is_symmetric=False, dtype=dtype)

    dev_pool = DeviceMemoryPool()
    d_b = gpuarray.to_gpu(b, dev_pool.allocate)
    d_c = gpuarray.zeros(m, dtype=dtype, allocator=d_b.allocator)

    print('executing spmv operation...\n')

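    tic = time()
    for ii in range(count):
        d_c.fill(0)
        d_c = spmv(d_b, d_c)
    # NOTE: kernel launches are asynchronous; add drv.Context.synchronize()
    # here if the timing should include the GPU execution itself.
    toc = time()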

    return d_c.get(), toc - tic


# run benchmark
COUNT = 100
N = 5000
P = 0.1
DTYPE = np.float32

print('Constructing objects...\n\n')
np.random.seed(0)
a_dense = np.random.rand(N, N).astype(DTYPE)
a_dense[np.random.rand(N, N) >= P] = 0
a_sparse = csr_matrix(a_dense)

b = np.random.rand(N, 1).astype(DTYPE)

# numpy dense
tic = time()
for ii in range(COUNT):
    c = np.dot(a_dense, b)
toc = time()

print('numpy dense matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}\n'.format(c[:5, 0]))

# scipy sparse
tic = time()
for ii in range(COUNT):
    c = a_sparse.dot(b)
toc = time()

print('scipy sparse matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}\n'.format(c[:5, 0]))

# pycuda sparse
c, t = spmv_cuda(a_sparse, b, COUNT)
print('pycuda sparse matrix multiplication took {} seconds\n'.format(t))
print('c = {}\n'.format(c[:5]))

which yields this output:

numpy dense matrix multiplication took 0.2290663719177246 seconds

c = [ 122.29484558  127.83656311  128.75004578  130.69120789  124.98323059]

scipy sparse matrix multiplication took 0.24468040466308594 seconds

c = [ 122.29484558  127.83656311  128.75004578  130.69120789  124.98323059]

moving objects to GPU...
executing spmv operation...

pycuda sparse matrix multiplication took 0.004545450210571289 seconds

c = [ 122.29484558  127.83656311  128.75004578  130.69120789  124.98323059]
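
GPU libraries typically queue work asynchronously, so a timing loop is only meaningful if the clock is read after the device has finished. Below is a small, library-agnostic timing helper as a sketch. `time_loop` and its `sync` parameter are hypothetical names; on a GPU one would pass the library's own synchronization call as `sync` (with pycuda, `pycuda.driver.Context.synchronize`).

```python
from time import perf_counter


def time_loop(fn, count, sync=None):
    """Return the average seconds per call of fn() over count calls.

    sync is an optional zero-argument callable invoked before each clock
    read, so that queued asynchronous work is included in the measurement.
    """
    if sync is not None:
        sync()  # drain any work queued before the measurement starts
    tic = perf_counter()
    for _ in range(count):
        fn()
    if sync is not None:
        sync()  # wait for the timed work to finish
    toc = perf_counter()
    return (toc - tic) / count


# CPU-only usage example with a stand-in workload.
calls = []
avg = time_loop(lambda: calls.append(sum(range(1000))), count=50)
print('average: {:.3e} s per call'.format(avg))
```

Reporting the time per call rather than the total also makes runs with different iteration counts directly comparable.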

Note 1: pycuda requires the following dependencies:

pymetis: install using: pip install pymetis
nvcc: install using: sudo apt install nvidia-cuda-toolkit

Note 2: for some reason pip install pycuda fails to install the file pkt_build_cython.pyx, so you’ll need to download/copy it yourself from https://github.com/inducer/pycuda/blob/master/pycuda/sparse/pkt_build_cython.pyx.

Answer 3:

Another solution is to use tensorflow's matrix multiplication functions. Once GPU-enabled tensorflow is up and running, these work out-of-the-box.

After installing CUDA and tensorflow-gpu (the setup is involved but straightforward), you can use tensorflow's SparseTensor class and sparse_tensor_dense_matmul function as follows:

import numpy as np
import tensorflow as tf
from tensorflow.python.client import device_lib
from time import time

Make sure GPU is detected:

gpus = [x.name for x in device_lib.list_local_devices() if x.device_type == 'GPU']
print('GPU DEVICES:\n  {}'.format(gpus))

Output:

GPU DEVICES:
  ['/device:GPU:0']

Benchmarks:

from scipy.sparse import csr_matrix

ITERS = 30
N = 20000
P = 0.1  # matrix density

Using scipy:

np.random.seed(0)

a_dense = np.random.rand(N, N)
a_dense[a_dense > P] = 0
a_sparse = csr_matrix(a_dense)

b = np.random.rand(N)

tic = time()
for ii in range(ITERS):
    c = a_sparse.dot(b)
toc = time()

elapsed = toc - tic

print('Scipy spmv product took {} seconds per iteration.'.format(elapsed/ITERS))

Output:

Scipy spmv product took 0.06693172454833984 seconds per iteration.

Using GPU-enabled tensorflow:

with tf.device('/device:GPU:0'):

    np.random.seed(0)

    a_dense = np.random.rand(N, N)
    a_dense[a_dense > P] = 0

    indices = np.transpose(a_dense.nonzero())
    values = a_dense[indices[:, 0], indices[:, 1]]
    dense_shape = a_dense.shape

    a_sparse = tf.SparseTensor(indices, values, dense_shape)

    b = tf.constant(np.random.rand(N, 1))

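    # NOTE: in TF 1.x graph mode (no eager execution), this loop only adds
    # ops to the graph; evaluate c in a tf.Session to time the computation.
    tic = time()
    for ii in range(ITERS):
        c = tf.sparse_tensor_dense_matmul(a_sparse, b)
    toc = time()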

elapsed = toc - tic

print('GPU spmv product took {} seconds per iteration.'.format(elapsed/ITERS))

Output:

GPU spmv product took 0.0011811971664428711 seconds per iteration.

Quite a nice speed-up, it turns out.
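
The indices/values/dense_shape triple passed to tf.SparseTensor above is a coordinate (COO) description of the matrix. As a small NumPy-only sketch of that conversion (the 3x3 matrix is made up, and no TensorFlow is involved), the following builds the triple and checks that it reproduces the dense product:

```python
import numpy as np

a_dense = np.array([[0.0, 2.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 3.0, 4.0]])
b = np.array([[1.0], [2.0], [3.0]])

# One (row, col) pair per nonzero, in row-major order.
indices = np.transpose(a_dense.nonzero())
values = a_dense[indices[:, 0], indices[:, 1]]
dense_shape = a_dense.shape

# Sparse-times-dense product from the COO triple: scatter-add each
# value * b[col] into the output row it belongs to.
c = np.zeros((dense_shape[0], b.shape[1]))
np.add.at(c, indices[:, 0], values[:, None] * b[indices[:, 1]])

print(indices.tolist())
print(c.ravel().tolist())
print(np.allclose(c, a_dense @ b))
```

Storing only the nonzero coordinates and values is what keeps the memory footprint proportional to the number of nonzeros rather than to N x N.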

Answer 4:

Another alternative is to use the CuPy package. It has the same interface as numpy/scipy (which is nice) and, for me at least, it turned out to be much easier to install than pycuda. The new code would look something like this:

import cupy as cp
from cupyx.scipy.sparse import csr_matrix as csr_gpu

A = some_sparse_matrix  # placeholder: a scipy.sparse.csr_matrix
x = some_dense_vector   # placeholder: a numpy.ndarray
niter = 100             # placeholder: number of products to compute

A_gpu = csr_gpu(A)   # move A to the GPU
x_gpu = cp.array(x)  # move x to the GPU

for i in range(niter):
    x_gpu = A_gpu.dot(x_gpu)
x = cp.asnumpy(x_gpu)  # back to a numpy array for fast indexing

Source: https://stackoverflow.com/questions/49019189/how-can-i-accelerate-a-sparse-matrix-by-dense-vector-product-currently-implemen

Tags

python

matrix

cuda

gpu

sparse-matrix