Impact of matrix sparsity on cblas sgemm in Ubuntu 14.04


Question


I have recently discovered that the performance of a cblas_sgemm call for matrix multiplication improves dramatically if the matrices have a "large" number of zeros in them. It improves to the point that it beats its cuBLAS cousin by around 100 times. This can most probably be attributed to some automatic detection of sparsity, and a suitable format conversion, by the cblas_sgemm function.

Unfortunately, its CUDA counterpart, cublasSgemm, exhibits no such behavior.

So, the question is: how can I get the same kind of optimization from cublasSgemm for matrices that may have a large number of zeros?

And what technique does cblas_sgemm use to automatically adjust to sparse matrices?

Please do not recommend cuSPARSE / CUSP etc., because:

  1. I am not sure about the sparsity of the input matrices beforehand
  2. I am working on an iterative algorithm where the matrices may be sparse for the first few iterations but gradually become dense as time goes on.

Thanks in advance

Edited to include code to reproduce the above scenario

#include <iostream>
#include <stdio.h>
#include <time.h>
#include <cblas.h>
#include <cublas_v2.h>
using namespace std;

int main()
{
    const int m = 5000;

    timespec blas_start, blas_end, cublas_start, cublas_end;
    long totalnsec;  // total nanoseconds
    double totalsec, totaltime;
    int i, j;
    float *A = new float[m];    // 1 x m
    float *B = new float[m*m];  // m x m
    float *C = new float[m];    // 1 x m

    // input matrix A: every 32nd element is non-zero
    for(i = 0; i < m; i++)
    {
        A[i] = 0;
        if(i % 32 == 0)    // adjust for sparsity
            A[i] = i;
    }

    // input matrix B: identity matrix
    // (col major = row major, since B is symmetric)
    for(i = 0; i < m; i++)
        for(j = 0; j < m; j++)
        {
            if(i == j)
                B[j*m + i] = 1;
            else
                B[j*m + i] = 0;
        }

    clock_gettime(CLOCK_REALTIME, &blas_start);
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1, A, m, B, m, 0, C, m);
    clock_gettime(CLOCK_REALTIME, &blas_end);
    /*
    for(i = 0; i < 12; i++)
        printf("%f ", C[i]);
    */

    // cublas section

    cudaError_t cudaStat;
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Declaring device variables
    float *A_d, *B_d, *C_d;

    // Allocating memory for device variables
    cudaStat = cudaMalloc(&A_d, sizeof(float)*m);
    if(cudaStat != cudaSuccess) printf("Error allocating memory for A_d\n");

    cudaStat = cudaMalloc(&B_d, sizeof(float)*m*m);
    if(cudaStat != cudaSuccess) printf("Error allocating memory for B_d\n");

    cudaStat = cudaMalloc(&C_d, sizeof(float)*m);
    if(cudaStat != cudaSuccess) printf("Error allocating memory for C_d\n");

    // Moving values of A, B onto device variables
    cublasSetVector(m, sizeof(float), A, 1, A_d, 1);
    cublasSetMatrix(m, m, sizeof(float), B, m, B_d, m);

    // Do the actual multiplication
    float alpha = 1.0f, beta = 0.0f;
    cudaDeviceSynchronize();
    clock_gettime(CLOCK_REALTIME, &cublas_start);

    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 1, m, m, &alpha, A_d, 1, B_d, m, &beta, C_d, 1);

    cudaDeviceSynchronize();
    clock_gettime(CLOCK_REALTIME, &cublas_end);

    cublasGetVector(m, sizeof(float), C, 1, C_d, 1);
    /*
    for(i = 0; i < 12; i++)
        printf("%f ", C[i]);
    */

    // Print times
    // blas time
    totalsec = (double)blas_end.tv_sec - (double)blas_start.tv_sec;
    totalnsec = blas_end.tv_nsec - blas_start.tv_nsec;
    if(totalnsec < 0)
    {
        totalnsec += 1000000000L;
        totalsec -= 1;
    }
    totaltime = totalsec + (double)totalnsec*1e-9;
    cout << "BLAS Time = " << totaltime << "\n";

    // cublas time
    totalsec = (double)cublas_end.tv_sec - (double)cublas_start.tv_sec;
    totalnsec = cublas_end.tv_nsec - cublas_start.tv_nsec;
    if(totalnsec < 0)
    {
        totalnsec += 1000000000L;
        totalsec -= 1;
    }
    totaltime = totalsec + (double)totalnsec*1e-9;
    cout << "CUBLAS Time = " << totaltime << "\n";

    // Cleanup
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
    cublasDestroy(handle);
    delete[] A; delete[] B; delete[] C;

    return 0;
}

I ran it and got the following results:

malang@ubuntu:~/uas/stackoverflow$ nvcc -arch=sm_12 blascomp.cu -o blascomp.o -lblas -lcublas
malang@ubuntu:~/uas/stackoverflow$ ./blascomp.o
BLAS Time = 0.000964504
CUBLAS Time = 0.0365322

EDIT

Edited after @Eric's answer

Using cublasSgemv has greatly improved the performance on the GPU. But I still have the problem of cblas_sgemm being much more efficient than cuBLAS for sparse matrices on the CPU. What could be the possible reasons?

EDIT: Executed the following commands at the suggestion of @Eric, @osgx, and @Robert Crovella:

erisp@ubuntu:~/uas/stackoverflow$ ldd ./gemmcomp.o
    linux-gate.so.1 =>  (0xb76f6000)
    libblas.so.3 => /usr/lib/libblas.so.3 (0xb765e000)
    libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb7576000)
    libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb73c7000)
    libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7381000)
/lib/ld-linux.so.2 (0xb76f7000)
    libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb7364000)

erisp@ubuntu:~/uas/stackoverflow$ ll -d /usr/lib/libblas* /etc/alternatives/libblas.*
lrwxrwxrwx 1 root root   26 Mar 13  2015 /etc/alternatives/libblas.a -> /usr/lib/libblas/libblas.a
lrwxrwxrwx 1 root root   27 Mar 13  2015 /etc/alternatives/libblas.so -> /usr/lib/libblas/libblas.so
lrwxrwxrwx 1 root root   29 Mar 13  2015 /etc/alternatives/libblas.so.3 -> /usr/lib/libblas/libblas.so.3
lrwxrwxrwx 1 root root   29 Mar 13  2015 /etc/alternatives/libblas.so.3gf -> /usr/lib/libblas/libblas.so.3
drwxr-xr-x 2 root root 4096 Mar 13  2015 /usr/lib/libblas/
lrwxrwxrwx 1 root root   27 Mar 13  2015 /usr/lib/libblas.a -> /etc/alternatives/libblas.a
lrwxrwxrwx 1 root root   28 Mar 13  2015 /usr/lib/libblas.so -> /etc/alternatives/libblas.so
lrwxrwxrwx 1 root root   30 Mar 13  2015 /usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3
lrwxrwxrwx 1 root root   32 Mar 13  2015 /usr/lib/libblas.so.3gf -> /etc/alternatives/libblas.so.3gf

Answer 1:


Your code has a problem: you are using the wrong BLAS API. You use the matrix-matrix multiplication routine gemm() to do a vector-matrix multiplication.

For vector-matrix or matrix-vector multiplication you should use gemv(). Of course gemm() can give the correct result for a matrix with only one row, but this is an unexpected corner case, and it is exactly the case gemv() is designed to handle, so you may not get peak performance on the GPU and/or the CPU.

You could change to gemv() and benchmark again.
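For reference, here is a minimal sketch of how the two timed calls could be rewritten with gemv(), assuming the same A, B, C, A_d, B_d, C_d, alpha and beta as in the question's code. Since the operation is the row vector A times B, the matrix argument becomes B with a transpose op:

// CPU: C (1 x m) = A (1 x m) * B (m x m), i.e. C = B^T * A as a column vector
cblas_sgemv(CblasRowMajor, CblasTrans, m, m, 1.0f, B, m, A, 1, 0.0f, C, 1);

// GPU: same operation; cuBLAS is column-major, so op(B) = B^T here as well
cublasSgemv(handle, CUBLAS_OP_T, m, m, &alpha, B_d, m, A_d, 1, &beta, C_d, 1);

(For the identity B in the question the transpose makes no numerical difference, but it is the faithful mapping of C = A*B.)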


EDIT

Here is my benchmark result with single-threaded MKL; the values of A and B are the same as in your code. I cannot reproduce your result of 0.000964504 s on the CPU. You could check the correctness of your result vector; there is a chance that your cblas library has a bug.

Using gemm()

BLAS Time = 0.0169784
CUBLAS Time = 0.00356155

Using gemv()

BLAS Time = 0.0167557
CUBLAS Time = 0.0013809

EDIT2

I can now reproduce the FAST result on Ubuntu 14.04 with the package libblas-dev.

The reason is answered in the following question:

cblas gemm time dependent on input matrix values - Ubuntu 14.04

In that particular version of BLAS, there is code that checks for zero elements. The checking cost is O(n^2), so it is worth doing for matrix-matrix multiplication, whose cost is O(n^3).
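As a rough illustration (a sketch of the technique, not the literal library source), a reference-style sgemm inner loop with such a zero check could look like the following; every zero element of B skips an O(n) inner loop, so a mostly-zero input avoids most of the O(n^3) work:

// Sketch: C = A*B for n x n column-major matrices, C assumed zero-initialized
// (the beta == 0 case). The checks on B(l,j) cost O(n^2) in total, and each
// hit skips an O(n) update loop.
void sgemm_ref(int n, const float *A, const float *B, float *C)
{
    for (int j = 0; j < n; j++)
        for (int l = 0; l < n; l++)
        {
            float temp = B[j*n + l];       // B(l,j)
            if (temp != 0.0f)              // the sparsity check
                for (int i = 0; i < n; i++)
                    C[j*n + i] += temp * A[l*n + i];   // C(:,j) += B(l,j) * A(:,l)
        }
}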

For GPU gemm(), because the order of computation is different (block by block instead of line by line), such an optimization may not be feasible. But it is doable for GPU gemv(), where the time spent loading the matrix from global memory could be saved.
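As a hypothetical illustration (a hand-written sketch, not what cuBLAS actually does), a gemv-style kernel for the 1 x m case in the question could skip the global-memory load from B whenever the corresponding vector element is zero. One thread computes one output element, with B taken as row-major (identical to column-major here, since B is symmetric):

// Sketch: C[j] = sum_i A[i] * B[i*m + j], one thread per j.
// A is small and effectively cached; a zero in A[i] skips the
// expensive global-memory read of B[i*m + j].
__global__ void sgemv_skip_zeros(int m, const float *A, const float *B, float *C)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= m) return;
    float sum = 0.0f;
    for (int i = 0; i < m; i++)
    {
        float a = A[i];
        if (a != 0.0f)              // the sparsity check
            sum += a * B[i*m + j];  // coalesced across threads in a warp
    }
    C[j] = sum;
}

Because every thread in a warp tests the same A[i], the branch is uniform across the warp and costs little.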



Source: https://stackoverflow.com/questions/38031089/impact-of-matrix-sparsity-on-cblas-sgemm-in-ubuntu-14-04
