SVD speed in CPU and GPU

日久生厌 2021-01-02 13:57

I'm testing svd in Matlab R2014a and it seems that there is no CPU vs GPU speedup. I'm using a GTX 460 car

4 answers
  •  孤独总比滥情好
    2021-01-02 14:46

    The issue

    First of all, I have replicated your issue in Matlab R2016b using the following code:

    clear all
    close all
    clc
    
    Nrows = 2500;
    Ncols = 2500;
    
    NumTests = 10;
    
    h_A = rand(Nrows, Ncols);
    d_A = gpuArray.rand(Nrows, Ncols);
    
    timingCPU = 0;
    timingGPU = 0;
    
    for k = 1 : NumTests
        % --- Host
        tic
        [h_U, h_S, h_V] = svd(h_A);
    %     h_S = svd(h_A);
        timingCPU = timingCPU + toc;
    
        % --- Device
        tic
        [d_U, d_S, d_V] = svd(d_A);
    %     d_S = svd(d_A);
        timingGPU = timingGPU + toc;
    end
    
    fprintf('Timing CPU = %f; Timing GPU = %f\n', timingCPU / NumTests, timingGPU / NumTests);
    

    With the above code, one can compute either the singular values only (by switching to the commented-out calls) or the full SVD including the singular vectors, and compare the behavior of the CPU and GPU versions of svd in both cases.

    The timing is reported in the following table (timing in s; Intel Core i7-6700K CPU @ 4.00GHz, 16288 MB, Max threads(8), GTX 960):

                  Matlab svd (double precision)        | cuSOLVER cusolverDnSgesvd (single precision)
                  Sing. values only | Full SVD         | Sing. val. only | Full SVD
                                    |                  |                 |
    Matrix size   CPU      GPU      | CPU       GPU    | GPU             | GPU
                                    |                  |                 |
     200 x  200   0.0021    0.043   |  0.0051    0.024 |   0.098         |  0.15
    1000 x 1000   0.0915    0.3     |  0.169     0.458 |   0.5           |  2.3
    2500 x 2500   3.35      2.13    |  4.62      3.97  |   2.9           |  23
    5000 x 5000   5.2      13.1     | 26.6      73.8   |  16.1           | 161
    

    The first 4 columns compare the CPU and GPU Matlab versions of the svd routine when it is used to calculate the singular values only or the full SVD. As can be seen, the GPU version can be significantly slower than the CPU one. The reason has already been pointed out in other answers: the SVD computation is inherently difficult to parallelize.
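
    To make the comparison concrete, the CPU/GPU ratios implied by the first four columns of the table can be computed directly. The following is only a small sketch: the names Row, speedup, matlab_timings and print_speedups are made up for this illustration, and the numbers are copied from the table above.

    ```cpp
    #include <cassert>
    #include <cstdio>
    #include <vector>

    // One row of the Matlab part of the table (all times in seconds).
    // The struct and its field names are illustrative, not from the original post.
    struct Row {
        int n;            // the matrix is n x n
        double cpu_vals;  // CPU, singular values only
        double gpu_vals;  // GPU, singular values only
        double cpu_full;  // CPU, full SVD
        double gpu_full;  // GPU, full SVD
    };

    // Speedup of the GPU over the CPU: a value above 1 means the GPU is faster.
    double speedup(double cpu_seconds, double gpu_seconds) {
        assert(gpu_seconds > 0.0);
        return cpu_seconds / gpu_seconds;
    }

    // Timings copied from the first four columns of the table.
    std::vector<Row> matlab_timings() {
        return {
            { 200, 0.0021,  0.043,  0.0051,  0.024},
            {1000, 0.0915,  0.3,    0.169,   0.458},
            {2500, 3.35,    2.13,   4.62,    3.97},
            {5000, 5.2,    13.1,   26.6,    73.8},
        };
    }

    // Prints one line per matrix size with both speedup ratios.
    void print_speedups() {
        for (const Row& r : matlab_timings()) {
            std::printf("%4d x %-4d  values only: %5.2fx   full SVD: %5.2fx\n",
                        r.n, r.n,
                        speedup(r.cpu_vals, r.gpu_vals),
                        speedup(r.cpu_full, r.gpu_full));
        }
    }
    ```

    Calling print_speedups() from a main function shows that the GPU only wins at 2500 x 2500 (roughly 1.6x for the singular values only), while at 5000 x 5000 the full SVD on the GPU runs at roughly 0.36x the CPU speed.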

    Using cuSOLVER?

    At this point, the obvious question is: can we get some speedup with cuSOLVER? Indeed, we could use mex files to run the cuSOLVER routines from Matlab. Unfortunately, the situation with cuSOLVER is even worse, as the last two columns of the above table show. Those columns report the timings of the codes from the posts "Singular values calculation only with CUDA" and "Parallel implementation for multiple SVDs using CUDA", which use cusolverDnSgesvd for the singular-values-only and the full SVD calculation, respectively. As can be seen, cuSOLVER's cusolverDnSgesvd performs even worse than Matlab, all the more so considering that it works in single precision while Matlab works in double precision.

    The reason for this behavior is further explained in the thread "cusolverDnCgesvd performance vs MKL", where Joe Eaton, manager of the cuSOLVER library, says:

    I understand the confusion here. We do provide a decent speedup for LU, QR and LDL^t factorizations, which is what we would like to say for SVD as well. Our purpose with cuSOLVER is to provide dense and sparse direct solvers as part of the CUDA toolkit for the first time; we have to start somewhere. Since CULA is no longer supported, we felt it was urgent to get some functionality into the hands of developers in CUDA 7.0. Since CUDA runs on more that x86 host CPUs these days, cuSOLVER fills a need where there is no MKL. That being said, we can do better with SVD, but it will have to wait for the next CUDA release, priorities and timelines being tight already.

    Using other libraries

    At this point, the remaining option is to try other libraries, such as

    1. CULA;
    2. MAGMA;
    3. ArrayFire.

    CULA is not offered for free, so I have not tried it.

    I had some installation issues with the MAGMA dependencies, so I have not investigated this option further (disclaimer: I expect that, with some more time, I could solve such issues).

    I finally ended up using ArrayFire.

    Using ArrayFire, I got the following timings for the full SVD computation (in s, as in the table above):

     200 x  200      0.036
    1000 x 1000      0.2
    2500 x 2500      4.5
    5000 x 5000     29
    

    As can be seen, the timing is slightly higher than in the CPU case, but now comparable to it.

    Here is the ArrayFire code:

    #include <arrayfire.h>
    #include <cstdio>
    #include <cstdlib>
    
    using namespace af;
    
    int main(int argc, char *argv[])
    {
        const int N = 1000;
    
        try {
    
            // --- Select a device and display arrayfire info
            int device = argc > 1 ? atoi(argv[1]) : 0;
            af::setDevice(device);
            af::info();
    
            array A = randu(N, N, f64);
            af::array U, S, Vt;
    
            // --- Warm-up run (includes one-time initialization overhead)
            timer time_last = timer::start();
            af::svd(U, S, Vt, A);
            S.eval();
            af::sync();
            double elapsed = timer::stop(time_last);
            printf("elapsed time using start and stop = %g ms \n", 1000.*elapsed);
    
            // --- Timed run
            time_last = timer::start();
            af::svd(U, S, Vt, A);
            S.eval();
            af::sync();
            elapsed = timer::stop(time_last);
            printf("elapsed time using start and stop = %g ms \n", 1000.*elapsed);
    
        }
        catch (af::exception& e) {
    
            fprintf(stderr, "%s\n", e.what());
            throw;
        }
    
        return 0;
    }
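
    The warm-up-then-measure pattern used above can be factored into a small helper that averages over several runs, similar to what the Matlab loop does with NumTests. This is a minimal sketch using only the C++ standard library; the name average_seconds and its parameters are hypothetical and not part of ArrayFire.

    ```cpp
    #include <cassert>
    #include <chrono>
    #include <functional>

    // Runs `work` `warmup` times without timing it, then `runs` times with timing,
    // and returns the average wall-clock time per timed run in seconds.
    // `work` must block until its result is ready; for a GPU library this means
    // synchronizing inside the callable (as af::sync() does in the listing above).
    double average_seconds(const std::function<void()>& work, int warmup, int runs) {
        assert(warmup >= 0 && runs > 0);

        // Untimed warm-up runs absorb one-time costs such as initialization.
        for (int i = 0; i < warmup; ++i) work();

        const auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < runs; ++i) work();
        const auto stop = std::chrono::steady_clock::now();

        return std::chrono::duration<double>(stop - start).count() / runs;
    }
    ```

    With ArrayFire, the callable would wrap the three calls from the listing above (af::svd, S.eval() and af::sync()), for example with warmup = 1 and runs = 10. Averaging over several runs reduces the run-to-run noise of a single measurement.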
    
