SVD speed in CPU and GPU

后端未结

关注

 4  1426

日久生厌 2021-01-02 13:57

I\'m testing svd in Matlab R2014a and it seems that there is no CPU vs GPU speedup. I\'m using a GTX 460 car

4条回答

孤独总比滥情好 (楼主)

2021-01-02 14:46
The issue

First of all, I have replicated your issue in Matlab2016b using the following code:
```
clear all
close all
clc

Nrows = 2500;
Ncols = 2500;

NumTests = 10;

h_A = rand(Nrows, Ncols);
d_A = gpuArray.rand(Nrows, Ncols);

timingCPU = 0;
timingGPU = 0;

for k = 1 : NumTests
    % --- Host
    tic
    [h_U, h_S, h_V] = svd(h_A);
%     h_S = svd(h_A);
    timingCPU = timingCPU + toc;

    % --- Device
    tic
    [d_U, d_S, d_V] = svd(d_A);
%     d_S = svd(d_A);
    timingGPU = timingGPU + toc;
end

fprintf('Timing CPU = %f; Timing GPU = %f\n', timingCPU / NumTests, timingGPU / NumTests);
```
By the above code, it is possible to either compute the singular values only or compute the full SVD including the singular vectors. It is possible also to compare the different behavior of the CPU and GPU versions of the SVD code.

The timing is reported in the following table (timing in s; Intel Core i7-6700K CPU @ 4.00GHz, 16288 MB, Max threads(8), GTX 960):
```
              Sing. values only | Full SVD         | Sing. val. only | Full
                                |                  |                 |
Matrix size   CPU      GPU      | CPU       GPU    |                 |
                                |                  |                 |
 200 x  200   0.0021    0.043   |  0.0051    0.024 |   0.098         |  0.15
1000 x 1000   0.0915    0.3     |  0.169     0.458 |   0.5           |  2.3
2500 x 2500   3.35      2.13    |  4.62      3.97  |   2.9           |  23
5000 x 5000   5.2      13.1     | 26.6      73.8   |  16.1           | 161
```
The first 4 columns refer to a comparison between the CPU and GPU Matlab versions of the svd routine when it is used to calculate the singular values only or the full SVD. As it can be seen, the GPU version can be significantly slower than the GPU one. The motivation has been already pointed out in some answers above: there is an inherent difficulty to parallelize the SVD computation.

Using cuSOLVER?

At this point, the obvious question is: can we get some speedup with cuSOLVER? Indeed, we could use mexFiles to make the cuSOLVER routines run under Matlab. Unfortunately, the situation with cuSOLVER is even worse, as it can be deduced from the last two columns of the above table. Such columns report the timing of the codes at Singular values calculation only with CUDA and Parallel implementation for multiple SVDs using CUDA using cusolverDnSgesvd for the singular values only calculation and full SVD calculation, respectively. As it can be seen, cuSOLVER's cusolverDnSgesvd performs even worser than Matlab, if one takes into account that it deals with single precision, while Matlab with double precision.

The motivation for this behavior is further explained at cusolverDnCgesvd performance vs MKL where Joe Eaton, manager of cuSOLVER library, says

I understand the confusion here. We do provide a decent speedup for LU, QR and LDL^t factorizations, which is what we would like to say for SVD as well. Our purpose with cuSOLVER is to provide dense and sparse direct solvers as part of the CUDA toolkit for the first time; we have to start somewhere. Since CULA is no longer supported, we felt it was urgent to get some functionality into the hands of developers in CUDA 7.0. Since CUDA runs on more that x86 host CPUs these days, cuSOLVER fills a need where there is no MKL. That being said, we can do better with SVD, but it will have to wait for the next CUDA release, priorities and timelines being tight already.

Using other libraries

At this point, other possibilities are using other libraries like
1. CULA;
2. MAGMA;
3. ArrayFire.
CULA is not offered for free, so I have not tried it.

I had some installation issues with MAGMA dependencies, so I have not investigated this point further (disclaimer: I expect that, with some more time, I could be able to solve such issues).

I then finally ended up with using ArrayFire.

Using ArrayFire, I had the following timing for the full SVD computation:
```
 200 x  200      0.036
1000 x 1000      0.2
2500 x 2500      4.5
5000 x 5000     29
```
As it can be seen, the timing is slightly higher, but now comparable, to the CPU case.

Here is the ArrayFire code:
```
#include 
#include 
#include 
#include 

using namespace af;

int main(int argc, char *argv[])
{
    const int N = 1000;

    try {

        // --- Select a device and display arrayfire info
        int device = argc > 1 ? atoi(argv[1]) : 0;
        af::setDevice(device);
        af::info();

        array A = randu(N, N, f64);
        af::array U, S, Vt;

        // --- Warning up
        timer time_last = timer::start();
        af::svd(U, S, Vt, A);
        S.eval();
        af::sync();
        double elapsed = timer::stop(time_last);
        printf("elapsed time using start and stop = %g ms \n", 1000.*elapsed);

        time_last = timer::start();
        af::svd(U, S, Vt, A);
        S.eval();
        af::sync();
        elapsed = timer::stop(time_last);
        printf("elapsed time using start and stop = %g ms \n", 1000.*elapsed);

    }
    catch (af::exception& e) {

        fprintf(stderr, "%s\n", e.what());
        throw;
    }

    return 0;
}
```
0 讨论(0)

查看其它4个回答
发布评论:

提交评论
- 加载中...