I'm testing svd in Matlab R2014a and it seems that there is no CPU vs GPU speedup. I'm using a GTX 460 car
The issue
First of all, I have replicated your issue in Matlab R2016b using the following code:
clear all
close all
clc
Nrows = 2500;
Ncols = 2500;
NumTests = 10;
h_A = rand(Nrows, Ncols);
d_A = gpuArray.rand(Nrows, Ncols);
timingCPU = 0;
timingGPU = 0;
for k = 1 : NumTests
    % --- Host
    tic
    [h_U, h_S, h_V] = svd(h_A);
    % h_S = svd(h_A);
    timingCPU = timingCPU + toc;
    % --- Device
    tic
    [d_U, d_S, d_V] = svd(d_A);
    % d_S = svd(d_A);
    timingGPU = timingGPU + toc;
end
fprintf('Timing CPU = %f; Timing GPU = %f\n', timingCPU / NumTests, timingGPU / NumTests);
The code above can compute either the singular values only (the commented-out svd calls) or the full SVD including the singular vectors. It also compares the behavior of the CPU and GPU versions of the SVD code.
The timing is reported in the following table (timing in s; Intel Core i7-6700K CPU @ 4.00GHz, 16288 MB, Max threads(8), GTX 960):
              |  Matlab, sing. values only |  Matlab, full SVD  |  cuSOLVER
Matrix size   |  CPU         GPU           |  CPU      GPU      |  Sing. values only   Full SVD
--------------+----------------------------+--------------------+------------------------------
200 x 200     |  0.0021      0.043         |  0.0051   0.024    |  0.098               0.15
1000 x 1000   |  0.0915      0.3           |  0.169    0.458    |  0.5                 2.3
2500 x 2500   |  3.35        2.13          |  4.62     3.97     |  2.9                 23
5000 x 5000   |  5.2         13.1          |  26.6     73.8     |  16.1                161
The first 4 columns compare the CPU and GPU Matlab versions of the svd routine, used to calculate either the singular values only or the full SVD. As can be seen, the GPU version can be significantly slower than the CPU one. The reason has already been pointed out in some of the answers above: the SVD computation is inherently difficult to parallelize.
Using cuSOLVER?
At this point, the obvious question is: can we get some speedup with cuSOLVER? Indeed, we could use mex files to make the cuSOLVER routines run under Matlab. Unfortunately, the situation with cuSOLVER is even worse, as can be deduced from the last two columns of the above table. Those columns report the timing of the codes at "Singular values calculation only with CUDA" and "Parallel implementation for multiple SVDs using CUDA", which use cusolverDnSgesvd for the singular-values-only calculation and the full SVD calculation, respectively. As can be seen, cuSOLVER's cusolverDnSgesvd performs even worse than Matlab, especially if one takes into account that it works in single precision, while Matlab works in double precision.
The reason for this behavior is further explained at "cusolverDnCgesvd performance vs MKL", where Joe Eaton, manager of the cuSOLVER library, says:
I understand the confusion here. We do provide a decent speedup for LU, QR and LDL^t factorizations, which is what we would like to say for SVD as well. Our purpose with cuSOLVER is to provide dense and sparse direct solvers as part of the CUDA toolkit for the first time; we have to start somewhere. Since CULA is no longer supported, we felt it was urgent to get some functionality into the hands of developers in CUDA 7.0. Since CUDA runs on more than x86 host CPUs these days, cuSOLVER fills a need where there is no MKL. That being said, we can do better with SVD, but it will have to wait for the next CUDA release, priorities and timelines being tight already.
Using other libraries
At this point, the remaining possibility is to use other libraries, like
CULA;
MAGMA;
ArrayFire.
CULA is not offered for free, so I have not tried it.
I had some installation issues with the MAGMA dependencies, so I have not investigated this option further (disclaimer: I expect that, with some more time, I could solve such issues).
I finally ended up using ArrayFire.
Using ArrayFire, I had the following timing for the full SVD computation:
Matrix size     Full SVD (timing in s)
200 x 200       0.036
1000 x 1000     0.2
2500 x 2500     4.5
5000 x 5000     29
As can be seen, the timing is slightly higher than in the CPU case, but now comparable to it.
Here is the ArrayFire code:
#include <arrayfire.h>
#include <cstdio>
#include <cstdlib>

using namespace af;

int main(int argc, char *argv[])
{
    const int N = 1000;

    try {
        // --- Select a device and display arrayfire info
        int device = argc > 1 ? atoi(argv[1]) : 0;
        af::setDevice(device);
        af::info();

        // --- Random N x N double precision matrix
        array A = randu(N, N, f64);
        af::array U, S, Vt;

        // --- Warm-up run
        timer time_last = timer::start();
        af::svd(U, S, Vt, A);
        S.eval();
        af::sync();
        double elapsed = timer::stop(time_last);
        printf("elapsed time using start and stop = %g ms \n", 1000.*elapsed);

        // --- Timed run
        time_last = timer::start();
        af::svd(U, S, Vt, A);
        S.eval();
        af::sync();
        elapsed = timer::stop(time_last);
        printf("elapsed time using start and stop = %g ms \n", 1000.*elapsed);
    }
    catch (af::exception& e) {
        fprintf(stderr, "%s\n", e.what());
        throw;
    }

    return 0;
}