Why is MATLAB gpuArray much slower at just adding two matrices?

浪尽此生 submitted on 2019-12-02 07:07:46

Question


I have recently started using MATLAB's GPU (CUDA) support for some very simple matrix calculations on the GPU, but the performance results are very strange. Could anybody help me understand what exactly is going on and how I can solve the issue? Thanks in advance. Please note that the following code was run on a GeForce GTX TITAN Black GPU.

Assume a0, a1, ..., a6 are 1000*1000 gpuArrays, U = 0.5 and V = 0.0.
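The post does not show how the arrays were created. A minimal setup sketch, assuming random double-precision data (the use of rand and the variable n are assumptions, not taken from the question), could look like this:

```matlab
% Hypothetical setup: the question only states the sizes and the values of U and V.
n = 1000;                       % matrix dimension used in the question
a0 = rand(n, 'gpuArray');       % random data is an assumption
a1 = rand(n, 'gpuArray');
a2 = rand(n, 'gpuArray');
a3 = rand(n, 'gpuArray');
a4 = rand(n, 'gpuArray');
a5 = rand(n, 'gpuArray');
a6 = zeros(n, 'gpuArray');      % result array, preallocated on the GPU
U = 0.5;
V = 0.0;
```

For the CPU runs, the same data can be copied back to host memory with gather, for example a0 = gather(a0).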

titan = gpuDevice();
tic();

for i=1:10000
a6(1,1)=(0.5.*(a5(1,1)-a0(1,1)))-(a1(1,1)+a2(1,1)+a3(1,1))-(a5(1,1).*U./3.0)-(a5(1,1).*V./2.0)+(0.25.*a5(1,1).*a4(1,1));  
end

wait(titan);
time = toc()

The result is time = 17.98 seconds.

Now, redefining a0, a1, ..., a6, U and V as ordinary CPU arrays and measuring the time needed:

tic();

for i=1:10000
a6(1,1)=(0.5.*(a5(1,1)-a0(1,1)))-(a1(1,1)+a2(1,1)+a3(1,1))-(a5(1,1).*U./3.0)-(a5(1,1).*V./2.0)+(0.25.*a5(1,1).*a4(1,1));  
end

time= toc()  

The result is time = 0.0098 seconds.

So the CPU is more than 1800 times faster!

Then I decided to do the same calculation on the whole matrix rather than on a single element. Here are the results.

Results for the run on the GPU:

titan = gpuDevice();
tic();
for i=1:10000
a6=(0.5.*(a5-a0))-(a1+a2+a3)-(a5.*U./3.0)-(a5.*V./2.0)+(0.25.*a5.*a4);  
end
wait(titan);
time = toc()   

The result is time = 6.32 seconds, which means that on the GPU the operation on the whole matrix is much faster than the operation on a single element!

Results for the run on the CPU:

tic();
for i=1:10000
a6=(0.5.*(a5-a0))-(a1+a2+a3)-(a5.*U./3.0)-(a5.*V./2.0)+(0.25.*a5.*a4);  
end

time= toc()  

The result is time = 35.2 seconds.

And here is the most surprising result: taking a0, a1, ..., a6, U and V to be just 1*1 gpuArrays and running the following:

titan = gpuDevice();
tic();
for i=1:10000
a6=(0.5.*(a5-a0))-(a1+a2+a3)-(a5.*U./3.0)-(a5.*V./2.0)+(0.25.*a5.*a4);  
end
wait(titan);
time = toc()  

The result is time = 7.8 seconds.

That is even slower than the corresponding 1000*1000 case!

Unfortunately, the line a6(1,1)=(0.5.*(a5(1,1)-a0(1,1)))-(a1(1,1)+a2(1,1)+a3(1,1))-(a5(1,1).*U./3.0)-(a5(1,1).*V./2.0)+(0.25.*a5(1,1).*a4(1,1)); is one of about 100 lines inside a single for-loop, and it has proved to be a real bottleneck, taking about 50% of the total calculation time. Could anybody help me? Note that moving this part of the calculation to the CPU is not an option: the bottleneck line is inside a for-loop, and sending a1, ..., a6 to the CPU and copying the results back to the GPU in every iteration is much more time consuming. Any advice is really appreciated.


Answer 1:


ehsan,

Titan is powerful.

I hope the following might help.

1> A GPU has many (hundreds to thousands of) low-frequency stream cores, and groups of them have to execute the same instructions together. So they are very good at SIMD-style work. If you compute only one element of a matrix (as in the first and the last examples), the GPU is definitely not good at this.

2> For the second test, please include the loop index i in the expression so that the loop body cannot be optimized away. Or you can change 10000 to 50000 and see whether there is a difference.

for i=1:10000
a6=i*(0.5.*(a5-a0))-(a1+a2+a3)-(a5.*U./3.0)-(a5.*V./2.0)+(0.25.*a5.*a4);  
end    

3> The CPU has its own vector processing unit (VPU), which is also aimed at SIMD. The only problem is that it is quite narrow, from 64 to 256 bits. So if the matrix is small, the CPU is much better than the GPU. Therefore, to see the performance benefit of the GPU, you can try a larger dimension, say 5000x5000.
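To check the points above, the time per evaluation can be measured for several matrix sizes. The following is only a sketch, assuming random data and the sizes 1, 1000 and 5000 (both are assumptions). It uses gputimeit from the Parallel Computing Toolbox, which runs the function several times and waits for the GPU to finish before reporting a time.

```matlab
% Sketch: time one evaluation of the expression for several sizes.
U = 0.5;
V = 0.0;
sizes = [1 1000 5000];          % sizes chosen for illustration only
t = zeros(size(sizes));
for s = 1:numel(sizes)
    n = sizes(s);
    a = cell(1, 6);
    for k = 1:6
        a{k} = rand(n, 'gpuArray');   % random data is an assumption
    end
    [a0, a1, a2, a3, a4, a5] = a{:};
    f = @() (0.5.*(a5-a0))-(a1+a2+a3)-(a5.*U./3.0)-(a5.*V./2.0)+(0.25.*a5.*a4);
    t(s) = gputimeit(f);              % seconds per evaluation
    fprintf('n = %4d: %.3g s per evaluation\n', n, t(s));
end
```

If the time for n = 1 is close to the time for n = 1000, the cost is dominated by the fixed overhead of launching each operation rather than by the arithmetic itself.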

Please let me know if you have any further results on this.




Answer 2:


I think your second GPU result (i.e. vectorised GPU calls) is the most pertinent - GPUs are most efficient when operating on large amounts of data in a vectorised fashion. In your case, you can probably get even better performance by converting your expression into an arrayfun call. arrayfun allows MATLAB to convert the entire expression into a single operation on the GPU, which takes best advantage of the (huge) available memory bandwidth of the device.
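A minimal sketch of the arrayfun approach is shown below. It assumes random input data and passes U and V as explicit arguments rather than capturing them from the workspace; the name kernel is hypothetical. Inside the function every variable is a scalar, because arrayfun applies the function to each element.

```matlab
% Sketch: evaluate the whole expression as one fused element-wise GPU operation.
n = 1000;
a0 = rand(n, 'gpuArray');       % random data is an assumption
a1 = rand(n, 'gpuArray');
a2 = rand(n, 'gpuArray');
a3 = rand(n, 'gpuArray');
a4 = rand(n, 'gpuArray');
a5 = rand(n, 'gpuArray');
U = 0.5;
V = 0.0;

% Element-wise function; x0..x5, u and v are scalars inside the call.
kernel = @(x0, x1, x2, x3, x4, x5, u, v) ...
    (0.5.*(x5-x0))-(x1+x2+x3)-(x5.*u./3.0)-(x5.*v./2.0)+(0.25.*x5.*x4);

% The scalars U and V are expanded to match the array inputs.
a6 = arrayfun(kernel, a0, a1, a2, a3, a4, a5, U, V);
```

The first call includes some one-time setup cost for the function, so it is fairer to time the second and later calls.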

As to your problem calculating a6(1,1) - perhaps it might be best to calculate the whole array (i.e. don't index the right-hand-side expressions) and then index afterwards. Something like

tmp = (0.5.*(a5-a0))-(a1+a2+a3)-(a5.*U./3.0)-(a5.*V./2.0)+(0.25.*a5.*a4);
a6(1,1) = tmp(1,1);


Source: https://stackoverflow.com/questions/27363660/why-matlab-gpuarray-is-much-slower-in-just-adding-two-matrices
