Question
When I run the code shown below, the tic/toc pair inside the function reports a very short time (<< 1 s) to go through all the lines. However, the outer tic/toc pair shows that it actually takes around 2.3 s to get the outputs.
tic
rnn.v = 11;
rnn.h = 101;
rnn.o = 7;
rnn.h_init = randn(1,rnn.h,'gpuArray');
rnn.W_vh = randn(rnn.v,rnn.h,'gpuArray');
rnn.W_hh = randn(rnn.h,rnn.h,'gpuArray');
rnn.W_ho = randn(rnn.h,rnn.o,'gpuArray');
inData.V = randn(10000,11,100,'gpuArray');
inData.TimeSteps =100;
inData.BatchSize = 10000;
[H,OX] = forward_pass(rnn, inData)
toc
All the matrices in rnn and inData are gpuArrays, so all the calculations are carried out on the GPU. The outputs are also gpuArrays.
function [H,OX] = forward_pass(rnn, inData)
    tic;
    %initial hidden state values
    H_init = gpuArray(repmat(rnn.h_init,[inData.BatchSize,1]));
    %initialize state H
    H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');
    %initialize OX (which is H * Who)
    OX = zeros(inData.BatchSize, rnn.o, inData.TimeSteps,'gpuArray');
    for t = 1 : inData.TimeSteps
        if t == 1
            HX_t = H_init * rnn.W_hh...
                + inData.V(:,:,t) * rnn.W_vh;
        else
            HX_t = H(:,:,(t-1)) * rnn.W_hh...
                + inData.V(:,:,t) * rnn.W_vh;
        end
        H(:,:,t) = tanh(HX_t);
        OX(:,:,t) = H(:,:,t) * rnn.W_ho;
    end
    toc;
end
Normally, calling gather() is slow. I didn't use gather() to transfer the outputs to the workspace, so I don't know why it is still so slow. It looks like the last line, "end", takes more than 2 s.
Does anyone know how to speed up the function call?
Answer 1:
First off, for proper benchmarking you do need to use gather, either inside the function call or afterwards. GPU operations in MATLAB can run asynchronously, so without it toc may fire before the computation has actually finished. In the former case, the function returns a non-GPU output; in the latter case, it returns a GPU-based datatype. Now, back to your problem: you are using very few TimeSteps, so any optimization you try won't make a large difference. Here's an optimized version that shows a growing performance gain as you increase TimeSteps:
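function [H,OX] = forward_pass(rnn, inData)
    H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');
    % Project the inputs of all time steps with a single matrix multiply;
    % rows (t-1)*BatchSize+1 : t*BatchSize of T belong to time step t
    T = reshape(permute(inData.V,[1 3 2]),[],size(inData.V,2))*rnn.W_vh;
    % First step: broadcast the initial hidden state across the batch
    H(:,:,1) = tanh(bsxfun(@plus,rnn.h_init * rnn.W_hh,T(1:size(inData.V,1),:)));
    for t = 2 : inData.TimeSteps
        H(:,:,t) = tanh( H(:,:,(t-1))*rnn.W_hh + ...
            T((t-1)*size(inData.V,1)+1: t*size(inData.V,1),:));
    end
    % Output projection for all time steps at once, then restore the
    % BatchSize-by-o-by-TimeSteps layout
    A = reshape(permute(H,[1 3 2]),[],size(H,2))*rnn.W_ho;
    OX = permute(reshape(A,size(H,1),size(A,1)/size(H,1),[]),[1 3 2]);
end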
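As a sanity check on the reshape/permute trick used above, here is a minimal sketch that compares the per-time-step loop with the batched version. It runs on plain CPU arrays so that no GPU is required, and the sizes and variable names (B, v, h, o, T, VW, H_ref, OX_ref) are hypothetical values chosen for illustration, not taken from the original post.

```matlab
% Small, hypothetical sizes so the check runs quickly on the CPU.
B = 4;  v = 3;  h = 5;  o = 2;  T = 6;

V      = randn(B, v, T);
W_vh   = randn(v, h);
W_hh   = randn(h, h);
W_ho   = randn(h, o);
h_init = randn(1, h);

% Reference: one input and one output multiply per time step,
% as in the code from the question.
H_ref  = zeros(B, h, T);
OX_ref = zeros(B, o, T);
H_prev = repmat(h_init, [B, 1]);
for t = 1:T
    H_ref(:,:,t)  = tanh(H_prev * W_hh + V(:,:,t) * W_vh);
    OX_ref(:,:,t) = H_ref(:,:,t) * W_ho;
    H_prev = H_ref(:,:,t);
end

% Batched: a single multiply covers the input projections of all steps.
VW = reshape(permute(V, [1 3 2]), [], v) * W_vh;   % (B*T)-by-h
H  = zeros(B, h, T);
H_prev = repmat(h_init, [B, 1]);
for t = 1:T
    rows = (t-1)*B + 1 : t*B;                      % rows of time step t
    H(:,:,t) = tanh(H_prev * W_hh + VW(rows, :));
    H_prev = H(:,:,t);
end

% A single multiply covers the output projections of all steps.
A  = reshape(permute(H, [1 3 2]), [], h) * W_ho;   % (B*T)-by-o
OX = permute(reshape(A, B, T, o), [1 3 2]);        % B-by-o-by-T

errH  = max(abs(H(:)  - H_ref(:)));
errOX = max(abs(OX(:) - OX_ref(:)));
fprintf('max |dH| = %g, max |dOX| = %g\n', errH, errOX);
```

Only the recurrent term H(:,:,t-1) * W_hh depends on the previous step, which is why it is the only multiply that has to stay inside the loop.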
Benchmarking
Test Case #1
Parameters
rnn.v = 11;
rnn.h = 5;
rnn.o = 7;
inData.TimeSteps = 10000;
inData.BatchSize = 10;
Results
---- Original Code :
Elapsed time is 5.678876 seconds.
---- Modified Code :
Elapsed time is 3.821059 seconds.
Test Case #2
Parameters
inData.TimeSteps = 50000; (rest are same as in Test Case #1)
Results
---- Original Code :
Elapsed time is 28.392290 seconds.
---- Modified Code :
Elapsed time is 19.031776 seconds.
These timings (roughly a 1.5x speedup in both test cases) were measured on a GTX 750 Ti.
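For anyone reproducing timings like these, the sketch below shows one way to make tic/toc account for the GPU work, in line with the remark about gather at the top of this answer. It assumes the Parallel Computing Toolbox and a supported GPU. The matrix size and the variable names (n, tNaive, tSync, tRobust) are hypothetical, and a single matrix multiply stands in for forward_pass.

```matlab
% Assumes Parallel Computing Toolbox and a supported GPU are available.
n = 2000;                          % hypothetical problem size
A = randn(n, n, 'gpuArray');
B = randn(n, n, 'gpuArray');
gpu = gpuDevice;                   % handle to the currently selected GPU

% Naive timing: the multiply may only be queued when toc is reached,
% so this can under-report the real cost.
wait(gpu);
tic;
C = A * B;
tNaive = toc;

% Synchronized timing: wait(gpu) blocks until all queued GPU work is done.
wait(gpu);
tic;
C = A * B;
wait(gpu);
tSync = toc;

% gputimeit synchronizes internally and repeats the call several times.
tRobust = gputimeit(@() A * B);

fprintf('naive %.4f s, synchronized %.4f s, gputimeit %.4f s\n', ...
    tNaive, tSync, tRobust);
```

Calling gather on the result forces a similar synchronization, but it also includes the device-to-host transfer in the measurement.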
Source: https://stackoverflow.com/questions/25468639/matlab-is-slow-when-using-user-defined-function-with-calculation-in-gpu