Recently ,I use Cuda to write an algorithm called 'orthogonal matching pursuit' . In my ugly Cuda code the entire iteration takes 60 sec , and Eigen lib takes just 3 sec...
In my code Matrix A is [640,1024] and y is [640,1] , in each step I select some vectors from A to compose a new Matrix called A_temp [640,itera], iter=1:500 . I new a array MaxDex_Host[] in cpu to tell which column to select .
I want to get x_temp[itera,1] from A_temp*x_temp=y using least-square , I use a cula API 'culaDeviceSgels' and cublas matrix-vector multiplication API.
So the culaDeviceSgels would call 500 times , and I think this would be faster than Eigen lib's QR.Sovler .
I check the Nisight performence anlysis , I found the custreamdestory takes a long time . I initial cublas before iteration and destory it after I get the result . So I want to know the what is the custreamdestory , different with cublasdestory?
The main problem is memcpy and function 'gemm_kernel1x1val' . I think this function is from 'culaDeviceSgels'
while(itera<500): I use cublasSgemv and cublasIsamax to get MaxDex_Host[itera] , then
MaxDex_Host[itera]=pos;
itera++;
float* A_temp_cpu=new float[M*itera]; // matrix all in col-major
for (int j=0;j<itera;j++) // to get A_temp [M,itera] , the MaxDex_Host[] shows the positon of which column of A to chose ,
{
for (int i=0;i<M;i++) //M=640 , and A is 640*1024 ,itera is add 1 each step
{
A_temp_cpu[j*M+i]=A[MaxDex_Host[j]*M+i];
}
}
// I must allocate one more array because culaDeviceSgels will decompose the one input Array , and I want to use A_temp after least-square solving.
float* A_temp_gpu;
float* A_temp2_gpu;
cudaMalloc((void**)&A_temp_gpu,Size_float*M*itera);
cudaMalloc((void**)&A_temp2_gpu,Size_float*M*itera);
cudaMemcpy(A_temp_gpu,A_temp_cpu,Size_float*M*itera,cudaMemcpyHostToDevice);
cudaMemcpy(A_temp2_gpu,A_temp_gpu,Size_float*M*itera,cudaMemcpyDeviceToDevice);
culaDeviceSgels('N',M,itera,1,A_temp_gpu,M,y_Gpu_temp,M);// the x_temp I want is in y_Gpu_temp's return value , stored in the y_Gpu_temp[0]——y_Gpu_temp[itera-1]
float* x_temp;
cudaMalloc((void**)&x_temp,Size_float*itera);
cudaMemcpy(x_temp,y_Gpu_temp,Size_float*itera,cudaMemcpyDeviceToDevice);
Cuda's memory manage seems too complex , is there any other convenience method to solve least-square?
I think that custreamdestory
and gemm_kernel1x1val
are internally called by the APIs you are using, so there is not much to do with them.
To improve your code, I would suggest to do the following.
- You can get rid of
A_temp_cpu
by keeping a device copy of the matrixA
. Then you can copy the rows ofA
into the rows ofA_temp_gpu
andA_temp2_gpu
by a kernel assignment. This would avoid performing the first twocudaMemcpy
s. - You can preallocate
A_temp_gpu
andA_temp2_gpu
outside thewhile
loop by using the maximum possible value ofitera
instead ofitera
. This will avoid the first twocudaMalloc
s inside the loop. The same applies tox_temp
. - As long as I know,
culaDeviceSgels
solves a linear system of equations. I think you can do the same also by using cuBLAS APIs only. For example, you can perform an LU factorization first bycublasDgetrfBatched()
and then usecublasStrsv()
two times to solve the two arising linear systems. You may wish to see if this solution leads to a faster algorithm.
来源:https://stackoverflow.com/questions/22284533/cuda-least-square-solving-poor-in-speed