Can't reach peak performance
Question

I'm trying to reach the peak performance of each SM with the code below. The peak lies somewhere around 25 GFLOPS (GTX 275, GT200 architecture). This code gives 8 GFLOPS at most.

    // Launch configuration:
    //   No. of blocks     = 1
    //   Threads per block = 512 (I'm using GTX 275 - GT200 Arch.)
    #define LOOP 10000000

    __global__ void new_ker(float *x)
    {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        float a, b;
        a = 0;
        b = x[index];

        // Single dependent chain: each iteration needs the previous value of a.
        #pragma unroll 2048
        for (int i = 0; i < LOOP; i++) {
            a = a * b + b;
        }
        x[index] = a;
    }

I don't want to increase ILP in the code. Any