Example of increasing the work per thread in CUDA
Problem / Algorithm: I'm writing a program with CUDA, and the problem is the following: there are two matrices, A (n * 128) and B (m * 128). I take the first row of A and compute the distance between that vector and every row of B, one by one. I write the result of each distance into a row of a matrix C, so that element C(i,j) contains the distance between row i of A and row j of B. Then I proceed with the next row of A.

I've implemented it this way: I have a grid of (n * m) blocks, with 128 threads per block.
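A minimal sketch of the layout described above, assuming squared Euclidean distance, row-major storage, and a 2D grid of dim3(n, m) (i.e. n * m blocks in total); the kernel and variable names are illustrative, not taken from the original code:

```cuda
#include <cuda_runtime.h>

// One block per (i, j) pair, 128 threads per block,
// each thread handling one of the 128 vector components.
__global__ void pairwiseDistance(const float* A, const float* B, float* C,
                                 int n, int m)
{
    int i = blockIdx.x;   // row of A
    int j = blockIdx.y;   // row of B
    int k = threadIdx.x;  // component index, 0..127

    __shared__ float diff[128];

    // Each thread computes one squared component difference.
    float d = A[i * 128 + k] - B[j * 128 + k];
    diff[k] = d * d;
    __syncthreads();

    // Tree reduction in shared memory to sum the 128 partial results.
    for (int stride = 64; stride > 0; stride >>= 1) {
        if (k < stride)
            diff[k] += diff[k + stride];
        __syncthreads();
    }

    if (k == 0)
        C[i * m + j] = diff[0];  // or sqrtf(diff[0]) for Euclidean distance
}
```

With this layout each block does very little arithmetic relative to its launch and synchronization overhead, which is the motivation for increasing the work per thread discussed in the answers.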