High global memory instruction overhead - no idea where it comes from
Question

I wrote a kernel that computes Euclidean distances between a given D-dimensional vector q (stored in constant memory) and an array pts of N vectors (also D-dimensional). The array is laid out in memory so that the first N elements are the first coordinates of all N vectors, followed by the N second coordinates, and so on. Here is the kernel:

    __constant__ float q[20];

    __global__ void compute_dists(float *pt, float *dst, int n, int d) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i <
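To make the memory layout described in the question concrete, here is a minimal host-side sketch in plain C++. It is not the kernel from the question, which is cut off above. The function name `reference_dists` and the use of `std::vector` are assumptions made purely for illustration, and it assumes the true Euclidean distance (with the square root) is wanted. The point it shows is the indexing: coordinate `j` of vector `i` lives at `pts[j * n + i]`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, NOT the original kernel.
// Layout assumption taken from the question: pts is coordinate-major,
// so coordinate j of vector i is stored at pts[j * n + i].
std::vector<float> reference_dists(const std::vector<float>& pts,
                                   const std::vector<float>& q,
                                   std::size_t n, std::size_t d) {
    std::vector<float> dst(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (std::size_t j = 0; j < d; ++j) {
            // For a fixed j, neighbouring i values touch neighbouring addresses.
            float diff = pts[j * n + i] - q[j];
            sum += diff * diff;
        }
        dst[i] = std::sqrt(sum);  // assumption: distance, not squared distance
    }
    return dst;
}
```

With this layout, if each GPU thread handles one index `i`, the threads of a warp read consecutive floats for every fixed `j`, which is the access pattern that global memory coalescing favours. A reference like this can also be used to check a kernel's output on small inputs.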