High global memory instruction overhead - no idea where it comes from

試著忘記壹切 submitted on 2020-01-05 07:56:11

Question


I wrote a kernel that computes euclidean distances between a given D-dimensional vector q (stored in constant memory) and an array pts of N vectors (also D-dimensional).

The array layout in memory is such that the first N elements are the first coordinates of all N vectors, then a sequence of N second coordinates and so on.

Here is the kernel:

__constant__ float q[20];

__global__ void compute_dists(float *pts, float *dst,
        int n, int d) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; 
             i += gridDim.x * blockDim.x) {
        float ld = 0;
        for (int j = 0; j < d; ++j) {
            float tmp = (q[j] - pts[j * n + i]); 
            ld += tmp * tmp;
        }
        dst[i] = ld;
    }
}

It is invoked as follows:

const int N = 1000000, D = 20;
compute_dists<<<32, 512>>>(vecs, dists, N, D);
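For completeness, the constant-memory vector `q` must be uploaded before the launch. A sketch of the host-side setup (the buffer names `vecs`, `dists`, and `host_q` are assumptions, not from the original post):

```cuda
// Host-side setup sketch: allocate device buffers, upload the query
// vector to constant memory, then launch the kernel.
float *vecs, *dists;
cudaMalloc(&vecs,  N * D * sizeof(float));
cudaMalloc(&dists, N * sizeof(float));

float host_q[D] = { /* query vector coordinates */ };
cudaMemcpyToSymbol(q, host_q, D * sizeof(float));

// ... copy the point data into vecs (structure-of-arrays layout) ...

compute_dists<<<32, 512>>>(vecs, dists, N, D);
cudaDeviceSynchronize();
```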

Now, profiling this kernel using NVIDIA Visual Profiler on Quadro K1000M results in warnings about

  • High instruction replay overhead (31.2%) and
  • High global memory instruction overhead (31.2%).

This is very surprising to me because, as far as I can tell, the memory access is coalesced (j * n + i is always a multiple of 32 for the first thread in each warp, which gives us 128-byte alignment) and there is no branch divergence.

Are there other factors that contribute to the instruction replay overhead metric, or am I missing something else?


Answer 1:


I think you are suffering from a high TLB (Translation Lookaside Buffer) miss rate, which comes from the access pattern of "pts[j * n + i]". Because n is large, consecutive j-th elements have a high probability of not lying in an already-loaded memory page. When the page translation for a given address is not in the TLB, the hardware must fetch it, which has high latency, and this leads to memory load instruction replays: a memory load instruction is reissued if the data is not present in the cache or if the page is not loaded into the TLB. I'm not entirely sure about the latter, but this might be the case. Hope it helps. I have the same problem, only more severe (97% replay overhead), so my question might answer yours as well.



Source: https://stackoverflow.com/questions/15993416/high-global-memory-instruction-overhead-no-idea-where-it-comes-from
