High global memory instruction overhead - no idea where it comes from

試著忘記壹切 submitted on 2020-01-05 07:56:11

Question


I wrote a kernel that computes euclidean distances between a given D-dimensional vector q (stored in constant memory) and an array pts of N vectors (also D-dimensional).

The array layout in memory is such that the first N elements are the first coordinates of all N vectors, then a sequence of N second coordinates and so on.

Here is the kernel:

__constant__ float q[20];

__global__ void compute_dists(float *pts, float *dst,
        int n, int d) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; 
             i += gridDim.x * blockDim.x) {
        float ld = 0;
        for (int j = 0; j < d; ++j) {
            float tmp = (q[j] - pts[j * n + i]); 
            ld += tmp * tmp;
        }
        dst[i] = ld;
    }
}

It is invoked as follows:

const int N = 1000000, D = 20;
compute_dists<<<32, 512>>>(vecs, dists, N, D);
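For completeness, the constant-memory vector `q` must be uploaded before the launch. A sketch of the host-side setup (the buffer names `vecs`, `dists`, and `host_q` are assumptions, not from the original post):

```cuda
// Host-side setup sketch: allocate device buffers, upload the query
// vector to constant memory, then launch the kernel.
float *vecs, *dists;
cudaMalloc(&vecs,  N * D * sizeof(float));
cudaMalloc(&dists, N * sizeof(float));

float host_q[D] = { /* query vector coordinates */ };
cudaMemcpyToSymbol(q, host_q, D * sizeof(float));

// ... copy the point data into vecs (structure-of-arrays layout) ...

compute_dists<<<32, 512>>>(vecs, dists, N, D);
cudaDeviceSynchronize();
```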

Now, profiling this kernel using NVIDIA Visual Profiler on Quadro K1000M results in warnings about

  • High instruction replay overhead (31.2%) and
  • High global memory instruction overhead (31.2%).

This is very surprising to me because, as far as I can tell, the memory access is coalesced (j * n + i is always a multiple of 32 for the first thread in each warp, which gives us 128-byte alignment) and there is no branch divergence.

Are there other factors that contribute to the instruction replay overhead metric, or am I missing something else?


Answer 1:


I think you are suffering from a high TLB (Translation Lookaside Buffer) miss rate, which comes from the access pattern of "pts[j * n + i]". Because n is large, consecutive j-th elements have a high probability of not lying in an already-loaded memory page. When the page translation for a given address is not in the TLB, the hardware must fetch it, which has high latency, and this leads to memory load instruction replays: a memory load instruction is reissued if the data is not present in the cache or if the page is not loaded into the TLB. I'm not entirely sure about the latter, but this might be the case. Hope it helps. I have the same problem, only more severe (97% replay overhead), so my question might answer yours as well.



Source: https://stackoverflow.com/questions/15993416/high-global-memory-instruction-overhead-no-idea-where-it-comes-from
