I am working on data prefetch in CUDA (Fermi GPU) through C code. Cuda reference manual talks about the prefetching at ptx level code not at C level code.
Can anyone connect me with some documents or something regarding prefetching through cuda code (cu file). Any help would be appreciated.
According to PTX manual here is how prefetch works in PTX:
You can embed the PTX instructions into the CUDA kernel. Here is a tiny sample from NVIDIA's documentation:
__device__ int cube (int x)
{
int y;
asm("{\n\t" // use braces for local scope
" .reg .u32 t1;\n\t" // temp reg t1,
" mul.lo.u32 t1, %1, %1;\n\t" // t1 = x * x
" mul.lo.u32 %0, t1, %1;\n\t" // y = t1 * x
"}"
: "=r"(y) : "r" (x));
return y;
}
You may come to conclude with the following prefetch function in C:
__device__ void prefetch_l1 (unsigned int addr)
{
asm(" prefetch.global.L1 [ %1 ];": "=r"(addr) : "r"(addr));
}
NOTICE: You need the GPU of Compute Capability 2.0 or higher for prefetch. Pass proper compile flags accordingly -arch=sm_20
According to this thread, below is the code for different cache prefetching techniques:
#define DEVICE_STATIC_INTRINSIC_QUALIFIERS static __device__ __forceinline__
#if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__)
#define PXL_GLOBAL_PTR "l"
#else
#define PXL_GLOBAL_PTR "r"
#endif
DEVICE_STATIC_INTRINSIC_QUALIFIERS void __prefetch_global_l1(const void* const ptr)
{
asm("prefetch.global.L1 [%0];" : : PXL_GLOBAL_PTR(ptr));
}
DEVICE_STATIC_INTRINSIC_QUALIFIERS void __prefetch_global_uniform(const void* const ptr)
{
asm("prefetchu.L1 [%0];" : : PXL_GLOBAL_PTR(ptr));
}
DEVICE_STATIC_INTRINSIC_QUALIFIERS void __prefetch_global_l2(const void* const ptr)
{
asm("prefetch.global.L2 [%0];" : : PXL_GLOBAL_PTR(ptr));
}
来源:https://stackoverflow.com/questions/13265993/prefetch-in-cuda-through-c-code