While profiling my kernels in Visual Profiler on Kepler hardware, I’ve noticed the profiler shows that global loads and stores are cached in L1. I\'m confused because the pr
On Fermi and Kepler architectures all generic, global, local, and shared memory operations are handled by the L1 cache. Shared memory accesses do not require a tag look up and do not invalidate a cache line. All local and global memory accesses require a tag look up. Uncached global memory stores and reads will invalidate a cache line. On compute capability 3.0 and 3.5 all global memory reads with exception to LDG on CC 3.5 will be uncached. LDG instruction goes through the texture cache.