Understanding `_mm_prefetch`

Submitted by 雨燕双飞 on 2021-02-10 00:24:39

Question


The answer to What are _mm_prefetch() locality hints? goes into detail on what each hint means.

My question is: which one do I WANT?

I work on a function that is called repeatedly, billions of times, with an int parameter among others. The first thing I do is look up a cached value, using that parameter (its low 32 bits) as the key into a 4 GB cache. From the algorithm this function is called from, I know that most often the key will be doubled (shifted left by 1 bit) from one call to the next, so I am doing:

int foo(int key) {
  uint32_t k = (uint32_t)key;  // index with the low 32 bits only
  uint8_t value = cache[k];
  // k * 2 wraps within 32 bits, matching the 4 GB cache size
  _mm_prefetch((const char *)&cache[k * 2], _MM_HINT_T2);
  // ...

The goal is to have this value in a processor cache by the next call to this function.

I am looking for confirmation on my understanding of two points:

  1. The call to _mm_prefetch is not going to delay the processing of the instructions immediately following it.
  2. There is no penalty for prefetching the wrong location, just a lost benefit from guessing it right.

That function also uses a lookup table of 128 128-bit values (2 KB total). Is there a way to "force" it to stay cached? The index into that lookup table is incremented sequentially; should I prefetch it too? Should I use a different hint, to target a different cache level? What is the best strategy here?


Answer 1:


As I noted in the comments, there's some risk to prefetching the wrong address: a useful line may be evicted from the cache, potentially causing a later cache miss.

That said:

_mm_prefetch compiles into the PREFETCHn instruction. I looked the instruction up in the AMD64 Architecture Programmer's Manual published by AMD. (Note that all of this information is necessarily implementation-specific; you may need to consult your own CPU vendor's documentation.)

AMD says (my emphasis):

The operation of this instruction is implementation-dependent. The processor implementation can ignore or change this instruction. The size of the cache line also depends on the implementation, with a minimum size of 32 bytes. AMD processors alias PREFETCH1 and PREFETCH2 to PREFETCH0.

What that appears to mean is that if you're running on an AMD processor, the hint is ignored and the memory is loaded into all levels of the cache, unless the hint is NTA (Non-Temporal Access), which attempts to load the memory with minimal cache pollution.

Here's the full page for the instruction

I think, in the end, the guidance is what the other answer says: brainstorm, implement, test, and measure. You're on the bleeding edge of performance here, and there's not going to be a one-size-fits-all answer.

Another resource that may help you is Agner Fog's Optimization manuals, which will help you optimize for your specific CPU.




Answer 2:


If you do anything performance-related, the ultimate way to know what works is to try it. Fortunately, you know exactly what to try, and there are only a few possibilities.

Regarding your understanding: yes, it is correct. However, nothing is completely free; any instruction you add to your code costs the processor some time to execute, prefetches included. So you should verify your prefetching idea by measuring performance before and after. For very irregular access patterns like yours, it is quite likely to pay off.

Regarding prefetching sequential data: you should probably not bother. Caches hold data at 64-byte line granularity, so for sequential access a single fetch already brings in the next several elements. In addition, modern CPUs have hardware prefetchers that detect sequential patterns and fetch ahead without being told to.



Source: https://stackoverflow.com/questions/65604355/understanding-mm-prefetch
