I am aware of multiple questions on this topic; however, I haven't seen any clear answers or any benchmark measurements. I thus created a simple program that works with tw
I am answering my own question, since I found the following post from the Intel Developer Forum, which makes sense to me. It was written by John McCalpin:
The results for the mainstream processors are not surprising -- in the absence of true "scratchpad" memory, it is not clear that it is possible to design an implementation of "non-temporal" behavior that is not subject to nasty surprises. Two approaches that have been used in the past are (1) loading the cache line, but marking it LRU instead of MRU, and (2) loading the cache line into one specific "set" of the set-associative cache. In either case it is relatively easy to generate situations in which the cache drops the data before the processor completes reading it.
Both of these approaches risk performance degradation in cases that operate on more than a small number of arrays, and are much more difficult to implement without "gotchas" when HyperThreading is considered.
In other contexts I have argued for the implementation of "load multiple" instructions that would guarantee that the entire contents of a cache line would be copied to registers atomically. My reasoning is that the hardware absolutely guarantees that the cache line is moved atomically, and that the time required to copy the remainder of the cache line to registers is so small (an extra 1-3 cycles, depending on the processor generation) that it could be safely implemented as an atomic operation.
Starting with Haswell, the core can read 64 Bytes in a single cycle (two aligned 256-bit AVX reads), so the exposure to unintended side effects becomes even lower.
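To make that access pattern concrete (this sketch is mine, not part of the quoted post), here is how one 64-byte cache line can be read with the two aligned 256-bit AVX loads the paragraph above refers to; compile with e.g. gcc -mavx:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

/* 64-byte alignment so both halves land in the same cache line. */
static _Alignas(64) uint64_t line[8] = {1, 2, 3, 4, 5, 6, 7, 8};

int main(void)
{
    /* Two aligned 256-bit AVX loads cover the whole 64-byte line. */
    __m256i lo = _mm256_load_si256((const __m256i *)&line[0]); /* bytes  0-31 */
    __m256i hi = _mm256_load_si256((const __m256i *)&line[4]); /* bytes 32-63 */

    uint64_t out[8];
    _mm256_storeu_si256((__m256i *)&out[0], lo);
    _mm256_storeu_si256((__m256i *)&out[4], hi);
    printf("%llu %llu\n", (unsigned long long)out[0], (unsigned long long)out[7]);
    return 0;
}
```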
Starting with KNL, full-cache-line (aligned) loads should be "naturally" atomic, since the transfers from the L1 Data Cache to the core are full cache lines and all of the data is placed into the target AVX-512 register. (This does not mean that Intel guarantees atomicity in the implementation! We don't have visibility into the horrible corner cases that the designers have to account for, but it is reasonable to conclude that most of the time aligned 512-bit loads will occur atomically.) With this "natural" 64-Byte atomicity, some of the tricks used in the past for reducing cache pollution due to "non-temporal" loads may deserve another look....
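Again as my own illustration rather than McCalpin's: on AVX-512 hardware, the same cache line is covered by a single aligned 512-bit load into one ZMM register; compile with e.g. gcc -mavx512f:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

static _Alignas(64) uint64_t line[8] = {1, 2, 3, 4, 5, 6, 7, 8};

int main(void)
{
    /* One aligned 512-bit load moves the entire 64-byte cache line
       into a single ZMM register in one shot. */
    __m512i v = _mm512_load_si512((const void *)line);
    printf("sum = %llu\n", (unsigned long long)_mm512_reduce_add_epi64(v));
    return 0;
}
```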
The MOVNTDQA instruction is intended primarily for reading from address ranges that are mapped as "Write-Combining" (WC), and not for reading from normal system memory that is mapped "Write-Back" (WB). The description in Volume 2 of the SWDM says that an implementation "may" do something special with MOVNTDQA for WB regions, but the emphasis is on the behavior for the WC memory type.
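For reference (my sketch, not from the quoted post), MOVNTDQA is exposed in C through the SSE4.1 intrinsic _mm_stream_load_si128. On ordinary WB memory, as in this example, an implementation is allowed to treat it like a plain load; the special behavior it was designed for applies to WC regions. Compile with e.g. gcc -msse4.1:

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* MOVNTDQA requires 16-byte-aligned source addresses. */
static _Alignas(16) uint8_t src[64];

void copy_line_nt(uint8_t *dst)
{
    for (int i = 0; i < 64; i += 16) {
        /* Compiles to MOVNTDQA: a load with a non-temporal hint. */
        __m128i v = _mm_stream_load_si128((__m128i *)&src[i]);
        _mm_storeu_si128((__m128i *)&dst[i], v);
    }
}

int main(void)
{
    uint8_t dst[64];
    memset(src, 0xAB, sizeof src);
    copy_line_nt(dst);
    return dst[63] == 0xAB ? 0 : 1;
}
```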
The "Write-Combining" memory type is almost never used for "real" memory --- it is used almost exclusively for Memory-Mapped IO regions.
See here for the whole post: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/597075