Do current x86 architectures support non-temporal loads (from “normal” memory)?


To answer specifically the headline question:

Yes, recent[1] mainstream Intel CPUs support non-temporal loads on normal[2] memory - but only "indirectly", via non-temporal prefetch instructions, rather than directly via non-temporal load instructions like movntdqa. This is in contrast to non-temporal stores, where you can just use the corresponding non-temporal store instructions[3] directly.

The basic idea is that you issue a prefetchnta for the cache line before any normal loads, and then issue the loads as normal. If the line wasn't already in the cache, it will be loaded in a non-temporal fashion. The exact meaning of "non-temporal fashion" depends on the architecture, but the general pattern is that the line is loaded into at least the L1 and perhaps some higher cache levels. Indeed, for a prefetch to be of any use it needs to cause the line to be loaded into at least some cache level for consumption by a later load. The line may also be treated specially in the cache, for example by flagging it as high priority for eviction or by restricting the ways in which it can be placed.
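
As a concrete illustration of that pattern, here is a minimal sketch using C intrinsics; the function name and the assumption of a 64-byte-aligned line are mine, not from the answer. _mm_prefetch with the _MM_HINT_NTA hint compiles to prefetchnta, and the subsequent reads are just ordinary loads:

    #include <immintrin.h>
    #include <stdint.h>

    /* Prefetch one cache line with the NTA hint, then read it with plain
       loads.  In real code you would issue the prefetch well before the
       loads so it has time to complete (see the loop sketch further down). */
    uint64_t sum_one_line_nta(const uint64_t *line)   /* assumed 64-byte aligned */
    {
        _mm_prefetch((const char *)line, _MM_HINT_NTA);  /* compiles to prefetchnta */

        uint64_t s = 0;
        for (int i = 0; i < 8; i++)    /* 8 x 8 bytes = one 64-byte cache line */
            s += line[i];              /* ordinary loads; the line arrives "non-temporally" */
        return s;
    }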

The upshot of all this is that while non-temporal loads are supported in a sense, they are really only partly non-temporal, unlike stores, where you truly leave no trace of the line in any of the cache levels. Non-temporal loads will cause some cache pollution, but generally less than regular loads. The exact details are architecture-specific, and I've included some details below for modern Intel (you can find a slightly longer writeup in this answer).

Skylake Client

Based on the tests in this answer, it seems that the behavior of prefetchnta on Skylake is to fetch normally into the L1 cache, to skip the L2 entirely, and to fetch in a limited way into the L3 cache (probably into 1 or 2 ways only, so the total amount of L3 available to nta prefetches is limited).

This was tested on Skylake client, but I believe this basic behavior probably extends backwards to Sandy Bridge and earlier (based on wording in the Intel optimization guide), and also forwards to Kaby Lake and later architectures based on Skylake client. So unless you are using a Skylake-SP or Skylake-X part, or an extremely old CPU, this is probably the behavior you can expect from prefetchnta.

Skylake Server

The only recent Intel chip known to have different behavior is Skylake server (used in Skylake-X, Skylake-SP and a few other lines). This has a considerably changed L2 and L3 architecture, and the L3 is no longer inclusive of the much larger L2. For this chip, it seems that prefetchnta skips both the L2 and L3 caches, so on this architecture cache pollution is limited to the L1.

This behavior was reported by user Mysticial in a comment. The downside, as pointed out in those comments, is that this makes prefetchnta much more brittle: if you get the prefetch distance or timing wrong (especially easy when hyperthreading is involved and the sibling core is active) and the data gets evicted from L1 before you use it, you go all the way back to main memory rather than to the L3 as on earlier architectures.
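
To make the prefetch-distance issue concrete, here is a hedged sketch of the usual streaming loop. PF_DIST_LINES is a made-up tuning constant, not a recommended value; the right distance depends on the CPU and on what else (including a hyperthread sibling) is running:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Too small a distance and the prefetch hasn't completed when the loads
       arrive; too large and (on Skylake server, where the line lives only in
       L1) it may already have been evicted by the time you read it. */
    #define PF_DIST_LINES 8   /* prefetch 8 lines (512 bytes) ahead; tune per CPU */

    uint64_t sum_stream_nta(const uint64_t *buf, size_t n)   /* n a multiple of 8 */
    {
        uint64_t s = 0;
        for (size_t i = 0; i < n; i += 8) {   /* one 64-byte line per iteration */
            /* Prefetching past the end of the buffer is harmless: prefetches
               never fault. */
            _mm_prefetch((const char *)&buf[i + 8 * PF_DIST_LINES], _MM_HINT_NTA);
            for (size_t j = 0; j < 8; j++)
                s += buf[i + j];
        }
        return s;
    }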


[1] Recent here probably means anything in the last decade or so, but I don't mean to imply that earlier hardware didn't support non-temporal prefetch: it's possible that support goes right back to the introduction of prefetchnta, but I don't have the hardware to check that and can't find an existing reliable source of information on it.

[2] Normal here just means WB (writeback) memory, which is the memory you are dealing with at the application level the overwhelming majority of the time.

[3] Specifically, the NT store instructions are movnti for general-purpose registers and the movntd* and movntp* families for SIMD registers.
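
For reference, those store instructions map onto intrinsics as follows; a minimal sketch, with function and parameter names of my own choosing:

    #include <immintrin.h>
    #include <stdint.h>

    /* Illustrative NT stores from footnote [3].  The vector destinations
       must be suitably aligned (32 bytes for the 256-bit store, 16 bytes
       for movntps). */
    void nt_store_examples(int *p32, __m256i *pvec, float *pf, __m256i v, __m128 f)
    {
        _mm_stream_si32(p32, 42);      /* movnti: NT store from a general-purpose register */
        _mm256_stream_si256(pvec, v);  /* vmovntdq: NT store of a 256-bit integer vector   */
        _mm_stream_ps(pf, f);          /* movntps: NT store of four packed floats          */
        _mm_sfence();                  /* order the NT stores before any later ordinary stores */
    }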

I'm answering my own question, since I found the following post on the Intel Developer Forums, which makes sense to me. It was written by John McCalpin:

The results for the mainstream processors are not surprising -- in the absence of true "scratchpad" memory, it is not clear that it is possible to design an implementation of "non-temporal" behavior that is not subject to nasty surprises. Two approaches that have been used in the past are (1) loading the cache line, but marking it LRU instead of MRU, and (2) loading the cache line into one specific "set" of the set-associative cache. In either case it is relatively easy to generate situations in which the cache drops the data before the processor completes reading it.

Both of these approaches risk performance degradation in cases operating on more than a small number of arrays, and are made much more difficult to implement without "gotchas" when HyperThreading is considered.

In other contexts I have argued for the implementation of "load multiple" instructions that would guarantee that the entire contents of a cache line would be copied to registers atomically. My reasoning is that the hardware absolutely guarantees that the cache line is moved atomically and that the time required to copy the remainder of the cache line to registers was so small (an extra 1-3 cycles, depending on the processor generation) that it could be safely implemented as an atomic operation.

Starting with Haswell, the core can read 64 Bytes in a single cycle (2 256-bit aligned AVX reads), so the exposure to unintended side effects becomes even lower.

Starting with KNL, full-cache-line (aligned) loads should be "naturally" atomic, since the transfers from the L1 Data Cache to the core are full cache lines and all of the data is placed into the target AVX-512 register. (This does not mean that Intel guarantees atomicity in the implementation! We don't have visibility into the horrible corner cases that the designers have to account for, but it is reasonable to conclude that most of the time aligned 512-bit loads will occur atomically.) With this "natural" 64-Byte atomicity, some of the tricks used in the past for reducing cache pollution due to "non-temporal" loads may deserve another look....


The MOVNTDQA instruction is intended primarily for reading from address ranges that are mapped as "Write-Combining" (WC), and not for reading from normal system memory that is mapped "Write-Back" (WB). The description in Volume 2 of the SWDM says that an implementation "may" do something special with MOVNTDQA for WB regions, but the emphasis is on the behavior for the WC memory type.

The "Write-Combining" memory type is almost never used for "real" memory --- it is used almost exclusively for Memory-Mapped IO regions.

See here for the whole post: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/597075
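
For completeness, the movntdqa load McCalpin refers to is exposed as the _mm_stream_load_si128 intrinsic (SSE4.1). A hedged sketch, assuming wc_ptr points into a 16-byte-aligned WC-mapped region such as a device buffer; on ordinary WB memory it is generally reported to behave essentially like a regular load, consistent with the answers above:

    #include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 */

    /* Dedicated NT load: on WC memory it can gather a full line via a
       streaming buffer without polluting the caches. */
    __m128i read_wc_chunk(__m128i *wc_ptr)
    {
        return _mm_stream_load_si128(wc_ptr);   /* movntdqa */
    }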
