Is clflush or clflushopt atomic when the system crashes?

Question

Commonly, a cache line is 64 bytes, but the atomicity of non-volatile memory is 8 bytes.

For example:

x[1]=100;
x[2]=100;
clflush(x);

x is cacheline aligned, and is initially set to 0.

The system crashes in clflush();

Is it possible that x[1]=0 and x[2]=100 after reboot?

Answer 1:

Under the following assumptions:

I assume that the code you've shown represents a sequence of x86 assembly instructions rather than actual C code that is yet to be compiled.
I also assume that the code is being executed on a Cascade Lake processor and not on a later generation of Intel processors (I think CPL or ICX with Barlow Pass support eADR, meaning that explicit flushing is not required for persistence because the caches are in the persistence domain). This answer also applies to existing AMD+NVDIMM platforms.

The global observability order of stores may differ from the persist order on Intel x86 processors. This is referred to as relaxed persistency. The only case in which the order is guaranteed to be the same is for a sequence of stores of type WB to the same cache line (but a store reaching GO doesn't necessarily mean it has become durable). This is because CLFLUSH is atomic and WB stores cannot be reordered in global observability. See: On x86-64, is the “movnti” or "movntdq" instruction atomic when system crash?

If the two stores cross a cache line boundary or if the effective memory type of the stores is WC:

The x86-TSO memory model doesn't allow reordering stores, so it's impossible for another agent to observe x[2] == 100 and x[1] != 100 during normal operation (i.e., in the volatile state without a crash). However, if the system crashed and rebooted, it's possible for the persistent state to be x[2] == 100 and x[1] != 100. This is possible even if the system crashed after retiring clflush because the retirement of clflush doesn't necessarily mean that the cache line flushed has reached the persistence domain.

If you want to eliminate that possibility, you can move clflush as follows:

x[1]=100;
clflush(x);
x[2]=100;

clflush on Intel processors is ordered with respect to all writes, meaning that the line is guaranteed to reach the persistence domain before any later stores become globally observable. See: Persistent Memory Programming Primer (PDF) and the Intel SDM V2. The second store could be to the same line or any other line.

If you want x[1]=100 to become persistent before x[2]=100 becomes globally observable, add sfence after clflush on Intel CSX or mfence on AMD processors (clflush is only ordered by mfence on AMD processors). clflush by itself is sufficient to control persist order.
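The store / flush / store sequence above might look like this in C. This is only a sketch, assuming GCC or Clang on x86-64; demo_line and store_flush_store are hypothetical names, and the static array merely stands in for memory that would really be mapped from an NVDIMM. volatile is used so the compiler keeps the two stores in program order.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence, _mm_mfence */
#define HAVE_X86_FLUSH 1
#else
#define HAVE_X86_FLUSH 0 /* non-x86 build: the flush becomes a no-op */
#endif

#define CACHE_LINE 64

/* Stand-in for one cache line of persistent memory (assumption). */
_Alignas(CACHE_LINE) volatile uint8_t demo_line[CACHE_LINE];

/* Hypothetical helper: store x[1], flush the line, then store x[2]. */
void store_flush_store(volatile uint8_t *x)
{
    assert(((uintptr_t)x % CACHE_LINE) == 0);  /* x must be line-aligned */

    x[1] = 100;                    /* volatile: compiler keeps program order */
#if HAVE_X86_FLUSH
    _mm_clflush((const void *)x);  /* flush the line containing x[1] */
    _mm_sfence();                  /* Intel CSX case from the text;
                                      on AMD the text calls for _mm_mfence() */
#endif
    x[2] = 100;                    /* issued only after the flush */
}
```

Calling store_flush_store(demo_line) leaves both bytes set in the volatile state; the flush only affects which persistent states are possible after a crash.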

Alternatively, use the sequence clflushopt+sfence (or clwb+sfence) as follows:

x[1]=100;
clflushopt(x);
sfence;
x[2]=100;

In this case, if a crash happened and if x[2] == 100 in the persistent state, then it's guaranteed that x[1] == 100. clflushopt by itself doesn't impose any persist ordering.

Answer 2:
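As a rough C rendering of the clflushopt+sfence sequence, the sketch below assumes GCC or Clang on x86-64. All names (cpu_has_clflushopt, flush_line_opt, persist_line, store_flushopt_store, opt_demo_line) are hypothetical. Because not every CPU implements clflushopt, the sketch checks CPUID leaf 7 (EBX bit 23) and falls back to clflush, which is an addition of this example rather than something the answer prescribes.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)  /* GCC or Clang on x86-64 */
#include <cpuid.h>
#include <immintrin.h>

/* CPUID.(EAX=7,ECX=0):EBX bit 23 reports CLFLUSHOPT support. */
int cpu_has_clflushopt(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ebx >> 23) & 1u);
}

/* The target attribute lets this one function use the intrinsic
   without compiling the whole file with -mclflushopt. */
__attribute__((target("clflushopt")))
void flush_line_opt(void *p) { _mm_clflushopt(p); }

/* Flush one line, then fence so later stores are ordered after it. */
void persist_line(void *p)
{
    if (cpu_has_clflushopt())
        flush_line_opt(p);
    else
        _mm_clflush(p);  /* fallback chosen for this sketch */
    _mm_sfence();
}
#else
void persist_line(void *p) { (void)p; }  /* no-op on other targets */
#endif

/* Stand-in for one cache line of persistent memory (assumption). */
_Alignas(64) volatile uint8_t opt_demo_line[64];

void store_flushopt_store(volatile uint8_t *x)
{
    assert(((uintptr_t)x % 64) == 0);  /* x must be line-aligned */
    x[1] = 100;
    persist_line((void *)x);           /* clflushopt + sfence */
    x[2] = 100;
}
```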

(Also see @Hadi's answer: x86 TSO store ordering does not guarantee persistence ordering even within one line. This answer doesn't try to address that. My best guess based on Hadi's answer is that a single atomic store to one 32-byte half of a cache line will persist atomically, but that's based on how current HW works, transferring lines in 2 32-byte halves between cores, caches, and memory controllers. If this really matters, look for docs or ask Intel.)

Remember that store data can propagate out of cache (into DRAM or NVDIMM) on its own, before explicit flushing.

The following sequence of events is possible:

x[2]=100; store the 3rd byte of the cache line first. (Compile-time reordering: this is a C question, not an asm question, and x is apparently a plain uint8_t x[64], not _Atomic or volatile, so x[1]=100; and x[2]=100; aren't guaranteed to happen in that order in the asm.)
An interrupt arrives; at some point the cache line containing x[] is evicted all the way out of cache, into the persistence domain. (Perhaps after a context-switch to another thread, so lots of other code runs between those two asm stores).
The system crashes before execution resumes. (Or before x[1]=100; finishes becoming durable.)

If you want to depend on x86 memory ordering rules to control durability order within a cache line, you need to make sure C respects that. volatile would work, or _Atomic with memory_order_release for at least the 2nd store. (Or better, get them done as a single store if it's within an aligned 8-byte chunk.) (x86 asm memory model = program order with a store buffer; no StoreStore reordering.)
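One way the _Atomic option could look is sketched below, using only C11 <stdatomic.h>. ordered_line and store_in_order are hypothetical names. The release ordering on the second store stops the compiler from moving the first store after it; on x86 both stores are expected to compile to ordinary mov instructions. This controls only the order of the stores in the asm, not when either one becomes durable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for one cache line of persistent memory (assumption). */
_Alignas(64) _Atomic uint8_t ordered_line[64];

/* Hypothetical helper: x[1] is stored before x[2] in the generated asm. */
void store_in_order(_Atomic uint8_t *x)
{
    assert(x != NULL);

    /* Relaxed is enough for the first store... */
    atomic_store_explicit(&x[1], 100, memory_order_relaxed);
    /* ...because earlier stores cannot be reordered after a release store. */
    atomic_store_explicit(&x[2], 100, memory_order_release);
}
```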

Compile-time reordering doesn't usually happen for no reason (although it can); more often, surrounding code makes it appealing to the optimizer. (And of course the opposite outcome, x[1]=100 with x[2]=0, can occur by this mechanism without any compile-time reordering, if they are 2 separate stores.)

I think a necessary pre-condition for atomicity of durability is being done as a single atomic store. e.g. guaranteed atomic by the ISA, or with a single wider SIMD store¹ because Intel CPUs in practice don't split those up (but there's no on-paper guarantee of that). Being atomic wrt. interrupts (i.e. a single instruction) but not a single store uop makes it harder to split up but still completely possible² and thus not guaranteed safe. e.g. a 10-byte x87 fstp tbyte that involves 2 separate store-data uops can be split up by an invalidation from another core that's possible even without false sharing. (See footnote 2 again.)

Without any on-paper atomicity guarantee for 16-byte or wider SIMD stores, you'd be depending on implementation details for SIMD stores or misaligned stores to not be split up.

Even ISA-guaranteed atomicity isn't sufficient, though: a lock cmpxchg that spans a cache-line boundary is still guaranteed atomic wrt. other cores and DMA readers (supporting this is very, very slow; don't do it), but there's no way to guarantee those two lines become durable at the same time. Outside of that special case for atomicity, IDK, I can't rule out whole-line atomicity. It's certainly plausible that a plain store into a single line which is atomic in asm will become durable atomically, with no chance of tearing.

Within a single cache line, I don't know.

I'd guess that an atomic store within an 8-byte-aligned block will make it to persistence atomically or not at all, but I haven't checked Intel's docs. (And in practice perhaps even a whole 64-byte line, which you could store with AVX512). The point of this answer is that you don't even have a single atomic store so there are lots of other mechanisms for breaking your test-case.
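To get both bytes done as a single store, something like the following sketch could be used. demo_words, set_bytes_1_and_2 and byte_at are hypothetical names, and the sketch assumes a single writer, since the load-modify-store as a whole is not an atomic read-modify-write; only the final 8-byte store is a single store. Whether that store also persists atomically is the guess stated above, not something this code can establish.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for one 64-byte line, viewed as eight 8-byte words. */
_Alignas(64) _Atomic uint64_t demo_words[8];

/* Hypothetical helper: set bytes 1 and 2 of an aligned 8-byte word
   with one 8-byte store instead of two 1-byte stores. */
void set_bytes_1_and_2(_Atomic uint64_t *word)
{
    assert(((uintptr_t)word % 8) == 0);  /* must be 8-byte aligned */

    uint64_t w = atomic_load_explicit(word, memory_order_relaxed);
    uint8_t bytes[sizeof w];
    memcpy(bytes, &w, sizeof w);         /* edit a private copy */
    bytes[1] = 100;
    bytes[2] = 100;
    memcpy(&w, bytes, sizeof w);
    atomic_store_explicit(word, w, memory_order_release);  /* one store */
}

/* Read back byte i (0..7) of the word, in memory order. */
uint8_t byte_at(_Atomic uint64_t *word, unsigned i)
{
    uint64_t w = atomic_load_explicit(word, memory_order_relaxed);
    uint8_t bytes[sizeof w];
    memcpy(bytes, &w, sizeof w);
    return bytes[i];
}
```

Going through memcpy keeps the byte positions independent of how the integer value is interpreted, so bytes 1 and 2 here are the same memory bytes as x[1] and x[2] in the question.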

Footnote 1: Modern Intel CPUs commit SIMD stores to L1d cache as a single transaction as long as they don't span a cache line. Intel hasn't made a CPU that splits SIMD stores into 2 halves since Sandy/Ivy Bridge which had full-width 256-bit AVX execution units but only 128-bit wide paths to/from cache in the load units and AFAIK in the store-buffer-commit stuff. (The store-data execution unit also took 2 cycles to write 32-byte store data to the store buffer).

Footnote 2: For separate store uops that are part of the same instruction like in fstp tbyte [rdi], this might be possible:

first part commits from the store buffer to L1d cache
an RFO or share-request arrives and is handled before the 2nd store from the same instruction commits: This core's copy is now Invalid or Shared so the commit from the store buffer to L1d is blocked until it regains exclusive ownership. The 2nd part of the store from that one instruction is at the head of the store buffer, not in coherent cache.
The other core that was doing an RFO followed up their store with a clflush, evicting this line to persistent memory before the first core can get it back and finish committing the other data from that one instruction.

An NT store like movnti by another core would force eviction of the line as part of committing the NT store, like a normal store + clflushopt.

This scenario requires false-sharing between two threads trying to persist 2 separate things in the same line, so can be avoided if you avoid false-sharing e.g. with padding. (Or some insane true sharing, or firing off clflush without storing first, on memory that other threads might be in the middle of writing).
(Or more plausible for software, much less plausible for hardware): The line gets evicted on its own before the first writer gets it back, even though a core has a pending RFO for it. (As soon as it loses ownership, the first core would send out an RFO).
(Or fully plausible without false-sharing): Forced eviction from L2/L1d at any time due to eviction from an inclusive cache-line tracking structure. This could be triggered by demand for lines that merely happen to alias the same set in L3, not false sharing.

Skylake-server (SKX) has non-inclusive L3, as do later Intel server CPUs. Cascade Lake (CSX) was the first to support persistent memory. Even though it has a non-inclusive L3, the snoop filter is inclusive and a fill conflict that causes an eviction does cause a back invalidation in the entire NUMA node.

So an invalidate request can arrive at any time, and it's likely the core / store buffer isn't going to hold onto the line for more cycles to commit an unknown number of more stores to the same line.

(By that point, the fact that both store-buffer entries were part of one instruction is probably lost. It's possible for an access pattern to create a stream of store buffer entries that store different parts of the same cache line indefinitely, so waiting until "all stores for this line are done" could let unprivileged code create a denial-of-service for a core that wanted to read it. So I think it's unlikely that HW would have a mechanism to avoid releasing ownership of a cache line between stores that came from the same instruction.)

Source: https://stackoverflow.com/questions/65439089/is-clflush-or-clflushopt-atomic-when-system-crash

Tags

x86-64

intel

atomicity

persistent-memory