Why is acquire semantics only for reads, not writes? How can an LL/SC acquire CAS take a lock without the store reordering with the critical section?

Question

To start with, consider release semantics. Suppose a data set is protected with a spinlock (or mutex, etc.; the exact implementation does not matter. For now, assume 0 means it's free and 1 means it's busy). After changing the data set, a thread stores 0 to the spinlock address. To force visibility of all previous actions before the store of 0 to the spinlock address, the store is executed with release semantics, which means all previous reads and writes shall be made visible to other threads before this store. Whether this is done with a full barrier or with a release mark on the single store operation is an implementation detail. That is (I hope) clear without any doubt.
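The unlock side described above might look like the following minimal sketch. The names lock_word, shared_data and unlock_after_update are hypothetical, and the sketch assumes the caller already holds the lock (lock_word is 1):

```cpp
#include <atomic>
#include <cassert>

// Hypothetical names, for illustration only.
std::atomic<int> lock_word{1};  // 1 = busy: assume the caller already holds the lock
int shared_data = 0;            // plain (non-atomic) data protected by lock_word

// Updates the protected data, then releases the lock.
// Returns the lock word as seen by this thread afterwards.
int unlock_after_update(int new_value) {
    // Precondition of this sketch: the lock is currently held.
    assert(lock_word.load(std::memory_order_relaxed) == 1);

    shared_data = new_value;  // write inside the critical section

    // Release store: the write to shared_data above cannot be reordered
    // after this store, so a thread that later acquires the lock sees it.
    lock_word.store(0, std::memory_order_release);

    return lock_word.load(std::memory_order_relaxed);
}
```

Whether the compiler emits a barrier plus a plain store or a single store-release instruction depends on the target.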

Then, consider the moment when spinlock ownership is being taken. To protect against races, this is some kind of compare-and-set operation. With a single-instruction CAS implementation (x86, SPARC, ...), this is a combined read and write. The same goes for the x86 atomic XCHG. With LL/SC (most RISCs), this breaks down into:

1. Read (LL) the spinlock location until it shows the free state. (This can be optimized with a kind of CPU stall.)
2. Write (SC) the value "occupied" (1 in our case). The CPU exposes whether the operation was successful (condition flag, output register, etc.)
3. Check the write (SC) result and, if it failed, go to step 1.

In all cases, the operation that must be visible to other threads to show that the spinlock is occupied is the write of 1 to its location, and a barrier must be placed between this write and the following manipulations of the data set protected by the spinlock. Reading the spinlock contributes nothing to the protection scheme, except that it permits the CAS or LL/SC operation.

But all schemes that are actually implemented allow the acquire-semantics modifier on reads (or CAS), not on writes. As a result, an LL/SC scheme would require an additional final read-with-acquire operation on the spinlock to commit the required barrier. But there is no such instruction in typical output. For example, when compiling this for ARM:

  for(;;) {
    int e{0};
    int d{1};
    if (std::atomic_compare_exchange_weak_explicit(p, &e, d,
          std::memory_order_acquire,
          std::memory_order_relaxed)) {
      return;
    }
  }

its output contains first LDAXR == LL+acquire, then STXR == SC (without barrier in it, so, there is no guarantee other threads will see it?) This is likely not an artifact of my code; the same thing is generated e.g. in glibc: pthread_spin_trylock calls __atomic_compare_exchange_weak_acquire (and no more barriers), which maps to the GCC builtin __atomic_compare_exchange_n with acquire on the mutex read and no release on the mutex write.

It seems I've missed some fundamental detail in this reasoning. Could anybody correct me?

This could also be split into two sub-questions:

SQ1: In an instruction sequence like:

(1) load_linked+acquire mutex_address     ; found it is free
(2) store_conditional mutex_address       ; succeeded
(3) read or write of mutex-protected area

what prevents the CPU from reordering (2) and (3), with the result that other threads won't see that the mutex is locked?

SQ2: Is there a design factor that suggests having acquire semantics only on loads?

I've seen some examples of lock-free code, such as:

thread 1:

var = value;
flag.store(true, std::memory_order_release);

thread 2:

if (flag.load(std::memory_order_acquire)) {
   // We already can access it!!!
   value = var;
   ... do something with value ...
}

but this style should have been made to work only after the mutex-protected style works reliably.

Answer 1:

"Its output contains first LDAXR == LL+acquire, then STXR == SC (without barrier in it, so, there is no guarantee other threads will see it?)"

Huh? Stores always become visible to other threads; the store buffer always drains itself as fast as possible. The question is only whether to block later loads/stores in this thread until the store buffer is empty. (That's required for seq-cst pure stores, for example).

STXR is exclusive and tied to the LL. So it and the load are indivisible in the global order of operations, as the load and store side of an atomic RMW operation, just like x86 does in one instruction with lock cmpxchg.

The atomic RMW can move earlier (because acquire loads can do that, and so can relaxed stores). But it can't move later (because acquire-loads can't do that). Therefore the atomic RMW appears in the global order before any operations in the critical section, and is sufficient for taking a lock. It doesn't have to wait for earlier operations like cache-miss stores; it can let them move into the critical section. But that's not a problem.
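As a minimal sketch of that reasoning, here is a spinlock whose lock() uses only an acquire CAS and whose unlock() uses a release store. The names SpinLock and count_with_spinlock are hypothetical, and this illustrates the ordering argument only (no backoff, no fairness):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical minimal spinlock: 0 = free, 1 = busy.
class SpinLock {
    std::atomic<int> state_{0};
public:
    void lock() {
        for (;;) {
            int expected = 0;
            // Acquire on success: the critical section cannot move before
            // the atomic RMW. Relaxed on failure: we just retry.
            if (state_.compare_exchange_weak(expected, 1,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
                return;
            }
        }
    }
    void unlock() {
        // Release: the critical section cannot move after this store.
        state_.store(0, std::memory_order_release);
    }
};

// Each thread increments a plain (non-atomic) counter under the lock.
long count_with_spinlock(int n_threads, int iterations) {
    assert(n_threads > 0 && iterations >= 0);
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;  // protected by the lock
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

If the acquire CAS were not enough to take the lock, increments could be lost; with mutual exclusion the result is n_threads * iterations.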

However, if you had used an acq_rel CAS, it couldn't take the lock until after finishing all earlier loads/stores (because of the release semantics of the store side).

I'm not sure if there's any asm difference between acq_rel and seq_cst for an atomic RMW. Possibly on PowerPC? Not on x86, where all atomic RMWs are seq_cst. Not on AArch64: it only has relaxed and sequential-release.

LDAR + STR would be like x86 cmpxchg without a lock prefix: an acquire load and a separate store. (Except that the store side of x86 cmpxchg is still a release-store (but not sequential-release) because of the x86 memory model.)

Other confirmation of my reasoning that mo_acquire for the "success" side of a CAS is sufficient for taking a lock:

https://en.cppreference.com/w/cpp/atomic/memory_order says "The lock() operation on a Mutex is also an acquire operation"
Glibc's pthread_spin_trylock uses the GCC builtin __atomic_compare_exchange_n on the mutex with only acquire, not acq_rel or seq_cst. We know many smart people have looked at glibc. And on platforms where it's not effectively strengthened to seq-cst asm, bugs probably would have been noticed if there were any.

"what prevents the CPU from reordering (2) and (3), with the result that other threads won't see that the mutex is locked?"

That would require other threads see the LL and SC as separate operations, not as an atomic RMW. The whole point of LL/SC is to prevent that. Weaker ordering lets it move around as a unit, not split apart.

"SQ2: Is there a design factor that suggests having acquire semantics only on loads?"

Yes, consider pure loads and pure stores, not RMWs. See Jeff Preshing on acquire and release semantics.

The one-way barrier of a release-store naturally works well with the store buffer on real CPUs. CPUs "want" to load early and store late. Perhaps Jeff Preshing's article "Memory Barriers Are Like Source Control Operations" is a helpful analogy for how CPUs interact with coherent cache.

A store that could only appear earlier, not later, would basically require flushing the store buffer, i.e. a relaxed store followed by a full barrier (like atomic_thread_fence(seq_cst), e.g. ARM dmb ish or x86 mfence or a locked operation). This is what you get from a seq-cst store. So we more or less already have a name for it, and it's very expensive.
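To illustrate "relaxed store followed by a full barrier", here is a small store-buffer litmus test as a sketch. The function names both_read_zero_once and count_both_zero are hypothetical. Each thread stores to its own flag, executes a seq_cst fence, then loads the other flag; with the fences in place, the C++ memory model does not allow both loads to return 0:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One run of the litmus test. Returns true if both threads read 0,
// which the seq_cst fences are expected to make impossible.
bool both_read_zero_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread a([&] {
        x.store(1, std::memory_order_relaxed);
        // Full barrier: the store above must be ordered before the load below.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread b([&] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    });
    a.join();
    b.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the test and counts how often both threads read 0.
int count_both_zero(int runs) {
    assert(runs >= 0);
    int hits = 0;
    for (int i = 0; i < runs; ++i) {
        if (both_read_zero_once()) ++hits;
    }
    return hits;
}
```

Replacing the fences with nothing (or using only release stores and acquire loads) would permit the "both read 0" outcome, since each store may still sit in its store buffer when the other thread's load executes. That gap is what the expensive full barrier closes.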

Source: https://stackoverflow.com/questions/58361491/why-is-acquire-semantics-only-for-reads-not-writes-how-can-an-ll-sc-acquire-ca

Tags

assembly

cpu-architecture

atomicity

spinlock

compare-and-swap