Does this envelope implementation correctly use C++11 atomics?

Question


I have written a simple 'envelope' class to make sure I understand the C++11 atomic semantics correctly. I have a header and a payload, where the writer clears the header, fills in the payload, then fills the header with an increasing integer. The idea is that a reader then can read the header, memcpy out the payload, read the header again, and if the header is the same the reader can then assume they successfully copied the payload. It's OK that the reader may miss some updates, but it's not OK for them to get a torn update (where there is a mix of bytes from different updates). There is only ever a single reader and a single writer.

The writer uses release memory order and the reader uses acquire memory order.

Is there any risk of the memcpy being reordered with the atomic store/load calls? Or can the loads be reordered with each other? This never aborts for me but maybe I'm lucky.

#include <iostream>
#include <atomic>
#include <thread>
#include <cstring>
#include <cstdint>

struct envelope {
    alignas(64) uint64_t writer_sequence_number = 1;
    std::atomic<uint64_t> sequence_number;
    char payload[5000];

    void start_writing()
    {
        sequence_number.store(0, std::memory_order::memory_order_release);
    }

    void publish()
    {
        sequence_number.store(++writer_sequence_number, std::memory_order::memory_order_release);
    }

    bool try_copy(char* copy)
    {
        auto before = sequence_number.load(std::memory_order::memory_order_acquire);
        if(!before) {
            return false;
        }
        ::memcpy(copy, payload, 5000);
        auto after = sequence_number.load(std::memory_order::memory_order_acquire);
        return before == after;
    }
};

envelope g_envelope;

void reader_thread()
{
    char local_copy[5000];
    unsigned messages_received = 0;
    while(true) {
        if(g_envelope.try_copy(local_copy)) {
            for(int i = 0; i < 5000; ++i) {
                // if there is no tearing we should only see the same letter over and over
                if(local_copy[i] != local_copy[0]) {
                    abort();
                }
            }
            if(messages_received++ % 64 == 0) {
                std::cout << "successfully received=" << messages_received << std::endl;
            }
        }
    }
}

void writer_thread()
{
    const char alphabet[] = {"ABCDEFGHIJKLMNOPQRSTUVWXYZ"};
    unsigned i = 0;
    while(true) {
        char to_write = alphabet[i % (sizeof(alphabet)-1)];
        g_envelope.start_writing();
        ::memset(g_envelope.payload, to_write, 5000);
        g_envelope.publish();
        ++i;
    }
}

int main(int argc, char** argv)
{
    std::thread writer(&writer_thread);
    std::thread reader(&reader_thread);

    writer.join();
    reader.join();

    return 0;
}

Answer 1:


This is called a seqlock; it has a data race simply because of the conflicting, unsynchronized calls to memset and memcpy on payload. There have been proposals to provide a memcpy-like facility that would make this sort of code well-defined; the most recent is not likely to appear before C++26 (even if approved).




Answer 2:


This is called a seqlock. It's a known pattern, and it works well for data that is published occasionally and read often. If you republish too often (especially for a buffer as large as 5000 bytes), you risk too many retries by readers that keep detecting possible tearing. It's commonly used to e.g. publish a 64-bit or 128-bit timestamp from a timer interrupt handler to all cores: the writer never has to acquire a lock, and readers are read-only with negligible overhead in the fast path.


Acquire and release operations are only one-way barriers: nothing stops the reader's memcpy from sinking below the second acquire load, or the writer's payload writes from hoisting above the first release store.

You need atomic_thread_fence(mo_acquire) before the 2nd load of the sequence number in the reader to make sure it doesn't happen earlier, before the memcpy finishes. Likewise you need atomic_thread_fence(mo_release) in the writer, after the first store and before writing the data. Note that acquire/release fences are 2-way barriers, and do affect non-atomic variables (footnote 1). (Despite misconceptions to the contrary, fences really are 2-way barriers, unlike acquire or release operations; Jeff Preshing explains and debunks the confusion.)
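
A minimal sketch of where those fences could slot into the envelope from the question (this reuses the question's member names and is not the answerer's exact code; the data-race UB on payload discussed below is still present):

// inside struct envelope from the question
void start_writing()
{
    // relaxed is enough once the fence is added
    sequence_number.store(0, std::memory_order_relaxed);
    // release fence: keeps the caller's subsequent payload writes from being
    // reordered before the "write in progress" store above (in practice; see footnote 1)
    std::atomic_thread_fence(std::memory_order_release);
}

void publish()
{
    // release store: the payload writes above can't be reordered after this
    sequence_number.store(++writer_sequence_number, std::memory_order_release);
}

bool try_copy(char* copy)
{
    auto before = sequence_number.load(std::memory_order_acquire);
    if(!before) {
        return false;
    }
    ::memcpy(copy, payload, sizeof(payload));
    // acquire fence: keeps the memcpy above from sinking below the re-check
    std::atomic_thread_fence(std::memory_order_acquire);
    // the fence already provides the ordering, so relaxed is enough here
    auto after = sequence_number.load(std::memory_order_relaxed);
    return before == after;
}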

See also Implementing 64 bit atomic counter with 32 bit atomics for my attempt at a templated SeqLock class. There I required the template parameter T to provide an assignment operator to copy itself around, but using memcpy might be better. I used volatile for extra safety against the C++ UB this involves. That works easily for uint64_t, but is a huge pain in C++ for anything wider, unlike in C where you can get the compiler to efficiently emit code to load from a volatile struct into a non-volatile temporary.

You're going to have C++ data-race UB either way (because C++ makes best efficiency impossible without UB: the whole point of a SeqLock is to let tearing potentially happen on data[], but detect that and never actually look at the torn data). You could avoid UB by copying your data as an array of atomic<unsigned long> or something, but current compilers aren't smart enough to use SIMD for that, so access to the shared data would be inefficient. (And HW vendors fail to document per-element atomicity of vector load/store and gather/scatter, even though we all know that current CPUs do give that and future CPUs almost certainly will too.)
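
As a rough sketch of that UB-free but slower alternative (hypothetical names, not code from the answer; current compilers will emit word-at-a-time copies rather than SIMD):

#include <atomic>

struct atomic_payload {
    // 5000 bytes' worth of word-sized relaxed atomics (5000 is divisible by sizeof(unsigned long))
    static constexpr unsigned N = 5000 / sizeof(unsigned long);
    std::atomic<unsigned long> words[N];

    void fill(unsigned long v) {                 // writer side
        for (auto &w : words)
            w.store(v, std::memory_order_relaxed);
    }
    void copy_to(unsigned long *out) const {     // reader side
        for (unsigned i = 0; i < N; ++i)
            out[i] = words[i].load(std::memory_order_relaxed);
    }
};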

A memory barrier is probably sufficient, but it would be nice to do something to "launder" the value to make extra sure the compiler doesn't put another reload of the non-atomic data after the 2nd load, like glibc's atomic_forced_read (see What is the purpose of glibc's atomic_forced_read function?). But as I said, I think atomic_thread_fence() is sufficient, at least in practice with compilers like GCC, which treat thread_fence like asm("":::"memory"): it tells the compiler all values in memory might have changed.
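
For illustration, one way to write such a launder in GNU C (a sketch in the spirit of glibc's atomic_forced_read, not its actual definition; the fence approach above is probably enough on its own):

#include <cstdint>

// Force the already-loaded value through a register via an empty asm statement,
// so the optimizer can't substitute a later reload of the underlying memory.
static inline uint64_t forced_read_u64(uint64_t v)
{
    asm("" : "+r"(v));   // opaque to the compiler; emits no instructions
    return v;
}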


Footnote 1: Maxim points out that atomic_thread_fence may be sort of a hack because ISO C++ specifies things only in terms of barriers and release-sequences synchronizing with loads that see the value stored.

But it's well known how fences and acq/rel loads/stores map to asm for any given target platform. It's implausible that a compiler will do enough whole-program inter-thread analysis to prove that it can break your code.

There might be an argument to be made in terms of the language used in the C++ standard about establishing happens-before relationships between the store of tmp+1 and at least some hypothetical reader. In practice that's enough to stop a compiler from breaking the writer: it can't know what code will be reading the data it's writing so it has to respect barriers. And probably the language in the standard is strong enough that a reader that sees an odd sequence number (and avoids reading data[]) can avoid data-race UB, so there would be a valid happens-before relationship between an atomic store that has to stay ahead of some non-atomic stores. So I'm not convinced that there's any room for a malicious all-seeing compiler to not respect atomic_thread_fence() there, let alone any real compiler.

In any case, you definitely do not want _mm_lfence() on x86. You do want a compiler barrier against compile-time reordering, but you definitely do not want lfence's main runtime effect: blocking out-of-order execution. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths and Are loads and stores the only instructions that gets reordered?

i.e. you just want GNU C asm("":::"memory"), aka atomic_signal_fence(mo_seq_cst). That's also equivalent to atomic_thread_fence(mo_acq_rel) on x86, which only has to block compile-time reordering to control runtime ordering, because the only runtime reordering x86's strong memory model allows is StoreLoad (except for NT stores). x86's memory model is seq_cst plus a store buffer with store forwarding (which weakens seq_cst to acq/rel, and occasionally has other funky effects, especially for loads that partially overlap a store).
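
Concretely, any of these spellings gives the pure compile-time barrier being described (assuming GCC/Clang on x86, where none of them emits a fence instruction):

#include <atomic>

inline void compiler_barrier()
{
    asm volatile("" ::: "memory");                           // GNU C compiler barrier
    // std::atomic_signal_fence(std::memory_order_seq_cst);  // portable spelling of the same thing
    // std::atomic_thread_fence(std::memory_order_acq_rel);  // also instruction-free on x86
}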

For more about _mm_lfence() and so on vs. the asm instructions, see When should I use _mm_sfence _mm_lfence and _mm_mfence.


Other tweaks

Your sequence number is unnecessarily wide: 64-bit atomics are less efficient on some 32-bit platforms, and very inefficient on a few. A 32-bit sequence number won't wrap in any reasonable thread-sleep time. (e.g. a 4GHz CPU would take about a whole second to do 2^32 stores at 1 store per clock, and that's with zero contention for writes to the cache line and no cycles spent executing stores of the actual data. And practical use-cases don't have the writer in a tight loop publishing new values constantly: that could lead to something similar to livelock, with readers constantly retrying and making no progress.)

unsigned long is never (AFAIK) too wide to handle efficiently, except on CPUs narrower than 32 bits. So atomic<long> or atomic<unsigned long> would use a 64-bit counter on CPUs where that's fine, while avoiding the risk of a 64-bit atomic in 32-bit code. And long is required to be at least 32 bits wide.

Also, you don't need two copies of the write sequence number. Just have the writer do an atomic load into a tmp var, then separate atomic stores of tmp+1 and tmp+2.

(You're correct in wanting to avoid sequence_number++; it would be a bad idea to ask the compiler to do two atomic RMWs per publish.) The only advantage of a separate non-atomic variable for the writer's private sequence number is if this can inline into a write loop and keep the value in a register, so the writer never reloads it. A sketch of the simplified scheme is below.
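
A rough sketch of that simplified writer (hypothetical names, not code from the answer; odd sequence numbers mean a write is in progress, even ones mean the payload is stable):

#include <atomic>
#include <cstring>

struct envelope2 {
    std::atomic<unsigned long> seq{0};   // even = stable, odd = write in progress
    char payload[5000];

    void write(char fill)
    {
        unsigned long tmp = seq.load(std::memory_order_relaxed); // the single writer is the only one who stores
        seq.store(tmp + 1, std::memory_order_relaxed);           // mark "in progress" (odd)
        std::atomic_thread_fence(std::memory_order_release);     // payload writes stay after the store above
        ::memset(payload, fill, sizeof(payload));
        seq.store(tmp + 2, std::memory_order_release);           // publish (even again)
    }
};

A matching reader would treat an odd "before" value the way the original treats 0 (bail out and retry), and otherwise compare before and after exactly as in the fence-based try_copy sketched above.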



Source: https://stackoverflow.com/questions/58088943/does-this-envelope-implementation-correctly-use-c11-atomics
