C++ latency increases when memory ordering is relaxed


Question


I am on Windows 7 64-bit, VS2013 (x64 Release build) experimenting with memory orderings. I want to share access to a container using the fastest synchronization. I opted for atomic compare-and-swap.

My program spawns two threads. A writer pushes to a vector and the reader detects this.

Initially I didn't specify any memory ordering, so I assume it uses memory_order_seq_cst?

With memory_order_seq_cst the latency is 340-380 cycles per op.

To try and improve performance I made stores use memory_order_release and loads use memory_order_acquire.

However, the latency increased to approx 1,940 cycles per op.

Have I misunderstood something? Full code below.

Using default memory_order_seq_cst:

#include <iostream>
#include <atomic>
#include <thread>
#include <vector>

std::atomic<bool> _lock{ false };
std::vector<uint64_t> _vec;
std::atomic<uint64_t> _total{ 0 };
std::atomic<uint64_t> _counter{ 0 };
static const uint64_t LIMIT = 1000000;

void writer()
{
    while (_counter < LIMIT)
    {
        bool expected{ false };
        bool val = true;

        if (_lock.compare_exchange_weak(expected, val))
        {
            _vec.push_back(__rdtsc());
            _lock = false;
        }
    }
}

void reader()
{
    while (_counter < LIMIT)
    {
        bool expected{ false };
        bool val = true;

        if (_lock.compare_exchange_weak(expected, val))
        {
            if (_vec.empty() == false)
            {
                const uint64_t latency = __rdtsc() - _vec[0];
                _total += (latency);
                ++_counter;
                _vec.clear();
            }

            _lock = false;
        }
    }
}

int main()
{
    std::thread t1(writer);
    std::thread t2(reader);

    t2.detach();
    t1.join();

    std::cout << _total / _counter << " cycles per op" << std::endl;
}

Using memory_order_acquire and memory_order_release:

void writer()
{
    while (_counter < LIMIT)
    {
        bool expected{ false };
        bool val = true;

        if (_lock.compare_exchange_weak(expected, val, std::memory_order_acquire))
        {
            _vec.push_back(__rdtsc());
            _lock.store(false, std::memory_order_release);
        }
    }
}

void reader()
{
    while (_counter < LIMIT)
    {
        bool expected{ false };
        bool val = true;

        if (_lock.compare_exchange_weak(expected, val, std::memory_order_acquire))
        {
            if (_vec.empty() == false)
            {
                const uint64_t latency = __rdtsc() - _vec[0];
                _total += (latency);
                ++_counter;
                _vec.clear();
            }

            _lock.store(false, std::memory_order_release);
        }
    }
}

Answer 1:


You don't have any protection against a thread taking the lock again right after releasing it: the reader can re-enter only to find _vec still empty, and the writer can re-enter and push another TSC value that the reader will never measure (it only reads _vec[0] before clearing). I suspect your change lets the reader waste more time blocking the writer (and vice versa), leading to lower actual throughput.

TL:DR: The real problem was lack of fairness in your locking (too easy for a thread that just unlocked to be the one that wins the race to lock it again), and the way you're using that lock. (You have to take it before you can determine whether there's anything useful to do, forcing the other thread to retry, and causing extra transfers of the cache line between cores.)

Having a thread re-acquire the lock without the other thread getting a turn is always useless and wasted work here, unlike many real cases where it takes more repeats to fill up or empty a queue. This is a bad producer-consumer algorithm (the queue is too small: size 1, and the reader discards all vector elements after reading vec[0]), and the worst possible locking scheme for it.


_lock.store(false, seq_cst); compiles to xchg instead of a plain mov store. It has to wait for the store buffer to drain and is just plain slow (footnote 1). (On Skylake, for example, it's microcoded as 8 uops with throughput of one per 23 cycles for many repeated back-to-back operations, in the no-contention case where it's already hot in L1d cache. You didn't specify anything about what hardware you have.)

_lock.store(false, std::memory_order_release); does just compile to a plain mov store with no extra barrier instructions. So the reload of _counter can happen in parallel with it (although branch prediction + speculative execution makes that a non-issue). And more importantly, the next CAS attempt to take the lock can actually try sooner.

There is hardware arbitration for access to a cache line when multiple cores are hammering on it, presumably with some fairness heuristics, but I don't know if the details are known.

Footnote 1: xchg is not as slow as mov+mfence on some recent CPUs, especially Skylake-derived CPUs. It is the best way to implement a seq_cst pure store on x86. But it's slower than plain mov.
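To make the difference concrete, here's a minimal sketch (my illustration, not code from the question); the asm comments show what clang/MSVC typically emit on x86-64 (GCC would use mov + mfence for the seq_cst store instead):

#include <atomic>

std::atomic<bool> lock_flag{ false };

void unlock_seq_cst() {
    // typically compiles to an (implicitly locked) xchg: full barrier, drains the store buffer
    lock_flag.store(false, std::memory_order_seq_cst);
}

void unlock_release() {
    // typically compiles to: mov byte ptr [lock_flag], 0 -- plain store, no barrier instruction
    lock_flag.store(false, std::memory_order_release);
}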


You can completely solve this by having your lock force the writer and reader to alternate

Writer waits for false, then stores true when it's done. Reader does the reverse. So the writer can never re-enter the critical section without the other thread having had a turn. (When you "wait for a value", do that read-only with a load, not a CAS. A CAS on x86 needs exclusive ownership of the cache line, preventing other threads from reading. With only one reader and one writer, you don't need any atomic RMWs for this to work.)

If you had multiple readers and multiple writers, you could have a 4-state sync variable where a writer tries to CAS it from 0 to 1, then stores 2 when it's done. Readers try to CAS from 2 to 3, then store 0 when done; a sketch of the producer side follows.
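A minimal sketch of that 4-state handoff (my own illustration with made-up names, not code from the Godbolt link below):

#include <atomic>
#include <cstdint>
#include <x86intrin.h>   // _mm_pause (use <intrin.h> with MSVC)

// Hypothetical states: 0 = empty, 1 = a writer owns the slot, 2 = full, 3 = a reader owns it
std::atomic<int> state{ 0 };
uint64_t slot;

void produce(uint64_t value) {
    int expected = 0;
    while (!state.compare_exchange_weak(expected, 1, std::memory_order_acquire)) {
        expected = 0;   // a failed CAS overwrites 'expected' with the current state; reset it
        _mm_pause();
    }
    slot = value;                                // exclusive access while state == 1
    state.store(2, std::memory_order_release);   // publish: a reader can now CAS 2 -> 3
}
// The consumer mirrors this: CAS 2 -> 3, read 'slot', then store 0 when done.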

The SPSC (single producer single consumer) case is simple:

#include <atomic>
#include <cstdint>
#include <x86intrin.h>   // __rdtsc, _mm_pause (use <intrin.h> with MSVC)

// Compile with e.g. -DSPIN='_mm_pause()' or -DSPIN='' and -DORDER=std::memory_order_release
enum lockstates { LK_WRITER=0, LK_READER=1, LK_EXIT=2 };
std::atomic<lockstates> shared_lock{ LK_WRITER };  // explicitly start with the writer's turn
uint64_t shared_queue;  // single entry

uint64_t global_total{ 0 }, global_counter{ 0 };
static const uint64_t LIMIT = 1000000;

void writer()
{
    while(1) {
        enum lockstates lk;
        while ((lk = shared_lock.load(std::memory_order_acquire)) != LK_WRITER) {
                if (lk == LK_EXIT) 
                        return;
                else
                        SPIN;     // _mm_pause() or empty
        }

        //_vec.push_back(__rdtsc());
        shared_queue = __rdtsc();
        shared_lock.store(LK_READER, ORDER);   // seq_cst or release
    }
}

void reader()
{
    uint64_t total=0, counter=0;
    while(1) {
        enum lockstates lk;
        while ((lk = shared_lock.load(std::memory_order_acquire)) != LK_READER) {
                SPIN;       // _mm_pause() or empty
        }

        const uint64_t latency = __rdtsc() - shared_queue;  // _vec[0];
        //_vec.clear();
        total += latency;
        ++counter;
        if (counter < LIMIT) {
                shared_lock.store(LK_WRITER, ORDER);
        }else{
                break;  // must avoid storing a LK_WRITER right before LK_EXIT, otherwise writer races and can overwrite with LK_READER
        }
    }
    global_total = total;
    global_counter = counter;
    shared_lock.store(LK_EXIT, ORDER);
}

Full version on Godbolt. On my i7-6700k Skylake desktop (2-core turbo = 4200MHz, TSC = 4008MHz), compiled with clang++ 9.0.1 -O3. Data is pretty noisy, as expected; I did a bunch of runs and manually selected a low and high point, ignoring some real outlier highs that were probably due to warm-up effects.

On separate physical cores:

  • -DSPIN='_mm_pause()' -DORDER=std::memory_order_release: ~180 to ~210 cycles / op, basically zero machine_clears.memory_ordering (like 19 total over 1000000 ops, thanks to pause in the spin-wait loop.)
  • -DSPIN='_mm_pause()' -DORDER=std::memory_order_seq_cst: ~195 to ~215 ref cycles / op, same near-zero machine clears.
  • -DSPIN='' -DORDER=std::memory_order_release: ~195 to ~225 ref c/op, 9 to 10 M/sec machine clears without pause.
  • -DSPIN='' -DORDER=std::memory_order_seq_cst: more variable and slower, ~250 to ~315 c/op, 8 to 10 M/sec machine clears without pause

These timings are about 3x faster than your seq_cst "fast" original on my system. Using std::vector<> instead of a scalar might account for ~4 cycles of that; I think there was a slight effect when I replaced it. Maybe just random noise, though. 200 / 4.008GHz is about 50ns inter-core latency, which sounds about right for a quad-core "client" chip.

From the best version (mo_release, spinning on pause to avoid machine clears):

$ clang++ -Wall -g -DSPIN='_mm_pause()' -DORDER=std::memory_order_release -O3 inter-thread.cpp -pthread && 
 perf stat --all-user -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,machine_clears.memory_ordering -r4 ./a.out
195 ref cycles per op. total ticks: 195973463 / 1000000 ops
189 ref cycles per op. total ticks: 189439761 / 1000000 ops
193 ref cycles per op. total ticks: 193271479 / 1000000 ops
198 ref cycles per op. total ticks: 198413469 / 1000000 ops

 Performance counter stats for './a.out' (4 runs):

            199.83 msec task-clock:u              #    1.985 CPUs utilized            ( +-  1.23% )
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               128      page-faults               #    0.643 K/sec                    ( +-  0.39% )
       825,876,682      cycles:u                  #    4.133 GHz                      ( +-  1.26% )
        10,680,088      branches:u                #   53.445 M/sec                    ( +-  0.66% )
        44,754,875      instructions:u            #    0.05  insn per cycle           ( +-  0.54% )
       106,208,704      uops_issued.any:u         #  531.491 M/sec                    ( +-  1.07% )
        78,593,440      uops_executed.thread:u    #  393.298 M/sec                    ( +-  0.60% )
                19      machine_clears.memory_ordering #    0.094 K/sec                    ( +-  3.36% )

           0.10067 +- 0.00123 seconds time elapsed  ( +-  1.22% )

And from the worst version (mo_seq_cst, no pause): the spin-wait loop spins faster so branches and uops issued/executed are much higher, but actual useful throughput is somewhat worse.

$ clang++ -Wall -g -DSPIN='' -DORDER=std::memory_order_seq_cst -O3 inter-thread.cpp -pthread && 
 perf stat --all-user -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,machine_clears.memory_ordering -r4 ./a.out
280 ref cycles per op. total ticks: 280529403 / 1000000 ops
215 ref cycles per op. total ticks: 215763699 / 1000000 ops
282 ref cycles per op. total ticks: 282170615 / 1000000 ops
174 ref cycles per op. total ticks: 174261685 / 1000000 ops

 Performance counter stats for './a.out' (4 runs):

            207.82 msec task-clock:u              #    1.985 CPUs utilized            ( +-  4.42% )
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               130      page-faults               #    0.623 K/sec                    ( +-  0.67% )
       857,989,286      cycles:u                  #    4.129 GHz                      ( +-  4.57% )
       236,364,970      branches:u                # 1137.362 M/sec                    ( +-  2.50% )
       630,960,629      instructions:u            #    0.74  insn per cycle           ( +-  2.75% )
       812,986,840      uops_issued.any:u         # 3912.003 M/sec                    ( +-  5.98% )
       637,070,771      uops_executed.thread:u    # 3065.514 M/sec                    ( +-  4.51% )
         1,565,106      machine_clears.memory_ordering #    7.531 M/sec                    ( +- 20.07% )

           0.10468 +- 0.00459 seconds time elapsed  ( +-  4.38% )

Pinning both reader and writer to the logical cores of one physical core speeds it up a lot: on my system, cores 3 and 7 are siblings so Linux taskset -c 3,7 ./a.out stops the kernel from scheduling them anywhere else: 33 to 39 ref cycles per op, or 80 to 82 without pause.

(See: What will be used for data exchange between threads are executing on one Core with HT?)

$ clang++ -Wall -g -DSPIN='_mm_pause()' -DORDER=std::memory_order_release -O3 inter-thread.cpp -pthread && 
 taskset -c 3,7 perf stat --all-user -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,machine_clears.memory_ordering -r4 ./a.out
39 ref cycles per op. total ticks: 39085983 / 1000000 ops
37 ref cycles per op. total ticks: 37279590 / 1000000 ops
36 ref cycles per op. total ticks: 36663809 / 1000000 ops
33 ref cycles per op. total ticks: 33546524 / 1000000 ops

 Performance counter stats for './a.out' (4 runs):

             89.10 msec task-clock:u              #    1.942 CPUs utilized            ( +-  1.77% )
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               128      page-faults               #    0.001 M/sec                    ( +-  0.45% )
       365,711,339      cycles:u                  #    4.104 GHz                      ( +-  1.66% )
         7,658,957      branches:u                #   85.958 M/sec                    ( +-  0.67% )
        34,693,352      instructions:u            #    0.09  insn per cycle           ( +-  0.53% )
        84,261,390      uops_issued.any:u         #  945.680 M/sec                    ( +-  0.45% )
        71,114,444      uops_executed.thread:u    #  798.130 M/sec                    ( +-  0.16% )
                16      machine_clears.memory_ordering #    0.182 K/sec                    ( +-  1.54% )

           0.04589 +- 0.00138 seconds time elapsed  ( +-  3.01% )

On logical cores sharing the same physical core: best case ~5x lower latency than between cores, again for pause + mo_release. But the actual benchmark only completes in 40% of the time, not 20%.

  • -DSPIN='_mm_pause()' -DORDER=std::memory_order_release: ~33 to ~39 ref cycles / op, near-zero machine_clears.memory_ordering
  • -DSPIN='_mm_pause()' -DORDER=std::memory_order_seq_cst: ~111 to ~113 ref cycles / op, 19 total machine clears. Surprisingly the worst!
  • -DSPIN='' -DORDER=std::memory_order_release: ~81 to ~84 ref cycles/op, ~12.5 M machine clears / sec.
  • -DSPIN='' -DORDER=std::memory_order_seq_cst: ~94 to ~96 c/op, 5 M/sec machine clears without pause.

All of these tests are with clang++ which uses xchg for seq_cst stores. g++ uses mov+mfence which is slower in the pause cases, faster without pause and with fewer machine clears. (For the hyperthread case.) Usually pretty similar for the separate cores case with pause, but faster in the separate cores seq_cst without pause case. (Again, on Skylake specifically, for this one test.)


More investigation of the original version:

Also worth checking perf counters for machine_clears.memory_ordering (Why flush the pipeline for Memory Order Violation caused by other logical processors?).

I did check on my Skylake i7-6700k, and there wasn't a significant difference in the rate of machine_clears.memory_ordering per second (about 5M / sec for both the fast seq_cst and the slow release), at 4.2GHz. The "cycles per op" result is surprisingly consistent for the seq_cst version (400 to 422). My CPU's TSC reference frequency is 4008MHz, actual core frequency 4200MHz at max turbo. I assume your CPU's max turbo is higher relative to its reference frequency than mine if you got 340-380 cycles. And/or you have a different microarchitecture.

But I found wildly varying results for the mo_release version: with GCC 9.3.0 -O3 on Arch GNU/Linux, 5790 cycles per op for one run and 2269 for another; with clang 9.0.1 -O3, 73346 and 7333 for two runs (yes, really a factor of 10). That's a surprise. Neither version is making system calls to free/allocate memory when emptying / pushing the vector, and I'm not seeing a lot of memory-ordering machine clears from the clang version. With your original LIMIT, two runs with clang showed 1394 and 22101 cycles per op.

With clang++, even the seq_cst times are varying somewhat more than with GCC, and are higher, like 630 to 700. (g++ uses mov+mfence for seq_cst pure stores, clang++ uses xchg like MSVC does).

Other perf counters with mo_release are showing similar rates of instructions, branches, and uops per second, so I think that's an indication that the code is just spending more time spinning its wheels with the wrong thread in the critical section and the other stuck retrying.

Two perf runs, first is mo_release, second is mo_seq_cst.

$ clang++ -DORDER=std::memory_order_release -O3 inter-thread.cpp -pthread &&
 perf stat --all-user -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,machine_clears.memory_ordering -r1 ./a.out
27989 cycles per op

 Performance counter stats for './a.out':

         16,350.66 msec task-clock:u              #    2.000 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               231      page-faults               #    0.014 K/sec                  
    67,412,606,699      cycles:u                  #    4.123 GHz                    
       697,024,141      branches:u                #   42.630 M/sec                  
     3,090,238,185      instructions:u            #    0.05  insn per cycle         
    35,317,247,745      uops_issued.any:u         # 2159.989 M/sec                  
    17,580,390,316      uops_executed.thread:u    # 1075.210 M/sec                  
       125,365,500      machine_clears.memory_ordering #    7.667 M/sec                  

       8.176141807 seconds time elapsed

      16.342571000 seconds user
       0.000000000 seconds sys


$ clang++ -DORDER=std::memory_order_seq_cst -O3 inter-thread.cpp -pthread &&
 perf stat --all-user -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,machine_clears.memory_ordering -r1 ./a.out
779 cycles per op

 Performance counter stats for './a.out':

            875.59 msec task-clock:u              #    1.996 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               137      page-faults               #    0.156 K/sec                  
     3,619,660,607      cycles:u                  #    4.134 GHz                    
        28,100,896      branches:u                #   32.094 M/sec                  
       114,893,965      instructions:u            #    0.03  insn per cycle         
     1,956,774,777      uops_issued.any:u         # 2234.806 M/sec                  
     1,030,510,882      uops_executed.thread:u    # 1176.932 M/sec                  
         8,869,793      machine_clears.memory_ordering #   10.130 M/sec                  

       0.438589812 seconds time elapsed

       0.875432000 seconds user
       0.000000000 seconds sys

I modified your code with the memory order as a CPP macro so you can compile with -DORDER=std::memory_order_release to get the slow version.
acquire vs. seq_cst doesn't matter here; it compiles to the same asm on x86 for loads and atomic RMWs. Only pure stores need special asm for seq_cst.
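The macro plumbing is just something like this (a sketch of the idea, using the question's variable names):

// compile with -DORDER=std::memory_order_release to get the slow version
#ifndef ORDER
#define ORDER std::memory_order_seq_cst
#endif

// then inside writer() / reader():
_lock.store(false, ORDER);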

Also, you left out stdint.h and intrin.h (MSVC) / x86intrin.h (everything else). The fixed version is on Godbolt with clang and MSVC. Earlier I bumped LIMIT up by a factor of 10 to make sure the CPU frequency had time to ramp up to max turbo for most of the timed region, but I reverted that change so testing mo_release would only take seconds, not minutes.

Setting LIMIT to check for a certain total of TSC cycles might help it exit in a more consistent amount of time. That still doesn't count time where the writer is locked out, but on the whole it should make runs that take an extremely long time less likely.
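e.g. a sketch of that idea (TARGET_TICKS is a made-up constant you'd tune for your TSC frequency):

const uint64_t start = __rdtsc();
while (__rdtsc() - start < TARGET_TICKS) {   // run for a fixed number of reference cycles
    // ... one benchmark iteration ...
}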


You also have a lot of over-complicated stuff going on if you're just trying to measure inter-thread latency.

(See: How does the communication between CPU happen?)

You have both threads reading a _total that the writer updates every time, instead of just storing a flag when it's all done. So the writer has potential memory-ordering machine clears from reading that variable written by another thread.

You also have an atomic RMW increment of _counter in the reader, even though that variable is private to the reader. It could be a plain non-atomic global that you read after reader.join(), or even better it could be a local variable that you only store to a global after the loop. (A plain non-atomic global would probably still end up getting stored to memory every iteration instead of kept in a register, because of the release stores. And since this is a tiny program, all the globals are probably next to each other, and likely in the same cache line.)

std::vector is also unnecessary. __rdtsc() is not going to be zero unless it wraps around the 64-bit counter (footnote 2), so you can just use 0 as a sentinel value in a scalar uint64_t to mean empty, as in the sketch below. Or if you fix your locking so the reader can't re-enter the critical section without the writer having a turn, you can remove that check.
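A sketch of the sentinel idea under the existing lock (hypothetical; not the Godbolt code):

uint64_t slot = 0;   // protected by the lock; 0 means "empty" (footnote 2)

// writer, inside the critical section, replacing _vec.push_back(...):
slot = __rdtsc();

// reader, inside the critical section, replacing the _vec checks:
if (slot != 0) {
    const uint64_t latency = __rdtsc() - slot;
    slot = 0;        // mark empty again, replacing _vec.clear()
}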

Footnote 2: For a ~4GHz TSC reference frequency, that's 2^64 / (4×10^9) seconds, close enough to 2^32 seconds ≈ 136 years to wrap around the TSC. Note that the TSC reference frequency is not the current core clock frequency; it's fixed at some value for a given CPU, usually close to the rated "sticker" frequency, not max turbo.


Also, names with a leading _ are reserved at global scope in ISO C++. Don't use them for your own variables. (And generally not anywhere. You can use a trailing underscore instead if you really want.)



Source: https://stackoverflow.com/questions/61649951/c-latency-increases-when-memory-ordering-is-relaxed
