Atomic compare, Multi-Processor, C/C++ (Linux)

Question

I have a variable x in shared memory on a multi-processor system.

void MyFunction(volatile int* x) {
  if (*x != 0) {
    // do something
  }
}

Other processes (possibly on different processors) will be writing to x using gcc built-in atomic operations such as __sync_bool_compare_and_swap.

I think I'm running into some cache concurrency issues where sometimes it takes a bit of time before x finally gets updated with the new value.

What I want is a kind of atomic_compare (without the swap), if such a thing exists? Or an "atomic read". What's the fastest way to do this? (avoiding mutexes, locks, etc.)

Thanks

Edit:

I just realized that a somewhat hackish workaround would be to use __sync_val_compare_and_swap with a comparison value that I know x can never hold. Would that solve the issue? (Is there a cleaner way?)

Answer 1:

The new C standard, C11, has _Atomic data types and operations to deal with this. The standard is not yet fully implemented, but gcc and clang are close and already provide the underlying functionality, of which the built-in __sync_bool_compare_and_swap is a part. I have wrapped that into a set of headers in P99 that let you program with the C11 interfaces already.

The C11 function to do what you want is atomic_load or, if you have particular requirements for the memory ordering (coherence), atomic_load_explicit. And, as you suspected, P99 maps that onto __sync_val_compare_and_swap(&x, 0, 0). If you look at the assembler that this generates, on most architectures it translates into a simple load operation when x is an int. But this is not guaranteed by the language; it is up to the compiler to know such things and to synthesize instructions that are guaranteed to be atomic.
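The options discussed above can be sketched side by side. This is a minimal sketch, assuming a compiler that ships <stdatomic.h> and the gcc/clang __atomic and __sync built-ins; the helper names are hypothetical, and acquire ordering is an assumption about what the reader needs, not something stated in the question.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* C11 interface: atomic load with explicit (acquire) ordering.
 * The helper name is hypothetical. */
static bool flag_is_set_c11(_Atomic int *x)
{
    return atomic_load_explicit(x, memory_order_acquire) != 0;
}

/* gcc/clang built-in that works on the question's plain `volatile int *`. */
static bool flag_is_set_builtin(volatile int *x)
{
    return __atomic_load_n(x, __ATOMIC_ACQUIRE) != 0;
}

/* The workaround from the question: comparing with 0 and swapping in 0
 * never changes the stored value, and the call returns what was there. */
static bool flag_is_set_cas(volatile int *x)
{
    return __sync_val_compare_and_swap(x, 0, 0) != 0;
}
```

All three return the same answer for the same value of x. The compare-and-swap variant is a read-modify-write operation, so it typically needs write access to the memory and is likely to cost more than a plain atomic load.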

Answer 2:

> What's the fastest way to do this? (avoiding mutexes, locks, etc.)

I'm pretty sure that you don't want to avoid mutexes. Linux's futexes let you leverage the goodness of compare-and-swap (most of the time) while keeping classic mutex semantics (the 'swap' that takes place is on the mutex itself, not on the code or data protected by it). I strongly suggest that you try them and profile the solution (perf, oprofile, VTune, etc.) to see whether your bottleneck is really the locking mechanism itself rather than things like cache utilization, memory throughput, CPU cycles, I/O access, or remote-node memory accesses.
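As a baseline to profile against, a mutex-protected version of x might look like the following. This is a minimal sketch, assuming glibc on Linux (where pthread mutexes are built on futexes) and assuming the struct is placed in memory shared between the processes, which is not shown here; struct shared_state and its helper functions are hypothetical names.

```c
#include <pthread.h>

/* Hypothetical layout: the mutex lives next to the value it protects,
 * inside the shared memory region. */
struct shared_state {
    pthread_mutex_t lock;
    int x;
};

/* Initialise the mutex so that it may be used by several processes.
 * Returns 0 on success or a pthread error number. */
static int shared_state_init(struct shared_state *s)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_mutex_init(&s->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    if (rc == 0)
        s->x = 0;
    return rc;
}

/* Read x under the lock. */
static int shared_state_read(struct shared_state *s)
{
    pthread_mutex_lock(&s->lock);
    int value = s->x;
    pthread_mutex_unlock(&s->lock);
    return value;
}

/* Write x under the lock. */
static void shared_state_write(struct shared_state *s, int value)
{
    pthread_mutex_lock(&s->lock);
    s->x = value;
    pthread_mutex_unlock(&s->lock);
}
```

In the uncontended case, locking and unlocking stay in user space, which is why measuring this version first is worthwhile before reaching for hand-rolled lock-free code.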

> I think I'm running into some cache concurrency issues where sometimes it takes a bit of time before x finally gets updated with the new value.

Well, let's assume you really do need to communicate between processors, you've measured the latency hit that you get from futexes, and you've determined that it won't meet your application's needs. In that case, a relatively sane way to proceed could be this: create an array of 32-bit integers whose entries are spaced apart by a distance greater than or equal to the size of your target's cache line. Use the currently executing CPU and the cache line size to index the real values in this array (so if your cache line is 64 bytes, you would scale the CPU number by 16 to leap over the padding). Write to each value only from its own CPU; you can poll it from any other CPU (you should probably execute your CPU's "pause" instruction in the body of the busy-wait). This would be an effective mechanism for checking whether different execution threads have reached or satisfied a given condition.

I should add that this will almost certainly work (effectively trading CPU efficiency for possibly lower latency), but it remains a very brittle solution on all but a very particular set of hardware.
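The padded-array scheme described above can be sketched roughly as follows. This is a sketch under stated assumptions: a 64-byte cache line, at most 64 CPUs, and a CPU number supplied by the caller (for example from sched_getcpu() on a pinned thread, not shown). All names are hypothetical. For use between processes the array would have to live in shared memory rather than in a static variable.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64  /* assumption: cache line size in bytes */
#define MAX_CPUS        64  /* assumption: upper bound on CPU numbers */
#define SLOT_STRIDE     (CACHE_LINE_SIZE / sizeof(int32_t))  /* 16 ints */

/* One 32-bit value per CPU, each on its own cache line. */
static _Alignas(CACHE_LINE_SIZE) _Atomic int32_t slots[MAX_CPUS * SLOT_STRIDE];

/* Scale the CPU number by the stride to leap over the padding.
 * The caller must pass cpu < MAX_CPUS. */
static size_t slot_index(unsigned cpu)
{
    return (size_t)cpu * SLOT_STRIDE;
}

/* Hint to the CPU that we are spinning. */
static void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#else
    __asm__ __volatile__("" ::: "memory");  /* compiler barrier only */
#endif
}

/* Called only from the CPU that owns the slot. */
static void mark_reached(unsigned cpu)
{
    atomic_store_explicit(&slots[slot_index(cpu)], 1, memory_order_release);
}

/* May be called from any CPU. */
static bool has_reached(unsigned cpu)
{
    return atomic_load_explicit(&slots[slot_index(cpu)],
                                memory_order_acquire) != 0;
}

/* Busy-wait until the given CPU has marked its slot. */
static void wait_until_reached(unsigned cpu)
{
    while (!has_reached(cpu))
        cpu_relax();
}
```

Because each slot has a single writer and sits alone on its cache line, pollers on other CPUs do not cause false sharing between slots. The busy-wait still burns a core while it spins.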

Answer 3:

> What I want is a kind of atomic_compare (without the swap), if such a thing exists? Or an "atomic read".

A compare is already atomic: it is a single read, and on mainstream hardware a read of a naturally aligned int is atomic.

If the latency between processors is already that bad, your code would probably benefit from some decoupling. That is, separate out the dependencies so that you don't rely on this sort of communication in your inner loops.

Source: https://stackoverflow.com/questions/11270732/atomic-compare-multi-processor-c-c-linux

Tags: c++, Linux, shared-memory, atomicity