memory-barriers

Why does GCC use mov/mfence instead of xchg to implement C11's atomic_store?

那年仲夏 submitted on 2021-02-18 20:56:35
Question: In "C++ and Beyond 2012: Herb Sutter - atomic<> Weapons, 2 of 2", Herb Sutter argues (around 0:38:20) that one should use xchg, not mov/mfence, to implement atomic_store on x86. He also seems to suggest that this particular instruction sequence is what everyone agreed on. However, GCC uses the latter. Why does GCC use this particular implementation? Answer 1: Quite simply, the mov and mfence method is faster as it does not trigger a redundant memory read like the xchg, which will take time. The x86 …
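As context for the discussion, a minimal sketch (my own, with a hypothetical function name) of the seq_cst store being asked about; the comments list the two x86-64 instruction sequences a compiler may emit for it.

#include <atomic>

std::atomic<int> x{0};

// Hypothetical helper: the sequentially consistent store the question is about.
void publish(int v) {
    x.store(v, std::memory_order_seq_cst);
    // Two instruction sequences a compiler can emit for the store above:
    //   mov  DWORD PTR x[rip], edi
    //   mfence                        ; plain store followed by a full barrier
    // or
    //   xchg DWORD PTR x[rip], edi    ; implicitly locked, acts as a full barrier
}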

Why is memory reordering not a problem on single core/processor machines?

末鹿安然 submitted on 2021-02-16 13:52:07
Question: Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions: x = 0; f = 0; Thread #1: while (f == 0); print x; Thread #2: x = 42; f = 1; I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors due to out-of-order execution. However I don't understand why this is not a problem on a single core machine, with those …
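To make the intended cross-thread ordering of that example explicit, here is a sketch of it rewritten with C++ threads and atomics (the variable names follow the question; the memory_order choices are my assumption about the intent):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0};
std::atomic<int> f{0};

void thread1() {                                         // Thread #1
    while (f.load(std::memory_order_acquire) == 0) { }   // spin until the flag is set
    std::printf("%d\n", x.load(std::memory_order_relaxed)); // guaranteed to print 42
}

void thread2() {                                         // Thread #2
    x.store(42, std::memory_order_relaxed);
    f.store(1, std::memory_order_release);               // publishes the write to x
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
}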

how are barriers/fences and acquire, release semantics implemented microarchitecturally?

半腔热情 submitted on 2021-02-16 12:57:07
Question: A lot of questions on SO and articles/books such as https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2018.12.08a.pdf, Preshing's articles such as https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ and his entire series of articles, talk about memory ordering abstractly, in terms of the ordering and visibility guarantees provided by different barrier types. My question is how are these barriers and memory ordering semantics …
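For concreteness, here is an illustrative release/acquire pairing written with explicit fences; the comments note the instructions commonly emitted for each fence (an assumption about typical GCC/Clang output on x86-64 and AArch64, not a guarantee):

#include <atomic>

int payload;
std::atomic<bool> flag{false};

void producer() {
    payload = 42;
    std::atomic_thread_fence(std::memory_order_release); // x86-64: no instruction; AArch64: dmb ish
    flag.store(true, std::memory_order_relaxed);
}

bool consumer(int& out) {
    if (!flag.load(std::memory_order_relaxed))
        return false;
    std::atomic_thread_fence(std::memory_order_acquire); // x86-64: no instruction; AArch64: dmb ishld
    out = payload;                                        // ordered after the producer's write
    return true;
}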

Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?

独自空忆成欢 submitted on 2021-02-10 07:12:33
Question: TL;DR: In a producer-consumer queue, does it ever make sense to put an unnecessary (from the C++ memory model viewpoint) memory fence, or an unnecessarily strong memory order, to get better latency at the expense of possibly worse throughput? The C++ memory model is implemented on hardware by emitting some sort of memory fence for the stronger memory orders and omitting it for the weaker ones. In particular, if the producer does store(memory_order_release), and the consumer observes the stored value with …
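A minimal sketch of the handoff the question describes (a release store later observed by an acquire load); whether adding a full fence or a seq_cst store on top of this actually improves latency is exactly what is being asked, so the stronger variant appears only as a comment:

#include <atomic>

std::atomic<int> slot{0};   // 0 means "no data yet"

void produce(int value) {   // value assumed non-zero
    slot.store(value, std::memory_order_release);   // sufficient for correctness
    // Stronger alternatives the question considers (not required for correctness):
    //   slot.store(value, std::memory_order_seq_cst);
    //   std::atomic_thread_fence(std::memory_order_seq_cst);
}

int consume_spin() {
    int v;
    while ((v = slot.load(std::memory_order_acquire)) == 0) { }  // spin until published
    return v;
}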

C++ latency increases when memory ordering is relaxed

情到浓时终转凉″ submitted on 2021-02-07 17:11:11
Question: I am on Windows 7 64-bit, VS2013 (x64 Release build), experimenting with memory orderings. I want to share access to a container using the fastest synchronization. I opted for atomic compare-and-swap. My program spawns two threads. A writer pushes to a vector and the reader detects this. Initially I didn't specify any memory ordering, so I assume it uses memory_order_seq_cst? With memory_order_seq_cst the latency is 340-380 cycles per op. To try and improve performance I made stores use …
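A sketch of the kind of synchronization described: a single writer publishes elements and a reader polls an atomic counter updated by compare-and-swap (names and structure are my assumption, not the question's actual code):

#include <atomic>
#include <vector>

std::vector<int> items(1024);        // pre-sized so the writer never reallocates
std::atomic<int> count{0};

void writer_push(int idx, int value) {
    items[idx] = value;
    int expected = idx;
    // The default CAS ordering is seq_cst; the question's experiment is to
    // weaken the success/failure orders (e.g. release/relaxed) and measure latency.
    count.compare_exchange_strong(expected, idx + 1,
                                  std::memory_order_release,
                                  std::memory_order_relaxed);
}

bool reader_poll(int last_seen, int& out) {
    int c = count.load(std::memory_order_acquire);
    if (c == last_seen)
        return false;
    out = items[c - 1];              // ordered after the element write by release/acquire
    return true;
}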

Why does Unsafe.fullFence() not ensure visibility in my example?

北城以北 submitted on 2021-02-05 07:09:47
Question: I am trying to dive deep into the volatile keyword in Java and have set up 2 testing environments. I believe both of them are x86_64 and use HotSpot. Java version: 1.8.0_232, CPU: AMD Ryzen 7 8-core; Java version: 1.8.0_231, CPU: Intel i7. Code is here: import java.lang.reflect.Field; import sun.misc.Unsafe; public class Test { private boolean flag = true; //left non-volatile intentionally private volatile int dummyVolatile = 1; public static void main(String[] args) throws Exception { Test t = new …
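The question's Java code is cut off above; to keep the examples in this collection in one language, here is a C++ analog of the same experiment shape (my construction, not the original code): a reader spins on a deliberately non-atomic flag while the writer clears it and issues a full fence. Because the racy read may legally be hoisted out of the loop by the compiler, the reader can spin forever no matter what fences the writer executes, which mirrors the behaviour being asked about.

#include <atomic>
#include <thread>

bool flag = true;              // deliberately plain, like the question's non-volatile field
std::atomic<int> dummy{1};     // analog of the question's volatile field, unused here

int main() {
    std::thread reader([] {
        while (flag) { }       // data race: the load may be hoisted, so this can spin forever
    });
    std::thread writer([] {
        flag = false;
        std::atomic_thread_fence(std::memory_order_seq_cst); // analog of Unsafe.fullFence()
    });
    writer.join();
    reader.join();             // with an optimizing build this join may never return
}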

Can memory stores really be reordered in an OoOE processor?

自古美人都是妖i submitted on 2021-02-04 16:12:48
Question: We know that two instructions can be reordered by an OoOE processor. For example, there are two global variables shared among different threads: int data; bool ready; A writer thread produces data and turns on the flag ready to allow readers to consume that data: data = 6; ready = true; Now, on an OoOE processor, these two instructions can be reordered (instruction fetch, execution). But what about the final commit/write-back of the results? i.e., will the stores be in-order? From what I learned, …
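A litmus-test-style sketch of what the question is asking: with relaxed atomics, can a reader ever observe ready == true while data is still 0? On x86 (TSO) stores become visible in program order, so the counter should stay at zero; on a weakly ordered CPU such as ARM it can end up non-zero. (This is my own test harness, not code from the question; the iteration count is arbitrary.)

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int>  data{0};
std::atomic<bool> ready{false};

int main() {
    long reordered = 0;
    for (int i = 0; i < 100000; ++i) {
        data.store(0, std::memory_order_relaxed);
        ready.store(false, std::memory_order_relaxed);

        std::thread writer([] {
            data.store(6, std::memory_order_relaxed);
            ready.store(true, std::memory_order_relaxed);    // same order as the question's code
        });
        std::thread reader([&] {
            if (ready.load(std::memory_order_relaxed) &&
                data.load(std::memory_order_relaxed) == 0)
                ++reordered;                                  // flag observed before the data store
        });
        writer.join();
        reader.join();
    }
    std::printf("reorderings observed: %ld\n", reordered);
}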