Do store instructions block subsequent instructions on a cache miss?

Question

Let's say we have a processor with two cores (C0 and C1) and a cache line starting at address k that is initially owned by C0. If C1 issues a store instruction to an 8-byte slot in line k, will that affect the throughput of the instructions that follow it on C1?

The Intel optimization manual has the following paragraph:

When an instruction writes data to a memory location [...], the processor ensures that it has the line containing this memory location is in its L1d cache [...]. If the cache line is not there, it fetches from the next levels using a RFO request [...] RFO and storing the data happens after instruction retirement. Therefore, the store latency usually does not affect the store instruction itself

With reference to the following code,

// core c1
foo();
line(k)->at(i)->store(kConstant, std::memory_order_release);
bar();
baz();

The quote from the Intel manual leads me to assume that, in the code above, the store behaves essentially like a no-op and does not affect the latency between the end of foo() and the start of bar(). In contrast, for the following code,

// core c1
foo();
bar(line(k)->at(i)->load(std::memory_order_acquire));
baz();

The latency between the end of foo() and the start of bar() would be affected by the load, because bar() depends on the result of the load.

This question is mostly concerned with how Intel processors (Broadwell or newer) behave in the case above and, in particular, with how C++ code like the above is compiled down to assembly for those processors.

Answer 1:

Generally speaking, for a store that is not soon read by subsequent code, the store doesn't directly delay that subsequent code on any modern out-of-order processor, including Intel.

For example:

foo();
*x = y;
bar();

If foo() doesn't modify x or y, and bar() doesn't load from *x, the store is independent: it may start executing before foo() completes (or even before foo() starts), and bar() may execute before the store commits to the cache, or even while foo() is still running.
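To make the contrast with the question's load case concrete, here is a minimal, self-contained sketch of both patterns. The names Line, store_then_work, and load_then_work are hypothetical and not from the question; Line is only a stand-in for its line(k). The comments about generated code describe what GCC and Clang typically emit for x86-64, which you should confirm for your own compiler and flags.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the question's line(k): eight 8-byte slots,
// aligned so that they share one 64-byte cache line.
struct alignas(64) Line {
    std::array<std::atomic<std::uint64_t>, 8> slots{};
};

// Store case: nothing after the store reads the slot, so the arithmetic
// below has no data dependency on it and need not wait for the line.
// On x86-64, a release store is typically compiled to a plain `mov`.
inline std::uint64_t store_then_work(Line& line, std::size_t i,
                                     std::uint64_t value, std::uint64_t a) {
    assert(i < line.slots.size());
    line.slots[i].store(value, std::memory_order_release);
    return a * 3 + 1;  // independent of the store
}

// Load case: the arithmetic consumes the loaded value, so it cannot
// execute until the load has returned data (possibly after a miss).
// On x86-64, an acquire load is also typically a plain `mov`.
inline std::uint64_t load_then_work(const Line& line, std::size_t i) {
    assert(i < line.slots.size());
    const std::uint64_t v = line.slots[i].load(std::memory_order_acquire);
    return v * 3 + 1;  // depends on the load
}
```

Both functions perform the same arithmetic; the only difference is whether that arithmetic depends on the memory access. That dependency, rather than the memory ordering argument, is what exposes the miss latency in the load case.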

While there is little direct impact, that doesn't mean there are no indirect impacts, and indeed the store may dominate the execution time.

If the store misses in cache, it may tie up off-core resources while the miss is satisfied. It also usually prevents subsequent stores from draining, which may become a bottleneck: if the store buffer fills up, the front end blocks entirely and new instructions no longer enter the scheduler.

Finally, everything depends on the details of the surrounding code, as usual. If that sequence is run repeatedly, and foo() and bar() are short, the misses related to the store may dominate the runtime. After all, buffering can't hide the cost of an unlimited number of stores. At some point you'll be bound by the intrinsic throughput of the stores.
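The buffering argument can be illustrated with a deliberately simple toy model. This is a sketch under loud assumptions, not a description of any real Intel core: every store misses, stores commit strictly in order and one at a time (real hardware can overlap several outstanding RFOs), and the function name issue_cycles and all parameter values are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Toy model: each iteration does `work_cycles` of independent work and then
// issues one store. Each store needs `commit_cycles` to commit, and commits
// are serialized in program order (a simplifying assumption). The core stalls
// only when all `capacity` store buffer entries are occupied.
// Returns the cycle at which the core has issued the last store.
inline std::uint64_t issue_cycles(std::uint64_t iterations,
                                  std::uint64_t work_cycles,
                                  std::uint64_t commit_cycles,
                                  std::size_t capacity) {
    assert(capacity > 0);
    std::deque<std::uint64_t> pending;  // commit times, oldest store first
    std::uint64_t now = 0;
    std::uint64_t last_commit = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        now += work_cycles;
        // Free the entries of stores that have already committed.
        while (!pending.empty() && pending.front() <= now) {
            pending.pop_front();
        }
        // Buffer full: stall until the oldest store commits.
        if (pending.size() == capacity) {
            now = pending.front();
            pending.pop_front();
        }
        // The new store commits after all earlier stores have committed.
        last_commit = std::max(now, last_commit) + commit_cycles;
        pending.push_back(last_commit);
    }
    return now;
}
```

In this model, issue_cycles(10, 100, 50, 4) returns 1000: the work between stores is long enough that the buffer never fills, so the misses are completely hidden. With short work, issue_cycles(10, 1, 100, 4) returns 601: after the first four stores the core advances only as fast as stores commit. A larger buffer only postpones that point; with enough iterations the run time is still bound by store throughput.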

Source: https://stackoverflow.com/questions/62419261/do-store-instructions-block-subsequent-instructions-on-a-cache-miss

Tags

c++

concurrency

x86

cpu-architecture

cpu-cache